Month: January 2022

Verbs in Danish Dependency Treebank

The Danish Dependency Treebank (DDT) v. 1.0 was made by Matthias Trautner Kromann (Scholia) at Copenhagen Business School from 2002 to 2004 according to its README file. It is based on texts from the PAROLE corpus collected by Ole Norling-Christensen (Scholia) from 1983 to 1992. DDT’s XML files are available with the data.

Quick and dirty grepping in DDT on one of the XML files with

grep --text 'msd="V' ddt-1.0.tag | wc

reports 15,597 verbs in the dataset.

A bit more counting with

grep --text 'msd="V' ddt-1.0.tag | python3 -c "print('\n'.join(line.split('\"')[1] for line in open(0, encoding='iso-8859-1').readlines()))" | sort | uniq -c | sort -nr | wc

shows 1,551 unique verb lemmas. With the lemmas written to a file with

… | sort -u > ddt-verb-lemmas.txt

a sample of these is:

åbne, accelerere, acceptere, administrere, adskille, advare, ændre, ærgre, ætse, afblæse, afbryde, afdække, affærdige, affinde
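The same counting can be done in Python alone. The following is a minimal sketch, assuming (as the one-liner above does) that every verb line contains `msd="V` and that the lemma is the first double-quoted value on the line. The function names `count_verb_lemmas` and `write_verb_lemmas` are hypothetical, and writing the output as UTF-8 is my choice.

```python
from collections import Counter


def count_verb_lemmas(lines):
    """Count verb lemmas in lines from a DDT tag file.

    Assumes, like the shell one-liner, that a verb line contains
    'msd="V' and that the lemma is the first double-quoted value.
    """
    counts = Counter()
    for line in lines:
        if 'msd="V' in line:
            counts[line.split('"')[1]] += 1
    return counts


def write_verb_lemmas(tag_filename, out_filename):
    """Write the unique verb lemmas of a tag file, one per line."""
    with open(tag_filename, encoding='iso-8859-1') as f:
        counts = count_verb_lemmas(f)
    with open(out_filename, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sorted(counts)) + '\n')
    return counts
```

With `counts = write_verb_lemmas('ddt-1.0.tag', 'ddt-verb-lemmas.txt')`, `sum(counts.values())` is the number of verb tokens and `len(counts)` the number of unique lemmas.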

How many of these lemmas are in Wikidata, which currently has over 2,900 Danish verbs?

ddt_verb_lemmas = set(open('ddt-verb-lemmas.txt').read().split())

query = """
SELECT ?lemma {
  ?lexeme dct:language wd:Q9035 ;
          wikibase:lemma ?lemma ;
          wikibase:lexicalCategory wd:Q24905 .
}
"""
import requests
url = "https://query.wikidata.org/sparql"
data = {'query': query, 'format': 'json'}
response = requests.get(url, params=data)
data = response.json()['results']['bindings']
wikidata_verb_lemmas = {datum['lemma']['value'] for datum in data}

len(ddt_verb_lemmas - wikidata_verb_lemmas)

gives 269 missing verbs in Wikidata:

affærdige, barrikadere, bedyre, bekomme, belejre, belægge, belæsse, bemale, bemyndige, berettige, berømme, bestjæle, bestorme, betræde, beære, bilægge, bivåne, blusse, blødgøre, brodere, budgettere, bugne, defilere, demoralisere, deportere, destillere, destruere, detronisere, diktere, doble, drapere, dvæle, ekskludere, ekstraindkalde, fabrikslukke, fartbegrænse, fastansætte, fejl-operere, fetere, filosofere, fin-sortere, flade, flagre, flakke, flakse, flamme, flintre, flirte, focusere, foragte, forankre, foranledige, forbeholde, fordufte, forføre, forkalke, forlige, forlise, forlove, formilde, forpligtige, forrykke, forskyde, fortrække, fortvivle, forudsige, fradømme, fralægge, frankere, fraråde, fraskrive, frasortere, fratræde, fremholde, frikende, friske, frustrere, fuldføre, fusionere, fyge, fænge, geare, gennembanke, gennemopleve, gennemprøve, genoptrykke, gestikulere, gispe, gjalde, glatte, gløde, gravere, grunde, gyse, hage, havne, hefte, hegle, henholde, henkaste, hentyde, henvide, humpe, huse, hverve, hvine, hvisle, hyre, hæge, hærde, iblande, illudere, improvisere, indgribe, indhylle, indhøste, indkassere, indkvartere, indoktrinere, indordne, indpakke, indskyde, indsmugle, indstemme, indstifte, indvie, indvælge, jævnføre, kanonere, kante, kikke, kime, kimse, knirke, knuge, kolportere, kommandere, kompromittere, konkretisere, kriminalisere, krone, kue, kuldsejle, kvie, legalisere, lirke, lue, lædere, lække, læsetræne, læspe, løje, løsne, løsrive, medfølge, mestre, mishandle, modsætte, mæske, mønstre, mørkelægge, narre, neddæmpe, nedsable, nedsænke, nedværdige, nidstirre, niveaudele, nyinvestere, nødlande, oparbejde, opdigte, opfange, opfølge, ophidse, oplagre, oplive, opstøve, opsøge, optræne, opøve, overbeglo, overfylde, overhælde, overrende, overrumple, overskrue, overvurdere, passivisere, pervertere, plette, plombere, praktisere, prissætte, profitere, programmere, pryde, præferere, prøvekøre, puffe, pulverisere, påbyde, påklage, pønse, rafle, rappe, 
rasere, ratificere, rivalisere, rumstere, røbe, røge, sagsøge, sammenstille, sammenstykke, sammentømre, sample, samstemme, scanne, ses, signere, simre, skeje, skippe, sladre, slentre, slynge, sløve, smashe, smugle, snage, spekulere, spraye, stakke, stationere, sukke, surmule, svimle, sønderlemme, tackle, tangere, taxie, tilbede, tildanne, tilsende, tilstå, time, tippe, tjekkere, toppe, trisse, trone, trygle, tv-annoncere, ty, tøffe, udfritte, udglatte, udlære, udmærke, udskære, udstyre, udtørre, ulovliggøre, vandkæmme, virkeliggøre, våge, vænne, værdsætte

Some of these are homonyms with other words: flade, flamme, friske, glatte, grunde, hage, havne, huse, krone, mestre, mønstre, time, toppe.

Three have alternative lemma spellings: indvi/indvie, kigge/kikke, tackle/takle. These verbs are already in Wikidata under the other spelling, so only 266 verbs are actually missing in Wikidata.

One is an independent deponent verb: ses

There are at least two errors in DDT: tjekkere, henvide. They occur in the sentences “Der er også radikale tjekkere, som må ændring holdning.” and “Det så vældig godt ud, der var også udleveret henved 100 fakler.” Here “tjekkere” is a noun (Czechs) and “henved” is an adverb (approximately), so neither is a verb. That leaves 264 missing verbs in Wikidata.
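The corrections above are plain set arithmetic: 269 − 3 − 2 = 264. As a small sketch, with the hypothetical helper `truly_missing` and a hand-made mapping of the three spelling variants named above:

```python
def truly_missing(ddt_lemmas, wikidata_lemmas, alternates, errors):
    """Return DDT lemmas that are genuinely absent from Wikidata.

    alternates maps a DDT spelling to another spelling of the same verb;
    errors holds DDT lemmas that are annotation mistakes, not verbs.
    """
    missing = set(ddt_lemmas) - set(wikidata_lemmas)
    return {
        lemma for lemma in missing
        if lemma not in errors
        and alternates.get(lemma) not in wikidata_lemmas
    }


# Spelling variants and annotation errors named in the text above.
alternates = {'indvie': 'indvi', 'kikke': 'kigge', 'tackle': 'takle'}
errors = {'tjekkere', 'henvide'}
```

Calling `len(truly_missing(ddt_verb_lemmas, wikidata_verb_lemmas, alternates, errors))` applies both adjustments in one step.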

Unusual Danish words

ae (to caress): A word consisting of only two vowels.

bagagebærer (bicycle luggage carrier): Compound of bagage+bærer. The part of the bike that can carry goods. Viewed in isolation, the second part is a (human) nomen agentis, but in the compound it denotes something non-human.

bringe (the verb to bring): Danish conjugation adds suffixes or, for some verbs, changes a vowel. For this word, some conjugated forms both change the vowel and drop a consonant (bringe, bringer, bragte, bragt).

bærebar (portable): compound adjective with two parts from the same Indo-European root.

fise (to fart): verb with three different conjugation types. For instance, active voice preterite may be fiste, fisede or fes.

fyr (guy, pine tree, heating unit/lighthouse): three different noun lexemes. There is also a verb with an inflected form spelled “fyr”.

led: a representation associated with many lexemes. Ordia shows 9 different forms from six different lexemes: mean, suffered, suffer, search, joint, joints, side, gate, gates.

hænge (hang): Two types of conjugation that determine whether the verb is transitive or intransitive.

højttaler/højtaler (loudspeaker): One of the few Danish lexemes with multiple lemmas, differing by an extra “t”.

institutionalisere (institutionalize): Perhaps the longest Danish word with one root, though according to English Wiktionary the Latin origin is from in- and statuo. Other words of this kind are rekontekstualisere and repræsentantskab.

menneskerettighedsorganisation (human rights organization): The longest Danish word in common usage. It appears in Den Danske Ordbog.

ondskabsfuldhed (malice): long Danish word from one root and three derivations: adjective to noun to adjective to noun.

rødhals (Erithacus rubecula): Compound of rød and hals, red neck. It is not a hyponym of neck, but a particular bird species. Many Danish plant and animal names break the pattern in which the first compound part specifies a type of the second.

sandkassesand (sandbox sand): Compound noun with the same noun twice (sand+kasse)+sand. This is a common product, see, e.g., certified “sandkassesand” from Coop.

service (tableware, service): A loanword that has come in twice, once from French and once from English. The two have different grammatical gender: the French loanword is neuter, while the English one is common gender.

sigte (aim, charge, sieve): A lemma with three different verb lexemes. There are furthermore two related noun lexemes (aim, sieve).

skrivelse (a writing, note): A verbal noun that is “innexual”. Danish verbal nouns usually denote an activity. Skrivelse is not the only one of this kind.

værdipapirfinansieringstransaktionseksponering (securities financing transaction exposures): Danish word identified as the longest word yet found in an authoritative text. It appears in European Union legislation: “1.9.3. Endvidere bør de nuværende regler i kapitalkravsforordningen til fastsættelse af løbetidsparametret udstrækkes til at omfatte derivat- og værdipapirfinansieringstransaktionseksponeringer samt transaktioner uden fast løbetid.”

øl (beer): Can have two genders (common and neuter), with a slight semantic difference between the two, corresponding to the English “one beer” and “some beer”.

Danish verbs in Wikidata and DanNet

Besides finding Danish verbs by looking through Danish corpora (see, e.g., Finding Danish verbs in Europarl with Python), you can also find Danish verbs in lexical resources.

Danish verbs are available in DanNet, the Danish wordnet. They can be extracted from the RDF/XML files, here with a Python script relying on rdflib and a SPARQL query:

from rdflib import Graph

filenames = "words part_of_speech".split()

g = Graph()
for filename in filenames:
    g.parse(filename + '.rdf')

query = """
SELECT ?word ?representation {
  ?word dn_schema:partOfSpeech "Verb" .
  ?word wn20schema:lexicalForm ?representation .
}""" 

result = g.query(
    query,
    initNs={
        'dn': 'http://www.wordnet.dk/owl/instance/2009/03/instances/',
        'dn_schema': 'http://www.wordnet.dk/owl/instance/2009/03/schema/',
        'wn20schema': 'http://www.w3.org/2006/03/wn/wn20/schema/',
    })

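n = 0
for row in sorted(result, key=lambda row: row[1]):
    # Skip multi-word entries such as "forvente sig" or "abe efter"
    if ' ' not in row[1]:
        n += 1
        # [58:] strips the leading part of the word URI, leaving the identifier
        print("{:11} {}".format(str(row[0])[58:], row[1]))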

print("Total number of single-word verbs: {}".format(n))

The full list returns 5581 entries. A number of these verbs include the reflexive “sig” (e.g., “forvente sig”, “foræde sig”) or adverbs (e.g., “abe efter”, “arbejde over”). Excluding these (' ' not in row[1]) one ends up with 3994 single words. Among those words are a number I do not recall having run into before: pizzikere, afhjelme, indsukre, …
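To see how the multi-word entries divide between reflexive verbs and other combinations, the representations can be grouped by shape. This is a rough sketch: the function name `group_verb_forms` is hypothetical, and treating any entry that contains the word “sig” as reflexive is a simplifying assumption of mine, not a DanNet category.

```python
from collections import defaultdict


def group_verb_forms(representations):
    """Group verb representations by their surface shape.

    'single' has no space, 'reflexive' contains the pronoun "sig",
    and 'multiword' covers everything else (e.g., verb plus adverb).
    """
    groups = defaultdict(list)
    for representation in representations:
        words = str(representation).split()
        if not words:
            continue  # ignore empty representations
        if len(words) == 1:
            groups['single'].append(words[0])
        elif 'sig' in words:
            groups['reflexive'].append(' '.join(words))
        else:
            groups['multiword'].append(' '.join(words))
    return dict(groups)
```

With the rdflib result from the script above, `group_verb_forms(row[1] for row in result)` gives the three lists, and `len()` of each gives the corresponding counts.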

In Wikidata, the DanNet word identifier may be entered using the P6140 property.

Not all Danish verbs are in DanNet. At the beginning of January 2022, there were 2862 Danish verbs in Wikidata. This number ranks Danish as number 11 in terms of number of verbs entered; see Ordia’s statistics, where Indonesian is ahead with 12,781 verbs, followed by Estonian with 7,932. In Wikidata, I have marked verbs not available in DanNet with Wikidata’s novalue, so they can be queried with the Wikidata Query Service and this SPARQL query:

SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 ;
          a wdno:P6140 .
}

In January 2022, there were around 580 Danish verbs in Wikidata that were marked as not being in DanNet, e.g., dibse, indebære, erfare, misforstå and opgradere. Such words are fairly common, unlike DanNet’s “afhjelme”, a word that can cause confusion according to a text on the Danish parliament website.

Not all of Wikidata’s Danish verbs indicate whether the word is in DanNet or not. The Wikidata Query Service can tell us about these words:

SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 .
  FILTER NOT EXISTS { ?lexeme p:P6140 [] }
}

As of 6 January 2022, there were 387 Danish verbs where the DanNet status was not indicated. This may be because I forgot to enter it, made a typo, or found the case difficult. As an example of a difficult case, consider the word “sigte”. It has three DanNet verb homograph lexemes (aim, charge, sieve). As far as I can tell, the Wikidata lexeme L207570 had mixed these lexemes through links to the lexemes tilsigte and sigtelse.
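Either of the two SPARQL queries above can also be run from Python rather than from the query service's web interface. The sketch below uses only the standard library. The names `extract_lemmas` and `run_query` and the User-Agent string are hypothetical placeholders, and the User-Agent should be replaced with one that identifies your own script.

```python
import json
import urllib.parse
import urllib.request

WDQS_URL = "https://query.wikidata.org/sparql"


def extract_lemmas(response_json):
    """Return the set of ?lemma values in a SPARQL JSON result."""
    bindings = response_json['results']['bindings']
    return {binding['lemma']['value'] for binding in bindings}


def run_query(query, user_agent="danish-verbs-example/0.1"):
    """Send a SPARQL query to the Wikidata Query Service.

    The user_agent value is a placeholder; replace it with one
    that identifies your own script.
    """
    url = WDQS_URL + '?' + urllib.parse.urlencode(
        {'query': query, 'format': 'json'})
    request = urllib.request.Request(
        url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With `query` holding one of the SPARQL strings, `extract_lemmas(run_query(query))` returns the lemmas. Because homographs such as “sigte” share a lemma, the set can be smaller than the number of lexemes returned.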