Linking from Danish Wikidata lexemes to COR
As I have previously reported, the Danish word registry, Det Centrale Ordregister, was launched in May 2022.
The words are identified by COR identifiers, mimicking the Danish CPR (identifier for Danish persons) and CVR (identifier for Danish organizations and companies). There are now tentative URLs for each lexeme COR. For instance, the lexeme “bankdirektør” (bank manager) has the tentative URL https://ordregister.dk/id/COR.58789/.
In Wikidata, I suggested two properties for the COR identifiers: one for the lexemes and one for the forms. These two properties have now been accepted and are available as P10830 (for a form) and P10831 (for a lexeme). 20 June 2022 statistics in Ordia showed that we now have 44 form CORs and 21 lexeme CORs represented in Wikidata. There are now several hundred. These have been entered by me manually. Version 0.9 of COR has 516,017 form CORs, so entry of the data should be automated if we want to reach good coverage. So far the data entry has been mostly to determine which problems one would run into in the mapping between Wikidata lexemes and COR. And there is quite a lot of thought that needs to go into the “ontology alignment” between COR and Wikidata. Based on the part-of-speech tags here are some notes:
- Article (art): There is currently only one listed in COR: “den” COR.00267. This singleton seems to be an error or a strangeness that I do not understand. We have “en”, “et”, “den”, “det” and “de”. In Wikidata we currently have 3 Danish articles: en/et, “den” and “det”. The en/et aggregation follows Den Danske Ordbog, which aggregates “en” and “et”. Den Danske Ordbog has “den”, “det” and “de” as articles, but as separate lexemes. So it seems that we in Wikidata have (partially) followed the inconsistency in Den Danske Ordbog by merging “en” and “et” (indefinite articles) and splitting “det” and “den” (the definite articles). Could it be argued that all should be one lexeme? Retskrivningsordbogen has one lexeme for den/det/de and one lexeme for en/et. Here there seems to be a reasonable consistency. I suspect we will see an update of COR to mirror Retskrivningsordbogen.
- Infinitive marker (“infinitivens”, infinitivens partikel, infinitivens mærke): There is only one: the word “at”. There is a homograph, the conjunction “at”. There is a Wikidata item for the part-of-speech concept: infinitive marker Q85103750. The Wikidata lexeme for the “at” conjunction has been there for a while with L34817 and linked to COR.00145. And now there is a Wikidata lexeme for the “at” infinitive marker as L678570 and linked to COR.00292. Thus this part-of-speech class is complete.
- Formal subject (fsubj):
- Onomatopoeia (lydord): 36 onomatopoeia forms are recorded in COR. Wikidata has had 42 Danish onomatopoeia lexemes. Wikidata has all the COR onomatopoeia, and they are linked.
- For instance, Wikidata’s “atju“, “vuf” and “kvæk” do not appear in COR.
- There is only one form for each lexeme, except for the cat sound “miav” which has the forms “miav” and “mjav” (Ordia).
- A problem is to determine what they mean. For instance, what does “sum” mean? Could it correspond to the English “buzz” or humming…!? Nor do I know what kind of sound “bums” is.
- Prefix (præfiks): There are 59 prefixes in COR. They are represented with one form each. In Wikidata, prefixes are currently mostly represented as affixes or as morphemes. Some of these are regarded as instances of “prækonfiks”, see for instance, “øko-“. In Den Danske Ordbog, “øko-” is recorded as prefix. The type of prefixes that are not recorded in COR is, e.g., “for-“, “u-” and “be-“. Most of the COR prefixes are what has been termed kryptorod/confix or skabsaffiks/pseudoaffix, see, e.g., Substantiviske Komplekse Ord Med Subkonfikser I Moderne Dansk. Though the lexical category does currently not align between COR and Wikidata, it does not seem to matter for the individual linking. Currently, Wikidata does not record a form for prefixes. My reason for that was that the prefixes are not materialized in real words, – only through derivations.
- Conjunction (konj):
- 64 conjunction forms and 62 conjunction lexemes in COR and 66 Danish conjunction lexemes in Wikidata.
- Four conjunction forms in COR are from two lexemes imedens/imens and mens/medens. Mens and medens were split in Wikidata. Imedens and imens were not represented as conjunctions in Wikidata. They are all linked now.
- “omend” has lemma “om end” and form “omend” in COR. Why I do not know.
- The same is the case with “selvom” where the lemma is “selv om”.
- “dels” is in COR regarded as an adverb. In Den Danske Ordbog dels is a conjunction.
- “plus at” is not in COR.
- “hvorimod” is an adverb in COR and in Den Danske Ordbog. In some other works it is regarded as a “subordinating conjunction” or a “concessive conjunction”.
- The same is the case with “hvor”. In COR and Den Danske Ordbog it is an adverb. In Quasi-synonymy of Danish temporal conjunctions from the anthropocentric point of view it is referred to as a temporal conjunction. There is already a “hvor” adverb in Wikidata.
- How do we fix this? The “hvor” can be merged in Wikidata. For “hvorimod” the lexical category in Wikidata can be changed to adverb and for “dels” Wikidata could somehow note that COR and Den Danske Ordbog disagree.
- Prepositions (præp):
- 96 preposition forms in COR. They have all been added to Wikidata and linked to COR. So this class is complete.
- COR prepositions only have one form.
- “ad” is homographic with two versions, – one from Latin.
- Wikidata has currently “henover” as a preposition. That preposition is not found in COR. Apparently it has not been affected by the so-called 2012 rule.
- Bokmål currently has more prepositions (106) than Danish in Wikidata. vis-a-vis is entered as two different lexemes with variation a/à. There are also some words such as østfra, østover, vestfra, … in Bokmål that are not in Danish. In Den Danske Ordbog the corresponding Danish words are “only” adverbs, see, e.g., østover.
- Pronouns (pron): 101 pronoun forms in COR.
- “som” is present in Den Danske Ordbog but not in COR.
- I suspect there are many issues here. I have not yet looked into the lexical category.
- Interjections (udråbsord): 147 interjection forms in COR.
- Phrases (flerord): 196 phrase forms in COR.
- Numerals (talord): 238 numeral forms in COR.
- COR numerals come with two forms: normal and genitive. Wikidata had so far not recorded a genitive version of Danish numerals.
- “kolon” (kolon): 269 of these forms in COR.
- Abbreviations (fork): 559 abbreviation forms in COR.
- Abbreviations in COR may be recorded with both upper and lower case versions, e.g., ADHD and adhd.
- Abbreviations may have a genitive. This includes units such as “A” for ampere, as well as “ml.”, which is the abbreviation for mellem (between), and mellem does not have a genitive (in English it would correspond to between’s?). This seems strange.
- There is usually no explanation for the abbreviations.
- Adverb (adv): 904 adverb forms in COR.
- “hvorimod” is regarded as an adverb, while other works regard it as a conjunction, see Wikidata references at L42250.
- Proper nouns (prop): 1,388 proper noun forms in COR. These are mostly geographical entities.
- The proper nouns come with normal form and a genitive form.
- There are some surprising entries: I, L, M and V. Roman numerals I suppose? Why?
- Proper nouns can have alternative forms, e.g., Ålborg/Aalborg.
- Verb (vb): 79,533 verb forms in COR.
- There are passive indefinite verb forms in COR. These have not been entered in Wikidata. They have the same form as the passive finite present forms that are already in Wikidata.
- In COR, skryde has two past tense forms in active: skrydede and skrød. But in passive there is only skrydedes, not skrødes. And there is no supinum form for the irregular form.
- Perfectum participium in its adjective function is listed under the verb lexeme. It is not distinguished from a supinum function.
- Common verbs have and være have plural and definite perfectum participium forms listed: hafte and værede. They sound strange to me.
- Adjectives (adj): 92,900 adjective forms in COR.
- Among the forms are some highly unusual ones, e.g., “aproposere” and “aproposeste”. In Retskrivningsordbogen “apropos” is regarded as an uninflectable adjective. In Den Danske Ordbog it is not even an adjective. Another example is værd which is listed with forms such as værdere and værdest.
- Even though the common gender and the neutrum forms are the same, they are listed as separate. This is currently not done for the Danish adjectives in Wikidata.
- Perfectum participium verb forms are usually not regarded as adjectives, but sometimes they are. A word such as “betinget” is both reported as a “perf.part” verb form and as a separate adjective. The perf.part verb form has only one form for singular indefinite while the adjective form distinguishes between a neutrum and common gender form even though they are the same.
- Adverbs derived from the adjectives are listed under the adjective lexeme.
- Nouns (sb): 339,523 noun forms in COR.
- COR comes with genitive forms that are currently not in Wikidata. This decision was based on one user arguing that the Danish genitive is not a real genitive but an enclitic. We should probably change that in Wikidata, so the genitive forms of nouns are recorded.
- Genitive forms in COR are marked as genitive, but non-genitive forms are not marked.
- “druk” is an example of a word where it is difficult to know whether it is a common gender or a neutrum word as no article is used for the word. Only through adjectives or co-reference might it be revealed. COR records the form with two different identifiers: one for the common gender and one for the neutrum.
- “kirsebær” (cherry) is recorded as two different lexemes: one for common gender, the other for neutrum gender. They distinguish between the tree and the berry. In Den Danske Ordbog it is one lexeme and the difference is explained.
- Many kentaur nouns (words such as råben, skrigen, løben, …) are not recorded in COR, – neither as separate nouns nor as conjugations of a verb.
- The few kentaur nouns that are recorded come with a genitive form. This is odd. “Grammatik over det danske sprog” states they have no genitive form.
- The noun “A”/”a” has two different lexemes: one for the uppercase and one for the lowercase version. This is the same for all letters. I do not see why uppercase and lowercase letters should be split across lexemes.
Other problems:
How to represent alternative forms, e.g., mørklægge/mørkelægge or højtaler/højttaler. In Wikidata, they are recorded as separate forms and linked individually to their corresponding COR identifier. The “alternative form” Wikidata property is used to link the two spelling variations.
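If the data entry is to be automated, one early step could be to relate a form identifier to its lexeme identifier. The following is only a minimal sketch: it assumes that a form identifier such as COR.56543.110.01 starts with the lexeme identifier (COR.56543), and that the tentative URL pattern shown for COR.58789 holds for other lexemes as well. The function name is hypothetical, and neither assumption is confirmed by the COR documentation here.

```python
def split_cor_form_id(form_id):
    """Return (lexeme id, tentative lexeme URL) for a COR form identifier.

    Assumption: a form identifier has four dot-separated parts,
    "COR.<lexeme>.<form>.<variant>", and the first two parts make up
    the lexeme identifier.
    """
    parts = form_id.split(".")
    if len(parts) != 4 or parts[0] != "COR":
        raise ValueError(f"Unexpected COR form identifier: {form_id!r}")
    lexeme_id = ".".join(parts[:2])
    # Assumption: the tentative URL pattern is the same for all lexemes
    url = f"https://ordregister.dk/id/{lexeme_id}/"
    return lexeme_id, url


lexeme_id, url = split_cor_form_id("COR.56543.110.01")
print(lexeme_id, url)
```

Raising an error for anything that does not look like a four-part identifier is a cautious choice, so that rows with an unexpected layout are not silently linked to the wrong lexeme.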
More unusual Danish words
I have previously written about unusual Danish words. Here are a few more.
angstskrig: As noted the Danish word listed in Den Danske Ordbog with most consecutive consonants: ngstskr. Words that have six consecutive consonants in Wikidata are pattebarnssprog, hårdbundsstruktur and sagsbehandlingsskridt, rnsspr, ndsstr and ngsskr, according to a Wikidata Query Service search. They are all compounds. gesandtskab and bekendtskab are non-compound words with five consecutive consonants.
detailplanlægning: The first part of the compound is detalje, but in the compound it has a different form. Den Danske Ordbog has both detailplanlægning and detaljeplanlægning. Den Danske Ordbog does not have detail as a separate word, – only as the prefix detail-.
druk: A noun with an unclear grammatical gender as no article is usually used for the word. Den Danske Ordbog records it as either common gender or neutrum.
mødre: plural of moder, but also of the short form mor. In Wikidata, it was originally entered (probably by me) as two separate lexemes, but they are now merged to a single lexeme.
niveauet/niveauer/niveauerne: Inflections of the Danish word niveau and they have the most consecutive vowels – found so far by searching with Wikidata Query Service.
tippe: Den Danske Ordbog regards this verb as two lexemes: 1 and 2. They are both recorded as coming from English. Likewise, the derived noun tipning is also regarded as two lexemes. Retskrivningsordbogen only has one tippe.
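The consonant and vowel runs mentioned above were found with the Wikidata Query Service. For a word list on disk, the same kind of check can be sketched in Python. This is only an illustration: the function name is made up, and it assumes the Danish vowels are a, e, i, o, u, y, æ, ø and å, with every other letter counted as a consonant.

```python
import re

# Assumption: these nine letters are treated as the Danish vowels
VOWELS = "aeiouyæøå"


def longest_run(word, vowels=False):
    """Return the longest run of consecutive consonants (or vowels)."""
    if vowels:
        pattern = f"[{VOWELS}]+"
    else:
        # Letters that are neither vowels, digits nor underscores
        pattern = f"[^{VOWELS}\\W\\d_]+"
    runs = re.findall(pattern, word.lower())
    return max(runs, key=len, default="")


for word in ["angstskrig", "pattebarnssprog", "gesandtskab"]:
    print(word, longest_run(word))
print("niveauerne", longest_run("niveauerne", vowels=True))
```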
Coverage of Det Centrale Ordregister for technical reports
How well does Det Centrale Ordregister (COR), the Danish national word register, cover words in a corpus of technical reports? Words with the stem “påvirk” are interesting in terms of our DREAMS project. In the project, we process Danish environmental impact assessment reports, and “påvirk” is the stem corresponding to the English word “impact”.
“påvirk” words from the COR database can be extracted with
grep "påvirk" ro2021-0.9.cor | awk -F'\t' '{print $1, $5}'
One finds 86 words (forms) matching “påvirk”, with examples:
COR.56543.110.01 g-påvirkning
COR.56543.111.01 g-påvirkningen
COR.56543.112.01 g-påvirkninger
COR.56543.113.01 g-påvirkningerne
COR.56543.114.01 g-påvirknings
COR.56543.115.01 g-påvirkningens
COR.56543.116.01 g-påvirkningers
...
COR.22506.311.01 upåvirkeligst
COR.21653.300.01 upåvirket
COR.21653.301.01 upåvirket
COR.21653.302.01 upåvirkede
COR.21653.303.01 upåvirkede
COR.21653.309.01 upåvirket
Some oddities are “letpåvirkeligere” and “upåvirkeligst”. Google search returns practically no examples on the Internet for such words. One sole example is “…i en endnu letpåvirkeligere alder…“.
There are a few compounds: g-påvirkning, LSD-påvirket, narkotikapåvirket, and spirituspåvirket.
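The grep and awk extraction can also be sketched in Python, which makes it easier to reuse the result later. This is a sketch under assumptions: the function name is hypothetical, the identifier is taken from the first tab-separated column and the form from the fifth (as in the awk command), and the sample rows use "-" as placeholders for columns whose content is not described here.

```python
def cor_forms(lines, substring):
    """Yield (identifier, form) for COR rows whose form contains substring."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Column 1 is the identifier and column 5 the form (awk's $1 and $5)
        if len(parts) > 4 and substring in parts[4]:
            yield parts[0], parts[4]


# Illustrative rows only; the "-" columns and the last row are placeholders
sample = [
    "COR.56543.110.01\t-\t-\tsb\tg-påvirkning\n",
    "COR.21653.300.01\t-\t-\tadj\tupåvirket\n",
    "COR.00000.000.00\t-\t-\tsb\teksempel\n",
]
for identifier, form in cor_forms(sample, "påvirk"):
    print(identifier, form)
```

Unlike grep, which matches anywhere in the line, this sketch only looks at the form column. With the real file one would pass open("ro2021-0.9.cor", encoding="utf-8") instead of the sample list, assuming the file is UTF-8 encoded.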
As explained in Extracting and counting variations of a word with a subword in a corpus, words from the DREAMS project corpus with the stem “påvirk” can be extracted with
cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n
There are 543 words (forms) with “påvirk”, including spelling errors and/or PDF extraction errors, for instance, “detteafsnitbeskriveshvilketrafikpåvirkninger” and “påvirknng”. There are many compounds. An excerpt is:
1 vibrationspåvirknin
1 vilpåvirke
1 vindmiljøpåvirkningen
1 vindmøllerspåvirkningaf
1 vindpåvirk
1 vindpåvirkningsområde
1 vurderingafpåvirkning
1 ændretvandpåvirkning
2 ammoniakpåvirkninger
2 anlægspåvirkninger
2 arbejdsmiljøpåvirkninger
...
9 klimapåvirkningsgraden
9 miljøpåvirket
9 temperaturpåvirkninger
9 vindpåvirkningerne
10 forureningspåvirkning
10 kulturpåvirkede
10 kulturpåvirket
10 påvirkelig
10 påvirkende
...
1550 påvirkningerne
1699 miljøpåvirkning
2405 påvirker
3858 påvirkes
4130 miljøpåvirkninger
6539 påvirket
8483 påvirkningen
9664 påvirke
9876 påvirkninger
25630 påvirkning
Here the central noun form “påvirkning” appears 25,630 times in the corpus, while the central verb form “påvirke” appears 9,664 times.
All in all there are very few words matched with COR for this particular stem in this particular corpus.
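The comparison between the corpus and COR can be made a bit more explicit by computing coverage both over distinct words (types) and over occurrences (tokens). The sketch below is only illustrative: the function name is hypothetical, the counts are the corpus counts listed above, and the small lexicon set is a made-up stand-in for the COR forms, not a statement about what COR actually contains.

```python
def coverage(corpus_counts, lexicon_forms):
    """Return (type coverage, token coverage) of a lexicon over a corpus.

    corpus_counts maps each word form to its number of occurrences.
    """
    total_types = len(corpus_counts)
    total_tokens = sum(corpus_counts.values())
    if total_types == 0 or total_tokens == 0:
        return 0.0, 0.0
    matched = [word for word in corpus_counts if word in lexicon_forms]
    matched_tokens = sum(corpus_counts[word] for word in matched)
    return len(matched) / total_types, matched_tokens / total_tokens


corpus_counts = {
    "påvirkning": 25630,
    "påvirke": 9664,
    "vindpåvirkningsområde": 1,
}
# Hypothetical lexicon membership, for illustration only
lexicon_forms = {"påvirkning", "påvirke"}
type_coverage, token_coverage = coverage(corpus_counts, lexicon_forms)
print(f"types: {type_coverage:.2f}, tokens: {token_coverage:.5f}")
```

The two numbers can differ a lot: rare compounds and extraction errors pull the type coverage down, while the few very frequent forms dominate the token coverage.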
The Danish wordnet, DanNet, has even fewer words matching “påvirk”. With a UTF-8 DanNet word file:
grep påvirk words-utf8.rdf
Only 3 words are reported:
<wn20schema:lexicalForm>påvirke</wn20schema:lexicalForm>
<wn20schema:lexicalForm>upåvirkelig</wn20schema:lexicalForm>
<wn20schema:lexicalForm>påvirkningsmulighed</wn20schema:lexicalForm>
Extracting and counting variations of a word with a subword in a corpus
With a one-liner one can count the number of times variations of a subword occur. Here with a corpus in the file “sentences.txt”, searching for words containing “påvirk” and using Python for the matching.
cat sentences.txt | python -c "import re; p = re.compile(r'(\w*påvirk\w*)'); [print(m) for line in open(0).readlines() for m in p.findall(line.lower())]" | sort | uniq -c | sort -n
In the corpus I have from the DREAMS project, a part of the result is
979 støjpåvirkningen
988 miljøpåvirkningerne
1389 støjpåvirkning
1550 påvirkningerne
1699 miljøpåvirkning
2405 påvirker
3858 påvirkes
4130 miljøpåvirkninger
6539 påvirket
8483 påvirkningen
9664 påvirke
9876 påvirkninger
25630 påvirkning
grep has an issue with \w and locale that I have not been able to resolve. This does not count correctly:
grep -Pio "\w*påvirk\w*" sentences.txt | sort | uniq -c | sort -n
A word such as “miljøpåvirkning” is not matched. A number of other Linux tools do not necessarily work the way one would expect, see also case conversion.
The Python one-liner can be converted to a script
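#!/usr/bin/python
import re, sys

# The subword to search for is given as the first command-line argument
if len(sys.argv) < 2:
    print("Missing word to search for")
    sys.exit(1)

# Match whole words that contain the subword
pattern = re.compile(r'(\w*' + sys.argv[1] + r'\w*)')
for line in open(0).readlines():
    for match in pattern.findall(line.lower()):
        print(match)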
Then it can be used with, e.g.,:
cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n
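The sorting and counting done by sort and uniq can also be kept inside Python, which avoids the locale issues altogether. The following is a sketch rather than a replacement for the script: the function name is hypothetical, and it escapes the subword with re.escape so that characters such as “.” or “+” are matched literally, which the script above does not do.

```python
import re
from collections import Counter


def count_variations(lines, subword):
    """Count lowercased words that contain the given subword."""
    pattern = re.compile(r"\w*" + re.escape(subword.lower()) + r"\w*")
    counts = Counter()
    for line in lines:
        counts.update(pattern.findall(line.lower()))
    return counts


lines = ["Miljøpåvirkning og påvirkning.", "Ingen påvirkning af støj."]
counts = count_variations(lines, "påvirk")
# Least frequent first, like "sort -n" on the output of "uniq -c"
for word, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{count:7} {word}")
```

In Python 3, \w in a str pattern matches Unicode word characters, so letters such as æ, ø and å are handled without any locale setup.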
Part-of-speech tags in Det Centrale Ordregister
The Danish word registry Det Centrale Ordregister was launched at the end of May 2022. A file with the resource called ro2021-0.9.cor is available from the homepage. The part-of-speech tags used in the resource can be counted with this Python script:
#!/usr/bin/python
from collections import Counter

# Collect the first component of the tag in the fourth tab-separated column
pos = []
for line in open("ro2021-0.9.cor"):
    parts = line.split('\t')
    pos.append(parts[3].split(".")[0])

counts = Counter(pos)
for word, count in counts.most_common():
    print(f"{count:6} {word}")
The result is
339523 sb
 92900 adj
 79533 vb
  1388 prop
   904 adv
   559 fork
   269 kolon
   238 talord
   196 flerord
   147 udråbsord
   101 pron
    96 præp
    64 konj
    59 præfiks
    36 lydord
     2 fsubj
     1 infinitivens
     1 art
“fsubj” is “formelt subjekt”, – a special Danish word type, see, e.g., “Der – som formelt subjekt i dansk” (Scholia). “kolon” I have never heard about before. Words such as alstyrende, altforbarmende, amok, anno, anstigende, arilds and attentiden are labeled as “kolon”. “infinitivens” is the word “at” (“to” for the infinitive in English).
What SPARQL keywords do we use in Scholia?
Scaling the Wikidata Query Service seems to be a continuing concern for those that run the service. There is a general fear that we will run into hard restrictions with the Blazegraph software which is set up as a SPARQL endpoint for the Wikidata Query Service. In February 2022, there were two video sessions where the community had a chance to give input to a possible alternative/migration, and tell, for instance, what SPARQL query features are important.
In Scholia, we are using a range of SPARQL features. We had no overview of which features we use, but of the specialized Wikidata Query Service/Blazegraph features, I remember we are using the labeling service, the GAS service and the mwapi service. As most of our SPARQL code uses capital letters for keywords and functions and lowercase for variables, we can get a quick and dirty overview of the keywords and functions we are using with
git clone git@github.com:WDscholia/scholia.git
cd scholia/scholia/app/templates/
cat *.sparql | python -c 'import re; print("\n".join(re.findall("[A-Z_]{2,}", open(0).read())))' | sort | uniq -c | sort -nr
These three command-lines give
1179 AS
688 SELECT
650 WHERE
556 BY
407 BIND
313 INCLUDE
296 WITH
293 ORDER
264 DESC
263 GROUP
255 SERVICE
226 PREFIX
226 OPTIONAL
214 UNION
214 COUNT
195 DISTINCT
190 FILTER
147 LIMIT
145 AUTO_LANGUAGE
100 SAMPLE
82 CONCAT
79 STR
73 VALUES
71 GROUP_CONCAT
65 LANG
41 SUBSTR
40 YEAR
32 COALESCE
30 MIN
30 ID
28 EXISTS
27 IF
26 ENCODE_FOR_URI
26 CHEMICALS
24 REPLACE
23 NOT
22 MAX
17 SUM
14 RESULTS
14 ORCID
13 IRI
13 IK
13 ASC
12 CAS
10 URL
10 JOURNAL
10 _CID
8 INTENTION
7 LEGOLAS
6 NOW
6 HAVING
6 CITEDARTICLE
6 BFS
5 PCID
5 MINUS
5 CASID
5 BD
4 URI
4 ROUND
4 OR
4 DOI
3 STRSTARTS
3 ISSN
3 _ID
3 FFFFFF
3 BC
2 UNRESOLVED
2 TO
2 PC
2 MONTH
2 MOLS
2 LCASE
2 DAY
2 _CIDU
2 CASU
2 BB
2 ASK
2 AP
2 ALLOTROPES
1 TODO
1 STRLEN
1 ISBN
1 ID_T
1 GRID
1 GEPRIS
1 FF
1 END
1 EFFBD
1 EEEEEE
1 DDDDDD
1 CORDIS
1 BLANK
1 BFI
1 ABS
Here CHEMICALS and CITEDARTICLE must be variables, while, e.g., DDDDDD is a color specification. We are using the Blazegraph-specific WITH keyword a lot. This is usually for efficiency. Currently, we have few ASK and no CONSTRUCT.
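To separate real keywords from uppercase variables and color codes, the counted tokens can be split against a list of known keywords. This is a rough sketch: the function names are hypothetical, and the keyword set is a small, incomplete selection chosen for illustration, not the full SPARQL vocabulary.

```python
import re
from collections import Counter

# A small, incomplete selection of SPARQL keywords and functions
SPARQL_KEYWORDS = {
    "SELECT", "WHERE", "AS", "BIND", "ORDER", "BY", "GROUP", "DESC",
    "ASC", "SERVICE", "PREFIX", "OPTIONAL", "UNION", "COUNT",
    "DISTINCT", "FILTER", "LIMIT", "VALUES", "ASK", "CONSTRUCT",
    "WITH", "INCLUDE", "MINUS", "HAVING", "NOT", "EXISTS",
}


def uppercase_tokens(text):
    """Count runs of two or more uppercase letters or underscores."""
    return Counter(re.findall(r"[A-Z_]{2,}", text))


def split_keywords(counts, keywords=SPARQL_KEYWORDS):
    """Split token counts into known keywords and everything else."""
    known = {token: n for token, n in counts.items() if token in keywords}
    other = {token: n for token, n in counts.items() if token not in keywords}
    return known, other


text = 'SELECT ?work WHERE { BIND("#DDDDDD" AS ?color) } LIMIT 10'
known, other = split_keywords(uppercase_tokens(text))
print(known)
print(other)
```

Everything that lands in the second dictionary still needs a manual look, since it mixes variables, color codes and keywords missing from the list.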
University course emails per year
I have previously written about university course emails. The above figure shows the development of the number of received emails saved in my ‘teaching’ folder together with the number of received emails saved in my ‘teaching assistants’ folder. The projected number of emails for the year 2022 may be too large because of an unusually large number of emails in January 2022 from the students. As previously noted, the counts do not include the automated emails I receive from our question-answering site. I usually delete such emails.
There might be around 220 working days in Denmark, making the average number of emails per day around 10 or less. One should think that handling 10 emails would not amount to more than an hour of work based on my guesstimate in my previous blogpost though as noted the long tail of the distribution of the handling time may make the estimate quite uncertain.
Verbs in Danish Dependency Treebank
The Danish Dependency Treebank (DDT) v. 1.0 was made by Matthias Trautner Kromann (Scholia) at Copenhagen Business School from 2002 to 2004 according to its README file. It is based on texts from the PAROLE corpus collected by Ole Norling-Christensen (Scholia) from 1983 to 1992. DDT’s XML files are available with the data.
Quick and dirty grepping in DDT on one of the XML files with
grep --text 'msd="V' ddt-1.0.tag | wc
reports 15,597 verbs in the dataset.
A bit more counting with
grep --text 'msd="V' ddt-1.0.tag | python3 -c "print('\n'.join(line.split('\"')[1] for line in open(0, encoding='iso-8859-1').readlines()))" | sort | uniq -c | sort -nr | wc
shows 1,551 unique verb lemmas. With the words written to a file with
… | uniq | sort > ddt-verb-lemmas.txt
a sample of these are:
åbne, accelerere, acceptere, administrere, adskille, advare, ændre, ærgre, ætse, afblæse, afbryde, afdække, affærdige, affinde
How many of these lemmas are in Wikidata, which currently has over 2,900 Danish verbs?
ddt_verb_lemmas = set(open('ddt-verb-lemmas.txt').read().split())
query = """
SELECT ?lemma {
?lexeme dct:language wd:Q9035 ;
wikibase:lemma ?lemma ;
wikibase:lexicalCategory wd:Q24905 .
}
"""
import requests
url = "https://query.wikidata.org/sparql"
data = {'query': query, 'format': 'json'}
response = requests.get(url, data)
data = response.json()['results']['bindings']
wikidata_verb_lemmas = set([datum['lemma']['value'] for datum in data])
len(ddt_verb_lemmas - wikidata_verb_lemmas)
gives 269 missing verbs in Wikidata:
affærdige, barrikadere, bedyre, bekomme, belejre, belægge, belæsse, bemale, bemyndige, berettige, berømme, bestjæle, bestorme, betræde, beære, bilægge, bivåne, blusse, blødgøre, brodere, budgettere, bugne, defilere, demoralisere, deportere, destillere, destruere, detronisere, diktere, doble, drapere, dvæle, ekskludere, ekstraindkalde, fabrikslukke, fartbegrænse, fastansætte, fejl-operere, fetere, filosofere, fin-sortere, flade, flagre, flakke, flakse, flamme, flintre, flirte, focusere, foragte, forankre, foranledige, forbeholde, fordufte, forføre, forkalke, forlige, forlise, forlove, formilde, forpligtige, forrykke, forskyde, fortrække, fortvivle, forudsige, fradømme, fralægge, frankere, fraråde, fraskrive, frasortere, fratræde, fremholde, frikende, friske, frustrere, fuldføre, fusionere, fyge, fænge, geare, gennembanke, gennemopleve, gennemprøve, genoptrykke, gestikulere, gispe, gjalde, glatte, gløde, gravere, grunde, gyse, hage, havne, hefte, hegle, henholde, henkaste, hentyde, henvide, humpe, huse, hverve, hvine, hvisle, hyre, hæge, hærde, iblande, illudere, improvisere, indgribe, indhylle, indhøste, indkassere, indkvartere, indoktrinere, indordne, indpakke, indskyde, indsmugle, indstemme, indstifte, indvie, indvælge, jævnføre, kanonere, kante, kikke, kime, kimse, knirke, knuge, kolportere, kommandere, kompromittere, konkretisere, kriminalisere, krone, kue, kuldsejle, kvie, legalisere, lirke, lue, lædere, lække, læsetræne, læspe, løje, løsne, løsrive, medfølge, mestre, mishandle, modsætte, mæske, mønstre, mørkelægge, narre, neddæmpe, nedsable, nedsænke, nedværdige, nidstirre, niveaudele, nyinvestere, nødlande, oparbejde, opdigte, opfange, opfølge, ophidse, oplagre, oplive, opstøve, opsøge, optræne, opøve, overbeglo, overfylde, overhælde, overrende, overrumple, overskrue, overvurdere, passivisere, pervertere, plette, plombere, praktisere, prissætte, profitere, programmere, pryde, præferere, prøvekøre, puffe, pulverisere, påbyde, påklage, pønse, rafle, rappe, 
rasere, ratificere, rivalisere, rumstere, røbe, røge, sagsøge, sammenstille, sammenstykke, sammentømre, sample, samstemme, scanne, ses, signere, simre, skeje, skippe, sladre, slentre, slynge, sløve, smashe, smugle, snage, spekulere, spraye, stakke, stationere, sukke, surmule, svimle, sønderlemme, tackle, tangere, taxie, tilbede, tildanne, tilsende, tilstå, time, tippe, tjekkere, toppe, trisse, trone, trygle, tv-annoncere, ty, tøffe, udfritte, udglatte, udlære, udmærke, udskære, udstyre, udtørre, ulovliggøre, vandkæmme, virkeliggøre, våge, vænne, værdsætte
Some of these are homonyms with other words: flade, flamme, friske, glatte, grunde, hage, havne, huse, krone, mestre, mønstre, time, toppe.
Three have alternative lemmas: indvi/indvie, kigge/kikke, tackle/takle. These are already in Wikidata, so actually only 266 verbs are missing in Wikidata.
One is an independent deponent verb: ses
There are at least two errors in DDT: tjekkere, henvide. They occur in the sentences “Der er også radikale tjekkere, som må ændring holdning.” and “Det så vældig godt ud, der var også udleveret henved 100 fakler.” So 264 missing verbs in Wikidata.
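The manual corrections above can be folded into the set difference, so the adjusted number comes out directly. This is a sketch with assumptions: the function name is hypothetical, and the direction of the variant mapping (DDT spelling to Wikidata spelling) is inferred from which spelling appears in the list of missing verbs.

```python
# Assumed mapping from the DDT spelling to the spelling found in Wikidata
VARIANTS = {"indvie": "indvi", "kikke": "kigge", "tackle": "takle"}

# Lemmas that come from annotation errors in DDT
DDT_ERRORS = {"tjekkere", "henvide"}


def missing_lemmas(corpus_lemmas, lexicon_lemmas,
                   variants=VARIANTS, errors=DDT_ERRORS):
    """Return corpus lemmas that are absent from the lexicon.

    Known corpus errors are skipped, and a lemma counts as present
    if its alternative spelling is in the lexicon.
    """
    missing = set()
    for lemma in corpus_lemmas:
        if lemma in errors:
            continue
        if lemma in lexicon_lemmas or variants.get(lemma) in lexicon_lemmas:
            continue
        missing.add(lemma)
    return missing


# Tiny illustrative sets; the real ones come from DDT and the SPARQL query
ddt = {"indvie", "kikke", "tackle", "tjekkere", "henvide", "bedyre", "åbne"}
wikidata = {"indvi", "kigge", "takle", "åbne"}
print(sorted(missing_lemmas(ddt, wikidata)))
```

With the real sets, the result would correspond to ddt_verb_lemmas - wikidata_verb_lemmas with the three variant pairs and the two errors removed, that is, 269 - 3 - 2 = 264.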
Unusual Danish words
ae (to caress): A word consisting of two vowels.
bagagebærer (bicycle luggage carrier): Compound of bagage+bærer. The part of the bike that can carry goods. The second part, when viewed in isolation, is a (human) nomen agentis, but in the compound it is non-human.
bringe (the verb to bring): Danish conjugation adds suffixes or, for some verbs, changes a vowel. For this word, a vowel is changed and a consonant also disappears in some conjugated forms (bringe, bringer, bragte, bragt).
bærebar (portable): compound adjective with two parts from the same Indo-European root.
fise (to fart): verb with three different conjugation types. For instance, active voice preterite may be fiste, fisede or fes.
fyr (guy, pine tree, heating unit/light tower): three different noun lexemes. There is also a verb with a conjugation that results in “fyr”.
led: a representation associated with many lexemes. Ordia shows 9 different representations from six different lexemes: mean, suffered, suffer, search, joint, joints, side, gate, gates.
hænge (hang): Two types of conjugation that determine whether the verb is transitive or intransitive.
højttaler/højtaler (loudspeaker): One of the few Danish lexemes with multiple lemmas: an extra “t”.
institutionalisere (institutionalize): Perhaps the longest Danish word with one root. Though the Latin origin is from in- and statuo according to English Wiktionary. Other words of this kind are rekontekstualisere and repræsentantskab.
menneskerettighedsorganisation (human rights organization): The longest Danish word in common usage. It appears in Den Danske Ordbog.
ondskabsfuldhed (malice): long Danish word from one root and three derivations: adjective to noun to adjective to noun.
rødhals (Erithacus rubecula): Compound of rød and hals, red neck. It is not a hyponym of neck, but a particular bird species. Many Danish plant and animal names overrule the pattern that says that the first compound part specifies a type of the second compound.
sandkassesand (sandbox sand): Compound noun with the same noun twice (sand+kasse)+sand. This is a common product, see, e.g., certified “sandkassesand” from Coop.
service (tableware, service): A loanword that has come in twice, once from French, once from English. They have different grammatical gender. The French loanword is neutrum, while the English is common.
sigte (aim, charge, sieve): A lemma with three different verb lexemes. There are furthermore two related noun lexemes (aim, sieve).
skrivelse (a writing, note): A verbal noun that is “innexual”. Usually Danish verbal nouns are denoting an activity. Skrivelse is not the only one of this kind.
værdipapirfinansieringstransaktionseksponering (securities financing transaction exposures): Danish word identified as the longest word yet found in authoritative text. It appears in European Union legislation: “1.9.3. Endvidere bør de nuværende regler i kapitalkravsforordningen til fastsættelse af løbetidsparametret udstrækkes til at omfatte derivat- og værdipapirfinansieringstransaktionseksponeringer samt transaktioner uden fast løbetid.”
øl (beer): Can have two genders (common and neutrum) and there is a slight semantic difference between the two corresponding to the English one beer and some beer.
Danish verbs in Wikidata and DanNet
Besides finding Danish verbs by looking through Danish corpora, see, e.g., Finding Danish verbs in Europarl with Python, you can also find Danish verbs from lexical resources.
Danish verbs are available in DanNet, the Danish wordnet. These can be extracted from the RDF XML files. Here with a Python script relying on rdflib and a SPARQL query
from rdflib import Graph

# Load the DanNet RDF/XML files with the words and their parts of speech
filenames = "words part_of_speech".split()

g = Graph()
for filename in filenames:
    g.parse(filename + '.rdf')

query = """
SELECT ?word ?representation {
  ?word dn_schema:partOfSpeech "Verb" .
  ?word wn20schema:lexicalForm ?representation .
}"""
result = g.query(
    query,
    initNs={
        'dn': 'http://www.wordnet.dk/owl/instance/2009/03/instances/',
        'dn_schema': 'http://www.wordnet.dk/owl/instance/2009/03/schema/',
        'wn20schema': 'http://www.w3.org/2006/03/wn/wn20/schema/',
    })

# Print and count the single-word verbs, skipping entries with a space
n = 0
for row in sorted(result, key=lambda row: row[1]):
    if ' ' not in row[1]:
        n += 1
        print("{:11} {}".format(str(row[0])[58:], row[1]))
print("Total number of single-word verbs: {}".format(n))
The full list returns 5581 entries. A number of these verbs include the reflexive “sig” (e.g., “forvente sig”, “foræde sig”) or adverbs (e.g., “abe efter”, “arbejde over”). Excluding these (' ' not in row[1]) one ends up with 3994 single words. Among those words are a number I do not recall having run into before: pizzikere, afhjelme, indsukre, …
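Instead of simply dropping the multi-word entries, they could be sorted into rough groups first. The sketch below is only a heuristic with a hypothetical function name: it treats any entry with “sig” after the first word as reflexive and every other entry containing a space as a generic multi-word verb, which will not be right in every case.

```python
def classify_entry(representation):
    """Roughly classify a DanNet verb entry by its surface form."""
    tokens = representation.split()
    if len(tokens) <= 1:
        return "single"
    # Heuristic: "sig" after the verb signals a reflexive entry
    if "sig" in tokens[1:]:
        return "reflexive"
    return "multiword"


for entry in ["pizzikere", "forvente sig", "abe efter"]:
    print(entry, classify_entry(entry))
```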
In Wikidata, the DanNet word identifier may be entered using the P6140 property.
Not all Danish verbs are in DanNet. In the beginning of January 2022, there were 2862 Danish verbs in Wikidata. This number ranks Danish as number 11 in terms of number of verbs entered, see Ordia’s statistics where Indonesian is ahead with 12,781 verbs followed by the 7,932 of Estonian. In Wikidata, I have marked verbs not available in DanNet with Wikidata’s novalue, so they can be queried with the Wikidata Query Service and this SPARQL query
SELECT ?lexeme ?lemma {
?lexeme wikibase:lemma ?lemma ;
dct:language wd:Q9035 ;
wikibase:lexicalCategory wd:Q24905 ;
a wdno:P6140 .
}
In January 2022, there were around 580 Danish verbs in Wikidata that were marked as not being in DanNet, e.g., dibse, indebære, erfare, misforstå and opgradere. Such words are fairly common, not like DanNet’s “afhjelme”, – a word that can cause confusion according to a text at the Danish parliament website.
Not all of Wikidata’s Danish verbs have indicated whether the word is in DanNet or not in DanNet. Wikidata Query Service may tell us about these words
SELECT ?lexeme ?lemma {
?lexeme wikibase:lemma ?lemma ;
dct:language wd:Q9035 ;
wikibase:lexicalCategory wd:Q24905 .
FILTER NOT EXISTS { ?lexeme p:P6140 [] }
}
As of 6 January 2022, there were 387 Danish verbs where the DanNet status was not indicated. These instances may be because I forgot to type it in, a typo or it was difficult. For “difficult” consider the word “sigte”. It has three DanNet verb homograph lexemes (aim, charge, sieve). The Wikidata lexeme L207570 had mixed these lexemes through links to the lexemes tilsigte and sigtelse, – as far as I can tell.