Linking from Danish Wikidata lexemes to COR

As I have previously reported, the Danish word registry, Det Centrale Ordregister, was launched in May 2022.

The words are identified by COR identifiers, mimicking the Danish CPR (identifier for Danish persons) and CVR (identifier for Danish organizations and companies). There is now a tentative URL for each lexeme COR. For instance, the lexeme “bankdirektør” (bank manager) has the tentative URL https://ordregister.dk/id/COR.58789/.
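
The URL pattern can be sketched in Python. This is only an illustration based on the single example above: the function name cor_lexeme_url and the identifier pattern are my own assumptions, and the URL scheme is tentative and may change.

```python
import re

# Assumed shape of a lexeme-level COR identifier, e.g. "COR.58789".
# Form-level identifiers with more components are deliberately rejected.
COR_LEXEME_PATTERN = re.compile(r"COR\.\d+")


def cor_lexeme_url(cor_id: str) -> str:
    """Return the tentative ordregister.dk URL for a lexeme COR identifier."""
    if not COR_LEXEME_PATTERN.fullmatch(cor_id):
        raise ValueError(f"Not a lexeme COR identifier: {cor_id!r}")
    return f"https://ordregister.dk/id/{cor_id}/"


print(cor_lexeme_url("COR.58789"))
```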

In Wikidata, I suggested two properties for the COR identifiers: one for the lexemes and one for the forms. These two properties have now been accepted and are available as P10830 (for a form) and P10831 (for a lexeme). Statistics in Ordia on 20 June 2022 showed that we had 44 form CORs and 21 lexeme CORs represented in Wikidata. There are now several hundred. These have been entered by me manually. Version 0.9 of COR has 516,017 form CORs, so entry of the data should be automated if we want to reach good coverage. So far, the data entry has mostly served to determine which problems one would run into in the mapping between Wikidata lexemes and COR. And there is quite a lot of thought that needs to go into the “ontology alignment” between COR and Wikidata. Based on the part-of-speech tags, here are some notes:

  1. Article (art): There is currently only one listed in COR: “den” COR.00267. This singleton seems to be an error or a strangeness that I do not understand. We have “en”, “et”, “den”, “det” and “de”. In Wikidata we currently have 3 Danish articles: en/et, “den” and “det”. The en/et aggregation follows Den Danske Ordbog, which aggregates “en” and “et”. Den Danske Ordbog has “den”, “det” and “de” as articles, but as separate lexemes. So it seems that we in Wikidata have (partially) followed the inconsistency in Den Danske Ordbog by merging “en” and “et” (indefinite articles) and splitting “det” and “den” (the definite articles). Could it be argued that all should be one lexeme? Retskrivningsordbogen has one lexeme for den/det/de and one lexeme for en/et. Here there seems to be a reasonable consistency. I suspect we will see an update of COR to mirror Retskrivningsordbogen.
  2. Infinitive marker (“infinitivens”, infinitivens partikel, infinitivens mærke): There is only one: the word “at”. There is a homograph, the conjunction “at”. There is a Wikidata item for the part-of-speech concept: infinitive marker Q85103750. The Wikidata lexeme for the “at” conjunction has been there for a while with L34817 and linked to COR.00145. And now there is a Wikidata lexeme for the “at” infinitive marker as L678570 and linked to COR.00292. Thus this part-of-speech class is complete.
  3. Formal subject (fsubj):
    1. COR records two Danish words: “der” and “her”. These words are in Wikidata as L3064 and L45364, respectively, and both with links to COR as COR.00721 and COR.00751, respectively. So this small class is also complete.
  4. Onomatopoeia (lydord): 36 onomatopoeia forms are recorded in COR. Wikidata has had 42 Danish onomatopoeia lexemes. All COR onomatopoeia are now in Wikidata and linked.
    1. Wikidata has some that COR lacks. For instance, Wikidata’s “atju”, “vuf” and “kvæk” do not appear in COR.
    2. There is only one form for each lexeme, except for the cat sound “miav”, which has the forms “miav” and “mjav” (Ordia).
    3. A problem is to determine what they mean. For instance, what does “sum” mean? Could it correspond to the English “buzz” or humming? Nor do I know what kind of sound “bums” is.
  5. Prefix (præfiks): There are 59 prefixes in COR. They are represented with one form each. In Wikidata, prefixes are currently mostly represented as affixes or as morphemes. Some of these are regarded as instances of “prækonfiks”, see, for instance, “øko-”. In Den Danske Ordbog, “øko-” is recorded as a prefix. Prefixes such as “for-”, “u-” and “be-” are not recorded in COR. Most of the COR prefixes are what has been termed kryptorod/confix or skabsaffiks/pseudoaffix, see, e.g., Substantiviske Komplekse Ord Med Subkonfikser I Moderne Dansk. Though the lexical category currently does not align between COR and Wikidata, it does not seem to matter for the individual linking. Currently, Wikidata does not record a form for prefixes. My reason for that was that the prefixes are not materialized in real words, – only through derivations.
  6. Conjunction (konj):
    1. 64 conjunction forms and 62 conjunction lexemes in COR and 66 Danish conjunction lexemes in Wikidata.
    2. Four conjunction forms in COR are from two lexemes: imedens/imens and mens/medens. Mens and medens were split in Wikidata. Imedens and imens were not represented as conjunctions in Wikidata. They are all linked now.
    3. “omend” has lemma “om end” and form “omend” in COR. I do not know why.
    4. The same is the case with “selvom” where the lemma is “selv om”.
    5. “dels” is in COR regarded as an adverb. In Den Danske Ordbog dels is a conjunction.
    6. “plus at” is not in COR.
    7. “hvorimod” is an adverb in COR and in Den Danske Ordbog. In some other works it is regarded as a “subordinating conjunction” or a “concessive conjunction”.
    8. The same is the case with “hvor”. In COR and Den Danske Ordbog it is an adverb. In Quasi-synonymy of Danish temporal conjunctions from the anthropocentric point of view it is referred to as a temporal conjunction. There is already a “hvor” adverb in Wikidata.
    9. How do we fix this? The “hvor” can be merged in Wikidata. For “hvorimod” the lexical category in Wikidata can be changed to adverb and for “dels” Wikidata could somehow note that COR and Den Danske Ordbog disagree.
  7. Prepositions (præp):
    1. 96 preposition forms in COR. They have all been added to Wikidata and linked to COR. So this class is complete.
    2. COR prepositions only have one form.
    3. “ad” is homographic with two versions, – one from Latin.
    4. Wikidata has currently “henover” as a preposition. That preposition is not found in COR. Apparently it has not been affected by the so-called 2012 rule.
    5. Bokmål currently has more prepositions (106) than Danish in Wikidata. vis-a-vis is entered as two different lexemes with variation a/à. There are also some words such as østfra, østover, vestfra, … in Bokmål that are not in Danish. In Den Danske Ordbog the corresponding Danish words are “only” adverbs, see, e.g., østover.
  8. Pronouns (pron): 101 pronoun forms in COR.
    1. “som” is present in Den Danske Ordbog but not in COR.
    2. I suspect there are many issues here. I have not yet looked into the lexical category.
  9. Interjections (udråbsord): 147 interjection forms in COR.
  10. Phrases (flerord): 196 phrase forms in COR.
  11. Numerals (talord): 238 numeral forms in COR.
    1. COR numerals come with two forms: normal and genitive. Wikidata had so far not recorded a genitive version of Danish numerals.
  12. “kolon” (kolon): 269 of these forms in COR.
  13. Abbreviations (fork): 559 abbreviation forms in COR.
    1. Abbreviations in COR may be recorded with both upper and lower case versions, e.g., ADHD and adhd.
    2. Abbreviations may have a genitive form. This includes units such as “A” for ampere, and “ml.”, which is the abbreviation for mellem (between). Mellem does not have a genitive (in English it would correspond to between’s?). This seems strange.
    3. There is usually no explanation for the abbreviations.
  14. Adverb (adv): 904 adverb forms in COR.
    1. “hvorimod” is regarded as an adverb, while other works regard it as a conjunction, see Wikidata references at L42250.
  15. Proper nouns (prop): 1,388 proper noun forms in COR. These are mostly geographical entities.
    1. The proper nouns come with a normal form and a genitive form.
    2. There are some surprising entries: I, L, M and V. Roman numerals, I suppose? Why?
    3. Proper nouns can have alternative forms, e.g., Ålborg/Aalborg.
  16. Verb (vb): 79,533 verb forms in COR.
    1. There are passive infinitive verb forms in COR. These have not been entered in Wikidata. They have the same form as the passive finite present forms that are already in Wikidata.
    2. In COR, skryde has two past tense forms in active: skrydede and skrød. But in passive there is only skrydedes, not skrødes. And there is no supinum form for the irregular form.
    3. Perfectum participium in its adjective function is listed under the verb lexeme. It is not distinguished from a supinum function.
    4. The common verbs have and være have plural and definite perfectum participium forms listed: hafte and værede. They sound strange to me.
  17. Adjectives (adj): 92,900 adjective forms in COR.
    1. Among the forms are some highly unusual ones, e.g., “aproposere” and “aproposeste”. In Retskrivningsordbogen “apropos” is regarded as an uninflectable adjective. In Den Danske Ordbog it is not even an adjective. Another example is værd, which is listed with forms such as værdere and værdest.
    2. Even though the common gender and the neutrum forms are the same, they are listed as separate. This is currently not done for the Danish adjectives in Wikidata.
    3. Perfectum participium verb forms are usually not regarded as adjectives, but sometimes they are. A word such as “betinget” is reported both as a “perf.part” verb form and as a separate adjective. The perf.part verb form has only one form for singular indefinite, while the adjective distinguishes between a neutrum and a common gender form even though they are the same.
    4. Adverbs derived from the adjectives are listed under the adjective lexeme.
  18. Nouns (sb): 339,523 noun forms in COR.
    1. COR comes with genitive forms that are currently not in Wikidata. This decision was based on one user arguing that the Danish genitive is not a real genitive but an enclitic. We should probably change that in Wikidata, so the genitive forms of nouns are recorded.
    2. Genitive forms in COR are marked as genitive, but non-genitive forms are not marked.
    3. “druk” is an example of a word where it is difficult to know whether it is a common gender or a neutrum word, as no article is used for the word. The gender might only be revealed through adjectives or co-reference. COR records the form with two different identifiers: one for the common gender and one for the neutrum.
    4. “kirsebær” (cherry) is recorded as two different lexemes: one for common gender and the other for neutrum gender. They distinguish between the tree and the berry. In Den Danske Ordbog it is one lexeme and the difference is explained.
    5. Many kentaur nouns (words such as råben, skrigen, løben, …) are not recorded in COR, – neither as separate nouns nor as conjugations of a verb.
    6. The few kentaur nouns that are recorded come with a genitive form. This is odd. “Grammatik over det danske sprog” states that they have no genitive form.
    7. The noun “A”/“a” has two different lexemes: one for the uppercase and one for the lowercase version. This is the same for all letters. I do not see why uppercase and lowercase letters should be split across lexemes.

Other problems:

How to represent alternative forms, e.g., mørklægge/mørkelægge or højtaler/højttaler. In Wikidata, they are recorded as separate forms and linked individually to their corresponding COR identifier. The “alternative form” Wikidata property is used to link the two spelling variations.

More unusual Danish words

I have previously written about unusual Danish words. Here are a few more.

angstskrig: As noted, this is the Danish word listed in Den Danske Ordbog with the most consecutive consonants: ngstskr. Words in Wikidata with six consecutive consonants are pattebarnssprog (rnsspr), hårdbundsstruktur (ndsstr) and sagsbehandlingsskridt (ngsskr), according to a Wikidata Query Service search. They are all compounds. gesandtskab and bekendtskab are non-compound words with five consecutive consonants.

detailplanlægning: The first part of the compound is detalje, but in the compound it has a different form. Den Danske Ordbog has both detailplanlægning and detaljeplanlægning. Den Danske Ordbog does not have detail as a separate word, – only as the prefix detail-.

druk: A noun with an unclear grammatical gender as no article is usually used for the word. Den Danske Ordbog records it as either common gender or neutrum.

mødre: plural of moder, but also of the short form mor. In Wikidata, it was originally entered (probably by me) as two separate lexemes, but they are now merged to a single lexeme.

niveauet/niveauer/niveauerne: Inflections of the Danish word niveau and they have the most consecutive vowels – found so far by searching with Wikidata Query Service.

tippe: Den Danske Ordbog regards this verb as two lexemes: 1 and 2. They are both recorded as coming from English. Likewise, the derived noun tipning is also regarded as two lexemes. Retskrivningsordbogen only has one tippe.
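
The consonant and vowel runs mentioned above can be checked with a few lines of Python. This is a minimal sketch and not the Wikidata Query Service search itself. It assumes that the Danish vowel letters are a, e, i, o, u, y, æ, ø and å and that every other letter counts as a consonant. The function name longest_run is made up for the illustration.

```python
import re

# Assumed Danish vowel letters; all other letters are treated as consonants.
VOWELS = "aeiouyæøå"
VOWEL_RUN = re.compile(f"[{VOWELS}]+")
# Letters only (no digits, underscore or punctuation) that are not vowels.
CONSONANT_RUN = re.compile(rf"[^\W\d_{VOWELS}]+")


def longest_run(word: str, vowels: bool = False) -> str:
    """Return the longest run of consecutive consonants (or vowels) in a word."""
    pattern = VOWEL_RUN if vowels else CONSONANT_RUN
    runs = pattern.findall(word.lower())
    return max(runs, key=len, default="")


print(longest_run("angstskrig"))             # ngstskr
print(longest_run("niveauet", vowels=True))  # eaue
```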

Coverage of Det Centrale Ordregister for technical reports

How well does Det Centrale Ordregister (COR), the Danish national word register, cover words in a corpus of technical reports? Words with the stem “påvirk” are interesting in terms of our DREAMS project. In the project, we process Danish environmental impact assessment reports, and “påvirk” is the stem corresponding to the English word “impact”.

“påvirk” words from the COR database can be extracted with

grep "påvirk" ro2021-0.9.cor | awk -F'\t' '{print $1, $5}'

One finds 86 words (forms) matching “påvirk”, with examples:

      1 COR.56543.110.01 g-påvirkning
      2 COR.56543.111.01 g-påvirkningen
      3 COR.56543.112.01 g-påvirkninger
      4 COR.56543.113.01 g-påvirkningerne
      5 COR.56543.114.01 g-påvirknings
      6 COR.56543.115.01 g-påvirkningens
      7 COR.56543.116.01 g-påvirkningers
    ...
     81 COR.22506.311.01 upåvirkeligst
     82 COR.21653.300.01 upåvirket
     83 COR.21653.301.01 upåvirket
     84 COR.21653.302.01 upåvirkede
     85 COR.21653.303.01 upåvirkede
     86 COR.21653.309.01 upåvirket
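
The same extraction can be sketched in Python. Following the awk command above, the sketch assumes a tab-separated file with the identifier in the first column and the form in the fifth column. Unlike grep, it only looks for the subword in the form column. The function name find_forms is hypothetical, and the filler values in the sample rows are invented just to make the example self-contained.

```python
def find_forms(lines, subword, id_column=0, form_column=4):
    """Yield (identifier, form) pairs for rows whose form contains subword."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) > form_column and subword in parts[form_column]:
            yield parts[id_column], parts[form_column]


# Made-up rows: only the identifiers and forms are taken from the listing
# above, the "-" columns are placeholders for content not shown there.
sample = [
    "COR.56543.110.01\t-\t-\t-\tg-påvirkning\n",
    "COR.58789.110.01\t-\t-\t-\tbankdirektør\n",
    "COR.21653.300.01\t-\t-\t-\tupåvirket\n",
]

for identifier, form in find_forms(sample, "påvirk"):
    print(identifier, form)
```

With the real file, the lines argument could be an open file object for ro2021-0.9.cor.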

Some oddities are “letpåvirkeligere” and “upåvirkeligst”. Google search returns practically no examples on the Internet for such words. One sole example is “…i en endnu letpåvirkeligere alder…“.

There are a few compounds: g-påvirkning, LSD-påvirket, narkotikapåvirket, and spirituspåvirket.

As explained in Extracting and counting variations of a word with a subword in a corpus, words from the DREAMS project corpus with the stem “påvirk” can be extracted with

cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n

There are 543 words (forms) with “påvirk”, including spelling errors and/or PDF extraction errors, for instance, “detteafsnitbeskriveshvilketrafikpåvirkninger” and “påvirknng”. There are many compounds. An excerpt is:

    230       1 vibrationspåvirknin
    231       1 vilpåvirke
    232       1 vindmiljøpåvirkningen
    233       1 vindmøllerspåvirkningaf
    234       1 vindpåvirk
    235       1 vindpåvirkningsområde
    236       1 vurderingafpåvirkning
    237       1 ændretvandpåvirkning
    238       2 ammoniakpåvirkninger
    239       2 anlægspåvirkninger
    240       2 arbejdsmiljøpåvirkninger
    ...
    429       9 klimapåvirkningsgraden
    430       9 miljøpåvirket
    431       9 temperaturpåvirkninger
    432       9 vindpåvirkningerne
    433      10 forureningspåvirkning
    434      10 kulturpåvirkede
    435      10 kulturpåvirket
    436      10 påvirkelig
    437      10 påvirkende
    ...
    534    1550 påvirkningerne
    535    1699 miljøpåvirkning
    536    2405 påvirker
    537    3858 påvirkes
    538    4130 miljøpåvirkninger
    539    6539 påvirket
    540    8483 påvirkningen
    541    9664 påvirke
    542    9876 påvirkninger
    543   25630 påvirkning

Here the central noun form “påvirkning” appears 25,630 times in the corpus, while the central verb form “påvirke” appears 9,664 times.

All in all there are very few words matched with COR for this particular stem in this particular corpus.
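
One way to put a number on “very few” is to compute the coverage both over distinct forms (types) and over occurrences (tokens). The sketch below is my own illustration and was not used for the numbers above. The function name coverage is hypothetical, and the input is assumed to be a mapping from corpus form to count plus a collection of lexicon forms.

```python
def coverage(corpus_counts, lexicon_forms):
    """Return (type coverage, token coverage) of a lexicon over a corpus.

    corpus_counts maps each corpus form to its number of occurrences.
    """
    lexicon = {form.lower() for form in lexicon_forms}
    covered = {w: n for w, n in corpus_counts.items() if w.lower() in lexicon}
    n_types = len(corpus_counts)
    n_tokens = sum(corpus_counts.values())
    type_coverage = len(covered) / n_types if n_types else 0.0
    token_coverage = sum(covered.values()) / n_tokens if n_tokens else 0.0
    return type_coverage, token_coverage


# Toy numbers, not taken from COR or the DREAMS corpus.
toy_counts = {"alpha": 3, "beta": 1}
print(coverage(toy_counts, ["Alpha"]))
```

The two numbers can differ a lot: a lexicon may miss most of the rare compounds and still cover the bulk of the occurrences if the frequent forms are present.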

The Danish wordnet, DanNet, has even fewer words matching “påvirk”. With a UTF-8 DanNet word file:

grep påvirk words-utf8.rdf

Only 3 words are reported:

    <wn20schema:lexicalForm>påvirke</wn20schema:lexicalForm>
    <wn20schema:lexicalForm>upåvirkelig</wn20schema:lexicalForm>
    <wn20schema:lexicalForm>påvirkningsmulighed</wn20schema:lexicalForm>

Extracting and counting variations of a word with a subword in a corpus

With a one-liner, one can count the number of times each variation of a word containing a given subword occurs. Here the corpus is in the file “sentences.txt”, the search is for words containing “påvirk”, and Python is used for the matching.

cat sentences.txt | python -c "import re; p = re.compile(r'(\w*påvirk\w*)'); [print(m) for line in open(0).readlines() for m in p.findall(line.lower())]" | sort | uniq -c | sort -n

In the corpus I have from the DREAMS project, a part of the result is

    979 støjpåvirkningen
    988 miljøpåvirkningerne
   1389 støjpåvirkning
   1550 påvirkningerne
   1699 miljøpåvirkning
   2405 påvirker
   3858 påvirkes
   4130 miljøpåvirkninger
   6539 påvirket
   8483 påvirkningen
   9664 påvirke
   9876 påvirkninger
  25630 påvirkning

grep has an issue with \w and locale that I have not been able to resolve. This does not count correctly:

grep -Pio "\w*påvirk\w*" sentences.txt | sort | uniq -c | sort -n

A word such as “miljøpåvirkning” is not matched. A number of other Linux tools do not necessarily work the way one would expect, see also case conversion.

The Python one-liner can be converted to a script

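#!/usr/bin/env python3
"""Print every word on standard input that contains the given subword."""

import re
import sys

if len(sys.argv) < 2:
    print("Missing word to search for", file=sys.stderr)
    sys.exit(1)

# The argument is used as a regular expression fragment, as in the one-liner.
pattern = re.compile(r'(\w*' + sys.argv[1] + r'\w*)')

# Lowercase each line so that capitalized variants are counted together.
for line in sys.stdin:
    for match in pattern.findall(line.lower()):
        print(match)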

Then it can be used with, e.g.,:

cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n
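
If one prefers to stay in Python, the sort | uniq -c | sort -n part of the pipeline can be replaced by collections.Counter. The following is a minimal sketch along those lines. The function name count_variations is my own, and unlike the script above it escapes the subword, so the subword is matched literally and not as a regular expression.

```python
import re
from collections import Counter


def count_variations(lines, subword):
    """Count lowercased words containing subword in an iterable of lines."""
    # \w is Unicode-aware for str patterns in Python 3, so æ, ø and å match.
    pattern = re.compile(r"\w*" + re.escape(subword) + r"\w*")
    counts = Counter()
    for line in lines:
        counts.update(pattern.findall(line.lower()))
    return counts


example = [
    "Støjpåvirkning og miljøpåvirkning vurderes.",
    "Miljøpåvirkning kan påvirke området.",
]
# Print in ascending order of count, as "sort -n" would.
for word, count in sorted(count_variations(example, "påvirk").items(),
                          key=lambda item: item[1]):
    print(f"{count:7} {word}")
```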

Part-of-speech tags in Det Centrale Ordregister

The Danish word registry Det Centrale Ordregister was launched at the end of May 2022. A file with the resource, called ro2021-0.9.cor, is available from the homepage. The part-of-speech tags used in the resource can be counted with this Python script:

#!/usr/bin/python

from collections import Counter

pos = []

for line in open("ro2021-0.9.cor"):
    parts = line.split('\t')
    pos.append(parts[3].split(".")[0])

counts = Counter(pos)

for word, count in counts.most_common():
    print(f"{count:6} {word}")

The result is

339523 sb
 92900 adj
 79533 vb
  1388 prop
   904 adv
   559 fork
   269 kolon
   238 talord
   196 flerord
   147 udråbsord
   101 pron
    96 præp
    64 konj
    59 præfiks
    36 lydord
     2 fsubj
     1 infinitivens
     1 art

“fsubj” is “formelt subjekt”, – a special Danish word type, see, e.g., “Der – som formelt subjekt i dansk” (Scholia). I have never heard of “kolon” before. Words such as alstyrende, altforbarmende, amok, anno, anstigende, arilds and attentiden are labeled as “kolon”. “infinitivens” is the word “at” (“to” for the infinitive in English).
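
The counts above are per form. To count lexemes per part of speech instead, one could group the forms by their lexeme identifier. The sketch below assumes that the lexeme identifier is the first two dot-separated components of the form identifier (e.g., COR.56543 for COR.56543.110.01), which is my reading of the identifiers and not something I have seen documented. It also assumes the same columns as the script above. The function name and the sample rows are made up.

```python
from collections import defaultdict


def lexemes_per_pos(lines, id_column=0, pos_column=3):
    """Count distinct lexeme identifiers for each part-of-speech tag."""
    lexemes = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) <= pos_column:
            continue  # Skip rows that are too short to hold a tag
        # Assumption: "COR.56543.110.01" belongs to lexeme "COR.56543".
        lexeme_id = ".".join(parts[id_column].split(".")[:2])
        pos = parts[pos_column].split(".")[0]
        lexemes[pos].add(lexeme_id)
    return {pos: len(ids) for pos, ids in lexemes.items()}


# Made-up rows: the "-" columns and the tag suffixes are placeholders.
sample_rows = [
    "COR.56543.110.01\t-\t-\tsb.x\tg-påvirkning\n",
    "COR.56543.111.01\t-\t-\tsb.y\tg-påvirkningen\n",
    "COR.21653.300.01\t-\t-\tadj.x\tupåvirket\n",
]
print(lexemes_per_pos(sample_rows))
```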

What SPARQL keywords do we use in Scholia?

Scaling the Wikidata Query Service seems to be a continuing concern for those who run the service. There is a general fear that we will run into hard restrictions with the Blazegraph software, which is set up as a SPARQL endpoint for the Wikidata Query Service. In February 2022, there were two video sessions where the community had a chance to give input to a possible alternative/migration and tell, for instance, what SPARQL query features are important.

In Scholia, we are using a range of SPARQL features. We had no overview of which features we use, but of the specialized Wikidata Query Service/Blazegraph features, I remember that we are using the labeling service, the GAS service and the mwapi service. As most of our SPARQL code uses capital letters for keywords and functions and lowercase for variables, we can get a quick and dirty overview of the keywords and functions we are using with

git clone git@github.com:WDscholia/scholia.git
cd scholia/scholia/app/templates/
cat *.sparql | python -c 'import re; print("\n".join(re.findall("[A-Z_]{2,}", open(0).read())))' | sort | uniq -c | sort -nr

These three command-lines give

   1179 AS
    688 SELECT
    650 WHERE
    556 BY
    407 BIND
    313 INCLUDE
    296 WITH
    293 ORDER
    264 DESC
    263 GROUP
    255 SERVICE
    226 PREFIX
    226 OPTIONAL
    214 UNION
    214 COUNT
    195 DISTINCT
    190 FILTER
    147 LIMIT
    145 AUTO_LANGUAGE
    100 SAMPLE
     82 CONCAT
     79 STR
     73 VALUES
     71 GROUP_CONCAT
     65 LANG
     41 SUBSTR
     40 YEAR
     32 COALESCE
     30 MIN
     30 ID
     28 EXISTS
     27 IF
     26 ENCODE_FOR_URI
     26 CHEMICALS
     24 REPLACE
     23 NOT
     22 MAX
     17 SUM
     14 RESULTS
     14 ORCID
     13 IRI
     13 IK
     13 ASC
     12 CAS
     10 URL
     10 JOURNAL
     10 _CID
      8 INTENTION
      7 LEGOLAS
      6 NOW
      6 HAVING
      6 CITEDARTICLE
      6 BFS
      5 PCID
      5 MINUS
      5 CASID
      5 BD
      4 URI
      4 ROUND
      4 OR
      4 DOI
      3 STRSTARTS
      3 ISSN
      3 _ID
      3 FFFFFF
      3 BC
      2 UNRESOLVED
      2 TO
      2 PC
      2 MONTH
      2 MOLS
      2 LCASE
      2 DAY
      2 _CIDU
      2 CASU
      2 BB
      2 ASK
      2 AP
      2 ALLOTROPES
      1 TODO
      1 STRLEN
      1 ISBN
      1 ID_T
      1 GRID
      1 GEPRIS
      1 FF
      1 END
      1 EFFBD
      1 EEEEEE
      1 DDDDDD
      1 CORDIS
      1 BLANK
      1 BFI
      1 ABS

Here CHEMICALS and CITEDARTICLE must be variables, while, e.g., DDDDDD is a color specification. We are using the Blazegraph-specific WITH keyword a lot. This is usually for efficiency. Currently, we have few ASK and no CONSTRUCT.
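
The separation of real keywords from variables and color codes could be partly automated by checking the uppercase tokens against a list of known keywords. The sketch below uses a hand-picked and incomplete set of SPARQL 1.1 keywords and functions plus the Blazegraph named-subquery keywords WITH and INCLUDE. Both the set and the function name keyword_counts are my own choices for the illustration.

```python
import re
from collections import Counter

# Hand-picked, incomplete set of SPARQL 1.1 keywords and functions,
# plus WITH and INCLUDE as used for Blazegraph named subqueries.
SPARQL_KEYWORDS = {
    "SELECT", "WHERE", "AS", "BY", "BIND", "ORDER", "DESC", "ASC", "GROUP",
    "SERVICE", "PREFIX", "OPTIONAL", "UNION", "COUNT", "DISTINCT", "FILTER",
    "LIMIT", "SAMPLE", "CONCAT", "STR", "VALUES", "GROUP_CONCAT", "LANG",
    "SUBSTR", "YEAR", "MONTH", "DAY", "COALESCE", "MIN", "MAX", "SUM",
    "EXISTS", "NOT", "IF", "ENCODE_FOR_URI", "REPLACE", "IRI", "URI", "NOW",
    "HAVING", "MINUS", "ROUND", "STRSTARTS", "LCASE", "ASK", "STRLEN", "ABS",
    "CONSTRUCT", "WITH", "INCLUDE",
}


def keyword_counts(sparql_text):
    """Split uppercase tokens into known keywords and everything else."""
    counts = Counter(re.findall(r"[A-Z_]{2,}", sparql_text))
    known = Counter({t: n for t, n in counts.items() if t in SPARQL_KEYWORDS})
    other = Counter({t: n for t, n in counts.items() if t not in SPARQL_KEYWORDS})
    return known, other


text = 'SELECT ?work (COUNT(?cite) AS ?count) WITH { SELECT ?CHEMICALS WHERE { } } AS %results # "DDDDDD"'
known, other = keyword_counts(text)
print(known.most_common())
print(other.most_common())
```

Tokens that end up in the second counter are candidates for variables, identifiers or color codes, but they still need a manual look.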

University course emails per year

I have previously written about university course emails. The above figure shows the development of the number of received emails saved in my ‘teaching’ folder together with the number of received emails saved in my ‘teaching assistants’ folder. The projected number of emails for the year 2022 may be too large because of an unusually large number of emails in January 2022 from the students. As previously noted, the counts do not include the automated emails I receive from our question-answering site. I usually delete such emails.

There might be around 220 working days in Denmark, making the average number of emails per day around 10 or less. One would think that handling 10 emails would not amount to more than an hour of work based on the guesstimate in my previous blogpost, though, as noted, the long tail of the distribution of the handling time may make the estimate quite uncertain.

Verbs in Danish Dependency Treebank

The Danish Dependency Treebank (DDT) v. 1.0 was made by Matthias Trautner Kromann (Scholia) at Copenhagen Business School from 2002 to 2004 according to its README file. It is based on texts from the PAROLE corpus collected by Ole Norling-Christensen (Scholia) from 1983 to 1992. DDT’s XML files are available with the data.

Quick and dirty grepping in DDT on one of the XML files with

grep --text 'msd="V' ddt-1.0.tag | wc

reports 15,597 verbs in the dataset.

A bit more counting with

grep --text 'msd="V' ddt-1.0.tag | python3 -c "print('\n'.join(line.split('\"')[1] for line in open(0, encoding='iso-8859-1').readlines()))" | sort | uniq -c | sort -nr | wc

shows 1,551 unique verb lemmas. With the words written to a file with

… | uniq | sort > ddt-verb-lemmas.txt

a sample of these are:

åbne, accelerere, acceptere, administrere, adskille, advare, ændre, ærgre, ætse, afblæse, afbryde, afdække, affærdige, affinde

How many of these lemmas are in Wikidata, which currently has over 2,900 Danish verbs?

import requests

ddt_verb_lemmas = set(open('ddt-verb-lemmas.txt').read().split())

# Danish (Q9035) lexemes whose lexical category is verb (Q24905)
query = """
SELECT ?lemma {
  ?lexeme dct:language wd:Q9035 ;
          wikibase:lemma ?lemma ;
          wikibase:lexicalCategory wd:Q24905 .
}
"""
url = "https://query.wikidata.org/sparql"
params = {'query': query, 'format': 'json'}
response = requests.get(url, params=params)
bindings = response.json()['results']['bindings']
wikidata_verb_lemmas = {binding['lemma']['value'] for binding in bindings}

len(ddt_verb_lemmas - wikidata_verb_lemmas)

gives 269 missing verbs in Wikidata:

affærdige, barrikadere, bedyre, bekomme, belejre, belægge, belæsse, bemale, bemyndige, berettige, berømme, bestjæle, bestorme, betræde, beære, bilægge, bivåne, blusse, blødgøre, brodere, budgettere, bugne, defilere, demoralisere, deportere, destillere, destruere, detronisere, diktere, doble, drapere, dvæle, ekskludere, ekstraindkalde, fabrikslukke, fartbegrænse, fastansætte, fejl-operere, fetere, filosofere, fin-sortere, flade, flagre, flakke, flakse, flamme, flintre, flirte, focusere, foragte, forankre, foranledige, forbeholde, fordufte, forføre, forkalke, forlige, forlise, forlove, formilde, forpligtige, forrykke, forskyde, fortrække, fortvivle, forudsige, fradømme, fralægge, frankere, fraråde, fraskrive, frasortere, fratræde, fremholde, frikende, friske, frustrere, fuldføre, fusionere, fyge, fænge, geare, gennembanke, gennemopleve, gennemprøve, genoptrykke, gestikulere, gispe, gjalde, glatte, gløde, gravere, grunde, gyse, hage, havne, hefte, hegle, henholde, henkaste, hentyde, henvide, humpe, huse, hverve, hvine, hvisle, hyre, hæge, hærde, iblande, illudere, improvisere, indgribe, indhylle, indhøste, indkassere, indkvartere, indoktrinere, indordne, indpakke, indskyde, indsmugle, indstemme, indstifte, indvie, indvælge, jævnføre, kanonere, kante, kikke, kime, kimse, knirke, knuge, kolportere, kommandere, kompromittere, konkretisere, kriminalisere, krone, kue, kuldsejle, kvie, legalisere, lirke, lue, lædere, lække, læsetræne, læspe, løje, løsne, løsrive, medfølge, mestre, mishandle, modsætte, mæske, mønstre, mørkelægge, narre, neddæmpe, nedsable, nedsænke, nedværdige, nidstirre, niveaudele, nyinvestere, nødlande, oparbejde, opdigte, opfange, opfølge, ophidse, oplagre, oplive, opstøve, opsøge, optræne, opøve, overbeglo, overfylde, overhælde, overrende, overrumple, overskrue, overvurdere, passivisere, pervertere, plette, plombere, praktisere, prissætte, profitere, programmere, pryde, præferere, prøvekøre, puffe, pulverisere, påbyde, påklage, pønse, rafle, rappe, 
rasere, ratificere, rivalisere, rumstere, røbe, røge, sagsøge, sammenstille, sammenstykke, sammentømre, sample, samstemme, scanne, ses, signere, simre, skeje, skippe, sladre, slentre, slynge, sløve, smashe, smugle, snage, spekulere, spraye, stakke, stationere, sukke, surmule, svimle, sønderlemme, tackle, tangere, taxie, tilbede, tildanne, tilsende, tilstå, time, tippe, tjekkere, toppe, trisse, trone, trygle, tv-annoncere, ty, tøffe, udfritte, udglatte, udlære, udmærke, udskære, udstyre, udtørre, ulovliggøre, vandkæmme, virkeliggøre, våge, vænne, værdsætte

Some of these are homonyms with other words: flade, flamme, friske, glatte, grunde, hage, havne, huse, krone, mestre, mønstre, time, toppe.

Three have alternative lemma spellings: indvi/indvie, kigge/kikke, tackle/takle. These are already in Wikidata, so actually only 266 verbs are missing in Wikidata.

One is an independent deponent verb: ses.

There are at least two errors in DDT: tjekkere, henvide. They occur in the sentences “Der er også radikale tjekkere, som må ændring holdning.” and “Det så vældig godt ud, der var også udleveret henved 100 fakler.” So 264 missing verbs in Wikidata.
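
The manual corrections (269 − 3 − 2 = 264) can be written down as set operations, so they are applied again if the comparison is rerun. The sketch below is an illustration with toy sets. The function name truly_missing is hypothetical, and the direction of the spelling variants (DDT spelling mapped to the spelling assumed to be in Wikidata) is my reading of the pairs above.

```python
def truly_missing(corpus_lemmas, wikidata_lemmas, variants, errors):
    """Return corpus lemmas missing in Wikidata after manual corrections.

    variants maps a corpus spelling to an alternative spelling; the lemma is
    not counted as missing if the alternative is in Wikidata.
    errors is a collection of lemmas known to be annotation errors.
    """
    wikidata_lemmas = set(wikidata_lemmas)
    missing = set(corpus_lemmas) - wikidata_lemmas
    covered_by_variant = {
        lemma for lemma in missing if variants.get(lemma) in wikidata_lemmas
    }
    return missing - covered_by_variant - set(errors)


# Assumed direction: DDT spelling -> spelling found in Wikidata.
variants = {"indvie": "indvi", "kikke": "kigge", "tackle": "takle"}
errors = {"tjekkere", "henvide"}

# Toy sets for illustration, not the full data.
ddt = {"indvie", "kikke", "tackle", "tjekkere", "henvide", "bedyre", "påvirke"}
wikidata = {"indvi", "kigge", "takle", "påvirke"}
print(truly_missing(ddt, wikidata, variants, errors))
```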

Unusual Danish words

ae (to caress): A word consisting of only two vowels.

bagagebærer (bicycle luggage carrier): Compound of bagage+bærer. The part of the bike that can carry goods. The second term, when viewed in isolation, is a (human) nomen agentis, but in the compound it is non-human.

bringe (the verb to bring): Danish conjugation adds suffixes or, for some verbs, changes a vowel. For this word, a vowel is changed and a consonant also disappears in some conjugated forms (bringe, bringer, bragte, bragt).

bærebar (portable): compound adjective with two parts from the same Indo-European root.

fise (to fart): verb with three different conjugation types. For instance, active voice preterite may be fiste, fisede or fes.

fyr (guy, pine tree, heating unit/light tower): three different noun lexemes. There is also a verb with a conjugation that results in “fyr”.

led: a representation associated with many lexemes. Ordia shows 9 different representations from six different lexemes: mean, suffered, suffer, search, joint, joints, side, gate, gates.

hænge (hang): Two types of conjugation that determine whether the verb is transitive or intransitive.

højttaler/højtaler (loudspeaker): One of the few Danish lexemes with multiple lemmas: an extra “t”.

institutionalisere (institutionalize): Perhaps the longest Danish word with one root. Though the Latin origin is from in- and statuo according to English Wiktionary. Other words of this kind are rekontekstualisere and repræsentantskab.

menneskerettighedsorganisation (human rights organization): The longest Danish word in common usage. It appears in Den Danske Ordbog.

ondskabsfuldhed (malice): long Danish word from one root and three derivations: adjective to noun to adjective to noun.

rødhals (Erithacus rubecula): Compound of rød and hals, red neck. It is not a hyponym of neck, but a particular bird species. Many Danish plant and animal names overrule the pattern that says that the first compound part specifies a type of the second compound.

sandkassesand (sandbox sand): Compound noun with the same noun twice (sand+kasse)+sand. This is a common product, see, e.g., certified “sandkassesand” from Coop.

service (tableware, service): A loanword that has come in twice, once from French, once from English. They have different grammatical gender. The French loanword is neutrum, while the English is common.

sigte (aim, charge, sieve): A lemma with three different verb lexemes. There are furthermore two related noun lexemes (aim, sieve).

skrivelse (a writing, note): A verbal noun that is “innexual”. Usually Danish verbal nouns denote an activity. Skrivelse is not the only one of this kind.

værdipapirfinansieringstransaktionseksponering (securities financing transaction exposures): Danish word identified as the longest word yet found in authoritative text. It appears in European Union legislation: “1.9.3. Endvidere bør de nuværende regler i kapitalkravsforordningen til fastsættelse af løbetidsparametret udstrækkes til at omfatte derivat- og værdipapirfinansieringstransaktionseksponeringer samt transaktioner uden fast løbetid.”

øl (beer): Can have two genders (common and neutrum) and there is a slight semantic difference between the two corresponding to the English one beer and some beer.

Danish verbs in Wikidata and DanNet

Besides finding Danish verbs by looking through Danish corpora, see, e.g., Finding Danish verbs in Europarl with Python, you can also find Danish verbs from lexical resources.

Danish verbs are available in DanNet, the Danish wordnet. These can be extracted from the RDF/XML files. Here with a Python script relying on rdflib and a SPARQL query:

from rdflib import Graph

filenames = "words part_of_speech".split()

g = Graph()
for filename in filenames:
    g.parse(filename + '.rdf')

query = """
SELECT ?word ?representation {
  ?word dn_schema:partOfSpeech "Verb" .
  ?word wn20schema:lexicalForm ?representation .
}""" 

result = g.query(
    query,
    initNs={
        'dn': 'http://www.wordnet.dk/owl/instance/2009/03/instances/',
        'dn_schema': 'http://www.wordnet.dk/owl/instance/2009/03/schema/',
        'wn20schema': 'http://www.w3.org/2006/03/wn/wn20/schema/',
    })

n = 0
for row in sorted(result, key=lambda row: row[1]):
    if ' ' not in row[1]:
        n += 1
        print("{:11} {}".format(str(row[0])[58:], row[1]))

print("Total number of single-word verbs: {}".format(n))

The full list returns 5581 entries. A number of these verbs include the reflexive “sig” (e.g., “forvente sig”, “foræde sig”) or adverbs (e.g., “abe efter”, “arbejde over”). Excluding these (' ' not in row[1]) one ends up with 3994 single words. Among those words are a number I do not recall having run into before: pizzikere, afhjelme, indsukre, …
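
The split into single-word verbs, reflexive verbs and other multi-word verbs can be sketched without rdflib, given the list of representations. The function name partition_verbs and the rule that a reflexive verb is one with “sig” after the first word are assumptions for the illustration. The examples are the ones mentioned above.

```python
def partition_verbs(representations):
    """Split verb representations into single-word, reflexive and other."""
    single, reflexive, other_multiword = [], [], []
    for representation in representations:
        words = representation.split()
        if len(words) == 1:
            single.append(representation)
        elif "sig" in words[1:]:
            # Assumed rule: "sig" after the verb marks a reflexive verb
            reflexive.append(representation)
        else:
            other_multiword.append(representation)
    return single, reflexive, other_multiword


examples = ["forvente sig", "foræde sig", "abe efter", "arbejde over",
            "pizzikere", "afhjelme", "indsukre"]
single, reflexive, other_multiword = partition_verbs(examples)
print(len(single), len(reflexive), len(other_multiword))
```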

In Wikidata, the DanNet word identifier may be entered using the P6140 property.

Not all Danish verbs are in DanNet. In the beginning of January 2022, there were 2862 Danish verbs in Wikidata. This number ranks Danish as number 11 in terms of number of verbs entered, see Ordia’s statistics where Indonesian is ahead with 12,781 verbs followed by the 7,932 of Estonian. In Wikidata, I have marked verbs not available in DanNet with Wikidata’s novalue, so they can be queried with the Wikidata Query Service and this SPARQL query

SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 ;
          a wdno:P6140 .
}

In January 2022, there were around 580 Danish verbs in Wikidata that were marked as not being in DanNet, e.g., dibse, indebære, erfare, misforstå and opgradere. Such words are fairly common, not like DanNet’s “afhjelme”, – a word that can cause confusion according to a text at the Danish parliament website.

Not all of Wikidata’s Danish verbs indicate whether the word is in DanNet or not. The Wikidata Query Service can tell us about these words:

SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 .
  FILTER NOT EXISTS { ?lexeme p:P6140 [] }
}

As of 6 January 2022, there were 387 Danish verbs where the DanNet status was not indicated. These instances may be because I forgot to type it in, made a typo or found the case difficult. For “difficult”, consider the word “sigte”. It has three DanNet verb homograph lexemes (aim, charge, sieve). The Wikidata lexeme L207570 had mixed these lexemes through links to the lexemes tilsigte and sigtelse, – as far as I can tell.