Month: August 2022

Linking from Danish Wikidata lexemes to COR


As I have previously reported, the Danish word registry, Det Centrale Ordregister (COR), was launched in May 2022.

The words are identified by COR identifiers, mimicking the Danish CPR (identifier for Danish persons) and CVR (identifier for Danish organizations and companies). There are now tentative URLs for each lexeme COR. For instance, the lexeme “bankdirektør” (bank manager) has the tentative URL https://ordregister.dk/id/COR.58789/.
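
The URL pattern is simple enough to sketch in code. The helper below is hypothetical (the name `cor_url` and the identifier check are my own assumptions), and it assumes the tentative scheme shown in the example above stays as it is, which may well not hold since the URLs are tentative.

```python
import re

# Tentative base URL taken from the "bankdirektør" example; it may change.
COR_BASE_URL = "https://ordregister.dk/id/"

# Assumption: a lexeme-level identifier is "COR." followed by digits,
# as in COR.58789 or COR.00267.
COR_ID_PATTERN = re.compile(r"COR\.\d+")


def cor_url(cor_id: str) -> str:
    """Return the tentative URL for a lexeme-level COR identifier."""
    if not COR_ID_PATTERN.fullmatch(cor_id):
        raise ValueError(f"Unexpected COR identifier: {cor_id!r}")
    return f"{COR_BASE_URL}{cor_id}/"


print(cor_url("COR.58789"))  # https://ordregister.dk/id/COR.58789/
```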

In Wikidata, I suggested two properties for the COR identifiers: one for the lexemes and one for the forms. These two properties have now been accepted and are available as P10830 (for a form) and P10831 (for a lexeme). Statistics in Ordia from 20 June 2022 showed 44 form CORs and 21 lexeme CORs represented in Wikidata. There are now several hundred. These have been entered by me manually. Version 0.9 of COR has 516,017 form CORs, so entry of the data should be automated if we want to reach good coverage. So far, the data entry has mostly served to determine which problems one would run into in the mapping between Wikidata lexemes and COR. Quite a lot of thought needs to go into the “ontology alignment” between COR and Wikidata. Based on the part-of-speech tags, here are some notes:

  1. Article (art): There is currently only one listed in COR: “den” COR.00267. This singleton seems to be an error or an oddity that I do not understand. We have “en”, “et”, “den”, “det” and “de”. In Wikidata we currently have three Danish articles: en/et, “den” and “det”. The en/et aggregation follows Den Danske Ordbog, which aggregates “en” and “et”. Den Danske Ordbog has “den”, “det” and “de” as articles, but as separate lexemes. So it seems that we in Wikidata have (partially) followed the inconsistency in Den Danske Ordbog by merging “en” and “et” (indefinite articles) and splitting “det” and “den” (the definite articles). Could it be argued that all should be one lexeme? Retskrivningsordbogen has one lexeme for den/det/de and one lexeme for en/et. Here there seems to be a reasonable consistency. I suspect we will see an update of COR to mirror Retskrivningsordbogen.
  2. Infinitive marker (“infinitivens”, infinitivens partikel, infinitivens mærke): There is only one: the word “at”. There is a homograph, the conjunction “at”. There is a Wikidata item for the part-of-speech concept: infinitive marker Q85103750. The Wikidata lexeme for the “at” conjunction has been there for a while as L34817 and is linked to COR.00145. There is now also a Wikidata lexeme for the “at” infinitive marker, L678570, linked to COR.00292. Thus this part-of-speech class is complete.
  3. Formal subject (fsubj):
    1. COR records two Danish words: “der” and “her”. These words are in Wikidata as L3064 and L45364, respectively, and both with links to COR as COR.00721 and COR.00751, respectively. So this small class is also complete.
  4. Onomatopoeia (lydord): 36 onomatopoeia forms are recorded in COR. Wikidata has 42 Danish onomatopoeia lexemes. All COR onomatopoeia are in Wikidata and linked.
    1. For instance, Wikidata’s “atju”, “vuf” and “kvæk” do not appear in COR.
    2. There is only one form for each lexeme, except for the cat sound “miav” which has the forms “miav” and “mjav” (Ordia).
    3. A problem is to determine what they mean. For instance, what does “sum” mean? Could it correspond to the English “buzz” or humming…!? Nor do I know what kind of sound “bums” is.
  5. Prefix (præfiks): There are 59 prefixes in COR. They are represented with one form each. In Wikidata, prefixes are currently mostly represented as affixes or as morphemes. Some of these are regarded as instances of “prækonfiks”; see, for instance, “øko-”. In Den Danske Ordbog, “øko-” is recorded as a prefix. Prefixes of the type “for-”, “u-” and “be-” are not recorded in COR. Most of the COR prefixes are what has been termed kryptorod/confix or skabsaffiks/pseudoaffix, see, e.g., Substantiviske Komplekse Ord Med Subkonfikser I Moderne Dansk. Though the lexical category currently does not align between COR and Wikidata, it does not seem to matter for the individual linking. Currently, Wikidata does not record a form for prefixes. My reason for that was that the prefixes are not materialized in real words, only through derivations.
  6. Conjunction (konj):
    1. 64 conjunction forms and 62 conjunction lexemes in COR and 66 Danish conjunction lexemes in Wikidata.
    2. Four conjunction forms in COR are from two lexemes: imedens/imens and mens/medens. “Mens” and “medens” were split in Wikidata. “Imedens” and “imens” were not represented as conjunctions in Wikidata. They are all linked now.
    3. “omend” has lemma “om end” and form “omend” in COR. Why I do not know.
    4. The same is the case with “selvom” where the lemma is “selv om”.
    5. “dels” is in COR regarded as an adverb. In Den Danske Ordbog dels is a conjunction.
    6. “plus at” is not in COR.
    7. “hvorimod” is an adverb in COR and in Den Danske Ordbog. In some other works it is regarded as a “subordinating conjunction” or a “concessive conjunction”.
    8. The same is the case with “hvor”. In COR and Den Danske Ordbog it is an adverb. In Quasi-synonymy of Danish temporal conjunctions from the anthropocentric point of view it is referred to as a temporal conjunction. There is already a “hvor” adverb in Wikidata.
    9. How do we fix this? The “hvor” can be merged in Wikidata. For “hvorimod” the lexical category in Wikidata can be changed to adverb and for “dels” Wikidata could somehow note that COR and Den Danske Ordbog disagree.
  7. Prepositions (præp):
    1. 96 preposition forms in COR. They have all been added to Wikidata and linked to COR. So this class is complete.
    2. COR prepositions only have one form.
    3. “ad” is a homograph with two versions, one of them from Latin.
    4. Wikidata currently has “henover” as a preposition. That preposition is not found in COR. Apparently it has not been affected by the so-called 2012 rule.
    5. Bokmål currently has more prepositions (106) than Danish in Wikidata. vis-a-vis is entered as two different lexemes with variation a/à. There are also some words such as østfra, østover, vestfra, … in Bokmål that are not in Danish. In Den Danske Ordbog the corresponding Danish words are “only” adverbs, see, e.g., østover.
  8. Pronouns (pron): 101 pronoun forms in COR.
    1. “som” is present in Den Danske Ordbog but not in COR.
    2. I suspect there are many issues here. I have not yet looked into the lexical category.
  9. Interjections (udråbsord): 147 interjection forms in COR.
  10. Phrases (flerord): 196 phrase forms in COR.
  11. Numerals (talord): 238 numeral forms in COR.
    1. COR numerals come with two forms: normal and genitive. Wikidata had so far not recorded a genitive version of Danish numerals.
  12. “kolon” (kolon): 269 of these forms in COR.
  13. Abbreviations (fork): 559 abbreviation forms in COR.
    1. Abbreviations in COR may be recorded with both upper and lower case versions, e.g., ADHD and adhd.
    2. Abbreviations may have a genitive form. This includes units such as “A” for ampere and “ml.”, which is the abbreviation for mellem (between). Mellem itself does not have a genitive (in English it would correspond to between’s?). This seems strange.
    3. There is usually no explanation for the abbreviations.
  14. Adverb (adv): 904 adverb forms in COR.
    1. “hvorimod” is regarded as an adverb, while other works regard it as a conjunction, see Wikidata references at L42250.
  15. Proper nouns (prop): 1,388 proper noun forms in COR. These are mostly geographical entities.
    1. The proper nouns come with a normal form and a genitive form.
    2. There are some surprising entries: I, L, M and V. Roman numerals I suppose? Why?
    3. Proper nouns can have alternative forms, e.g., Ålborg/Aalborg.
  16. Verb (vb): 79,533 verb forms in COR.
    1. There are passive indefinite verb forms in COR. These have not been entered in Wikidata. They have the same form as the passive finite present form that is already in Wikidata.
    2. In COR, skryde has two past tense forms in active: skrydede and skrød. But in passive there is only skrydedes, not skrødes. And there is no supinum form for the irregular form.
    3. Perfectum participium in its adjective function is listed under the verb lexeme. It is not distinguished from a supinum function.
    4. The common verbs “have” and “være” have plural and definite perfectum participium forms listed: “hafte” and “værede”. They sound strange to me.
  17. Adjectives (adj): 92,900 adjective forms in COR.
    1. Among the forms are some highly unusual ones, e.g., “aproposere” and “aproposeste”. In Retskrivningsordbogen, “apropos” is regarded as an uninflectable adjective. In Den Danske Ordbog it is not even an adjective. Another example is “værd”, which is listed with forms such as “værdere” and “værdest”.
    2. Even though the common gender and the neutrum forms are the same, they are listed as separate. This is currently not done for the Danish adjectives in Wikidata.
    3. A perfectum participium verb form is usually not regarded as an adjective, but sometimes it is. A word such as “betinget” is reported both as a “perf.part” verb form and as a separate adjective. The perf.part verb form has only one form for singular indefinite, while the adjective distinguishes between a neutrum and a common gender form even though they are the same.
    4. Adverbs derived from the adjectives are listed under the adjective lexeme.
  18. Nouns (sb): 339,523 noun forms in COR.
    1. COR comes with genitive forms, which are currently not in Wikidata. The decision to leave them out of Wikidata was based on one user arguing that the Danish genitive is not a real genitive but an enclitic. We should probably change that in Wikidata, so that the genitive forms of nouns are recorded.
    2. Genitive forms in COR are marked as genitive, but non-genitive forms are not marked.
    3. “druk” is an example of a word where it is difficult to know whether it is a common gender or a neutrum word, as no article is used with it. Only adjectives or co-reference might reveal the gender. COR records the form with two different identifiers: one for the common gender and one for the neutrum.
    4. “kirsebær” (cherry) is recorded as two different lexemes: one for common gender and the other for neutrum. They distinguish between the tree and the berry. In Den Danske Ordbog it is one lexeme and the difference is explained.
    5. Many kentaur nouns (words such as råben, skrigen, løben, …) are not recorded in COR, neither as separate nouns nor as conjugations of a verb.
    6. The few kentaur nouns that are recorded come with a genitive form. This is odd. “Grammatik over det danske sprog” states that they have no genitive form.
    7. The noun “A”/“a” has two different lexemes: one for the uppercase and one for the lowercase version. This is the same for all letters. I do not see why uppercase and lowercase letters should be split across lexemes.
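
As a sanity check, the form counts quoted in the notes above can be tallied against the 516,017 form CORs reported for version 0.9. This is only a sketch: the dictionary keys are my own labels, and for the three smallest classes I assume one form per word (1 article, 1 infinitive marker, 2 formal subjects), since the notes give words rather than form counts there.

```python
# Form counts per part of speech as quoted in the notes above.
# Assumption: article, infinitive marker and formal subject have
# one form per listed word (1, 1 and 2 forms, respectively).
cor_form_counts = {
    "article": 1,
    "infinitive marker": 1,
    "formal subject": 2,
    "onomatopoeia": 36,
    "prefix": 59,
    "conjunction": 64,
    "preposition": 96,
    "pronoun": 101,
    "interjection": 147,
    "phrase": 196,
    "numeral": 238,
    "kolon": 269,
    "abbreviation": 559,
    "adverb": 904,
    "proper noun": 1_388,
    "verb": 79_533,
    "adjective": 92_900,
    "noun": 339_523,
}

REPORTED_TOTAL = 516_017  # form CORs in COR version 0.9

total = sum(cor_form_counts.values())
print(total, total == REPORTED_TOTAL)

# Share of all forms held by the three largest classes.
largest = sorted(cor_form_counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
for pos, count in largest:
    print(f"{pos}: {count / total:.1%}")
```

Under these assumptions the tally adds up to exactly 516,017, and nouns, adjectives and verbs together account for more than 99% of the forms, which is where any automated entry would have to do most of its work.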

Other problems:

How to represent alternative forms, e.g., mørklægge/mørkelægge or højtaler/højttaler? In Wikidata, they are recorded as separate forms and linked individually to their corresponding COR identifier. The “alternative form” Wikidata property is used to link the two spelling variations.
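
One way to inspect how such spelling variants are linked is to list the forms of a lexeme together with their COR form identifiers. The sketch below only builds a SPARQL query and a request URL for the Wikidata Query Service; it does not send anything. The function names are hypothetical, and I assume the usual lexeme modelling in the query service (`dct:language`, `wikibase:lemma`, `ontolex:lexicalForm`, `ontolex:representation`), Q9035 for Danish, and P10830 as the form-level COR property mentioned above.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def cor_forms_query(lemma: str) -> str:
    """Build a query listing forms of Danish lexemes with the given lemma
    and the COR form identifier (P10830) attached to each form."""
    # Minimal escaping so the lemma is a valid SPARQL string literal.
    escaped = lemma.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?lexeme ?form ?representation ?cor WHERE {{
  ?lexeme dct:language wd:Q9035 ;
          wikibase:lemma "{escaped}"@da ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?representation ;
        wdt:P10830 ?cor .
}}
"""


def query_url(query: str) -> str:
    """Return a GET URL asking the query service for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


query = cor_forms_query("mørklægge")
print(query)
print(query_url(query))
```

If both spellings sit as forms under the same lexeme, each should show up as its own row with its own COR identifier; forms without a P10830 statement are left out by this query.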