Some problems with Danish and Wikidata lexemes


Ordia-professortiltrædelsesforelæsning

Is it at all possible to describe natural languages in a structured way? One continuously discovers special cases and oddities of the Danish language when entering Danish lexemes in Wikidata.

  1. What do we do with øl (beer)? It can be both a neuter (neutrum) and a common gender (utrum) word, and the semantics of the two versions differ. In Wikidata they can both appear under the same lexeme, but it is less clear how one then keeps track of which form is associated with which sense. Qualifiers may help. There is, however, to my knowledge no property that can currently be used for this.
  2. Is hænge (hang) one or two lexemes? There is a transitive and an intransitive version with a slight semantic difference. Den Danske Ordbog (DDS) has only one entry for hænge and then spends some words explaining the form-sense complexity. Wikidata currently has L45348 and L45349.
  3. Is the “an” in “ankomme”, “ankomst”, “anslå”, “anvende”, etc. a dedicated prefix, or should it be regarded as the adverb “an” attached to a verb or a noun? The problem with regarding “an” as a prefix is that many other words that attach to the front of komst are adverbs: “bort”, “frem”, “ind”, “op”, “sammen”, “til”, and these units do not look prefix-ish to me.
  4. It is sometimes not clear which lexeme a part of a compound should be linked to. For instance, tilskadekomst (injury) can be decomposed into “til”, “skade” and “komst”. The “til” could be regarded as a preposition or an adverb. For indeklima (indoor climate), inde could be the adverb inde or the adverb ind plus an -e- interfix.
  5. Should briller (glasses) and bukser (trousers) be plurale tantum? In common language briller and bukser are plurale tantum, but among professional salespeople you find the singular versions brille and buks. How would you indicate that? Note that compounds with these words may use the singular versions, e.g., bukseopslag and brilleetui.
  6. For some nouns, the singular forms may be so prevalent and the plural forms so rare that it is questionable whether the word is singulare tantum or a countable noun, e.g., tillid (trust) can be found in rare instances as tillider, but does that make it a countable noun?
  7. What kind of word is komst? Is it a suffix? Then what about the word genkomst? It has the prefix “gen-” and the suffix komst, so where is the root? Maybe it is better to regard it as part of a tyttebærord, where a word once recognized as an independent word has “lost its independence”. Komst has an entry in the old Danish dictionary, but not in newer Danish dictionaries.
  8. Following Grammatik over det Danske Sprog (GDS), some nouns have been added as “nexual nouns” or “innexual nouns”. The precise delineation of these classes is not clear, e.g., where would agent words such as woman, carpenter and cat be placed? They are not nexual, as far as I can see, but does that make them innexual? There is a range of Danish words where I am unsure: landskab (landscape), musikrum (music room), etc. So far I have left such words unannotated.
  9. Where do beslutte and similar words derive from? According to Den Danske Ordbog (DDS), it derives from Middle Low German “besluten”, but it could also be regarded as derived from the prefix “be-” and the verb “slutte”. It is only partially possible to represent both derivation paths in Wikidata.
  10. Wikidata has the “lexical category” field. For an affix it is not clear what the category should be. It could be affix, suffix/prefix, or perhaps something else?
  11. A particular class of words at the intersection of nouns and verbs is what has been termed centaurs. They might be inflections of verbs or they might be derivations from verbs to nouns. Examples are råben (shouting as a noun), løben (running) and skrigen (screaming). They do not seem to have any inflections themselves, so should they be regarded as just an inflection of a verb and entered as a form under the corresponding verbal lexeme, e.g., råbe? On the other hand, DDS has råben as an independent entry and I have also added råben as an independent lexeme in Wikidata. In Wikidata, this enables a more straightforward link to the synonym råberi.
  12. Which lexical category should we link compounds to? Some compounds may be analyzed as arising from a noun or a verb (or possibly other lexical categories), e.g., springvand has the parts spring and vand. It is not – at least to me – clear whether spring should be linked to the noun spring or to the root form of the verb springe.
  13. Should the s-genitive form of Danish nouns be recorded under forms? The naïve approach is to add the s-genitive forms, doubling the number of Danish noun forms. Modern linguists seem to think (if I understand them correctly) that the appended “s” is enclitic and the s-genitive not a form, much like in English, where the ’s genitive is not recorded as a form. For English, the apostrophe separates the s from the rest of the word, so there it is natural not to include the genitive form.
  14. Hjem, hjemme and hjemad are three words and possibly one, two or three lexemes. If they are three lexemes, then how can we link them?
  15. When is a noun derived from a verb and not the other way around? It is particularly a question for (possible) root derivations, where the noun is shorter than the verb. For the noun bijob and the verb bijobbe it seems that the noun forms the basis for the derivation to the verb.
  16. Genforhandling (renegotiation) can (in my opinion) be derived along at least two paths: gen+forhandling (re+negotiation) and genforhandl+ing (renegotiate+ion). The “derived from” property can contain both, but the “combines” property is not suitable for this case.
  17. Professortiltrædelsesforelæsning is another word where I am uncertain how to best decompose the word: professor+tiltrædelsesforelæsning or professortiltrædelse+s+forelæsning?
  18. What sort of word is politiker? A nomen agentis is (usually) derived from a verb, and with the analysis of the word into politik+er, politiker is then not a nomen agentis. But Den Store Danske notes that a nomen agentis can be derived from other “nomens”, e.g., skuespiller (actor) from skuespil (acting or play). So is it OK to regard politiker as a nomen agentis?
  19. Some words for things that might appear as a collective are a singular concept in Wikidata lexemes but a collective in Wikidata’s Q-items, e.g., dansker (Dane) is danskere (Danes) in Wikidata’s Q-items. Connecting the entities via P5137 is a bit of a stretch. The same may be said to be an issue for animals and species.
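
The ambiguity in items 16 and 17 can be made concrete with a small Python sketch. It enumerates every way a word can be split into parts from a given inventory. The inventory `PARTS` and the function name `segmentations` are made up for this illustration; nothing here is fetched from Wikidata, and the parts are simply the ones discussed above plus the linking “s”.

```python
from typing import Iterator, List, Set

# Hypothetical inventory of parts, hand-picked for this illustration.
# In practice each part would correspond to a lexeme in Wikidata.
PARTS: Set[str] = {
    "professor", "tiltrædelse", "forelæsning",
    "professortiltrædelse", "tiltrædelsesforelæsning",
    "s",  # linking element (interfix)
    "gen", "forhandling", "genforhandl", "ing",
}


def segmentations(word: str, parts: Set[str]) -> Iterator[List[str]]:
    """Yield every way to split `word` into a sequence of known parts."""
    if not word:
        # Nothing left to split: one valid (empty) continuation.
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in parts:
            # Try to split the remainder in every possible way.
            for rest in segmentations(word[end:], parts):
                yield [head] + rest


if __name__ == "__main__":
    for word in ("genforhandling", "professortiltrædelsesforelæsning"):
        for split in segmentations(word, PARTS):
            print(word, "=", "+".join(split))
```

With this inventory, genforhandling gets two splits and professortiltrædelsesforelæsning gets three, including the two mentioned in item 17. A single flat list of parts can record only one of them at a time, which is the difficulty described above.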

Read also my paper Danish in Wikidata lexemes (Scholia) and perhaps also Validating Danish Wikidata lexemes (Scholia).

 

Photo in graph: Reading (7052753377).jpg, Moyan Breen, CC-BY 2.0.
