Month: September 2022
2022 September status on Danish lexemes in Wikidata
Some statistics:
- 15,047 Danish lexemes in Wikidata compared to 689,114 across all languages. Danish is the 15th largest language by number of lexemes; Russian has over 100,000.
- Links to Danish external identifiers:
  - 7,820 Wikidata lexemes linked to words in DanNet and 4,182 indicated as not in DanNet.
  - 1,140 COR lemmas and 213 lexemes indicated as not in COR.
  - 1,103 COR forms and 69 forms indicated as not in COR.
  - 774 Oqaasileriffik online dictionary IDs (either Danish, Greenlandic or English)
- 66,618 Danish forms. Danish ranks 18th by number of forms; Estonian has almost 3 million.
- 7,590 Danish senses. Danish ranks 11th by number of senses; Basque has over 30,000.
- 11,960 hyphenation specifications.
  - 2,867 unique hyphenation parts, including ne, de, te, le, se, ge, en, ning, re and be.
- Lexical categories for Danish lexemes:
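The hyphenation figures in the list above could be reproduced roughly as sketched here. This is only an illustration under stated assumptions, not necessarily how the numbers in this post were computed: it assumes hyphenation is stored on forms with the Wikidata property P5279, that Danish is Q9035, and that the parts are separated by the hyphenation point character (U+2027). The function names, the User-Agent string and the sample strings are made up for the example.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed query: hyphenation values (P5279) on forms of Danish (Q9035) lexemes.
HYPHENATION_QUERY = """
SELECT ?form ?hyphenation WHERE {
  ?lexeme dct:language wd:Q9035 ;
          ontolex:lexicalForm ?form .
  ?form wdt:P5279 ?hyphenation .
}
"""


def fetch_hyphenations(query=HYPHENATION_QUERY):
    """Fetch hyphenation strings from the Wikidata Query Service (needs network)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "hyphenation-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["hyphenation"]["value"] for row in data["results"]["bindings"]]


def count_parts(hyphenations, separator="\u2027"):
    """Count how often each hyphenation part occurs across all strings."""
    counts = Counter()
    for hyphenation in hyphenations:
        counts.update(
            part.lower() for part in hyphenation.split(separator) if part
        )
    return counts


# Made-up sample strings for illustration only (not checked against Wikidata).
samples = ["bi\u2027ler\u2027ne", "hu\u2027se\u2027ne"]
sample_counts = count_parts(samples)
```

With live data, `len(count_parts(fetch_hyphenations()))` would give the number of unique parts and `.most_common(10)` the most frequent ones.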
Interesting individual lexemes:
- rød, a word with many compounds specified.
- værdipapirfinansieringstransaktionseksponering, the longest Danish lexeme, attested in Den Europæiske Centralbanks udtalelse af 8. november 2017 om ændringer til EU’s ramme for kapitalkrav til kreditinstitutter og investeringsselskaber.
- koronavirus, a lexeme with many different alternative forms, due to two variations: korona/corona and virusser/virus/vira.
- led (representation), a word with many homographs.
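To see why the two variations in koronavirus multiply the number of forms, here is a small sketch that combines them. It assumes the two variations combine freely, which the list above does not state; the variable names are only for illustration.

```python
from itertools import product

# Spelling variants named in the text; that they combine freely is an assumption.
first_parts = ["korona", "corona"]
plural_heads = ["virusser", "virus", "vira"]

# Every combination of first part and plural head: 2 x 3 candidate spellings.
plural_variants = ["".join(pair) for pair in product(first_parts, plural_heads)]
```

Under that assumption a single grammatical slot already has six candidate spellings.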
Images from Wikimedia Commons. Licenses available from links at https://ordia.toolforge.org/L2310.
- Transitive/intransitive/monotransitive/bitransitive verbs
- “Nexual”/“innexual” nouns
- Countable/mass noun
- Compound words
- “Kentaurnominal”
- Color term
- Agent noun
- …
Sources for descriptions and usage examples, for instance:
- Europarl
- Dansk-russisk ordbog
- Fremmedordbog
- Danish: a comprehensive grammar
- Dansk Grammatik
- Elementær Dansk Grammatik
- Grammatik over det Danske Sprog
- Substantiviske Komplekse Ord Med Subkonfikser I Moderne Dansk
- Solenergianlæg ved Bjerndrup
Alignment with COR
I am working on examining the alignment between the Danish lexemes in Wikidata and COR. Some of the issues:
- Missing genitive on nouns and numerals in Wikidata.
- øl (øllen/øllet), plan (planen/planet): COR treats these as separate lemmas, whereas Den Danske Ordbog and, currently, Wikidata have a single lexeme.
- Adverbs derived from adjectives: COR lists them under adjectives.
- Adjectives derived from verbs (past participles, perfektum participium), e.g., ubekymret (in COR as an adjective) and bekymret (not in COR as an adjective).
- Various inflection schemes for adjectives, e.g., -sk adjectives do not inflect for grammatical number.
See also the earlier posts “Linking from Danish Wikidata lexemes to COR” and “Part-of-speech tags in Det Centrale Ordregister”.
Ordia
Ordia is a web application for Wikidata lexemes.
Tools in the application:
- Lexeme linking with “text to lexemes”
- Language guess with “text to languages”
The tool apparently gets around 200 pageviews per day: