Month: June 2022
Coverage of Det Centrale Ordregister for technical reports
How well does Det Centrale Ordregister (COR), the Danish national word register, cover the words in a corpus of technical reports? Words with the stem “påvirk” are of interest in our DREAMS project, where we process Danish environmental impact assessment reports: “påvirk” is the stem corresponding to the English word “impact”.
“påvirk” words from the COR database can be extracted with
grep "påvirk" ro2021-0.9.cor | awk -F'\t' '{print $1, $5}'
One finds 86 words (forms) matching “påvirk”, with examples:
1 COR.56543.110.01 g-påvirkning
2 COR.56543.111.01 g-påvirkningen
3 COR.56543.112.01 g-påvirkninger
4 COR.56543.113.01 g-påvirkningerne
5 COR.56543.114.01 g-påvirknings
6 COR.56543.115.01 g-påvirkningens
7 COR.56543.116.01 g-påvirkningers
...
81 COR.22506.311.01 upåvirkeligst
82 COR.21653.300.01 upåvirket
83 COR.21653.301.01 upåvirket
84 COR.21653.302.01 upåvirkede
85 COR.21653.303.01 upåvirkede
86 COR.21653.309.01 upåvirket
Some oddities are “letpåvirkeligere” and “upåvirkeligst”. A Google search returns practically no examples of such words on the Internet. One sole example is “…i en endnu letpåvirkeligere alder…”.
There are a few compounds: g-påvirkning, LSD-påvirket, narkotikapåvirket, and spirituspåvirket.
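The grep and awk extraction above can also be sketched in Python. This is only a minimal sketch: the function name `extract_forms` is hypothetical, and it assumes, as the awk command suggests, that the file is tab-separated with the identifier in column 1 and the word form in column 5. Unlike grep, it tests only the form column rather than the whole line. The sample lines reuse identifiers and forms from the listing above, but the middle columns are placeholders, not real COR content.

```python
def extract_forms(lines, substring, id_col=0, form_col=4):
    """Return (id, form) pairs for tab-separated lines whose form contains substring."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) <= max(id_col, form_col):
            continue  # skip short or malformed lines
        if substring in parts[form_col]:
            pairs.append((parts[id_col], parts[form_col]))
    return pairs


# Placeholder sample in the assumed five-column layout ("-" marks unknown columns)
sample = [
    "COR.56543.110.01\t-\t-\t-\tg-påvirkning\n",
    "COR.00000.000.00\t-\t-\t-\thus\n",
    "COR.21653.300.01\t-\t-\t-\tupåvirket\n",
]

pairs = extract_forms(sample, "påvirk")
for identifier, form in pairs:
    print(identifier, form)
```

With the real file, the same function could be fed an open file handle, e.g. `extract_forms(open("ro2021-0.9.cor", encoding="utf-8"), "påvirk")`, provided the column assumption holds.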
As explained in “Extracting and counting variations of a word with a subword in a corpus”, words from the DREAMS project corpus with the stem “påvirk” can be extracted with
cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n
There are 543 words (forms) with “påvirk”, including spelling errors and/or PDF extraction errors, for instance, “detteafsnitbeskriveshvilketrafikpåvirkninger” and “påvirknng”. There are many compounds. An excerpt is:
230 1 vibrationspåvirknin
231 1 vilpåvirke
232 1 vindmiljøpåvirkningen
233 1 vindmøllerspåvirkningaf
234 1 vindpåvirk
235 1 vindpåvirkningsområde
236 1 vurderingafpåvirkning
237 1 ændretvandpåvirkning
238 2 ammoniakpåvirkninger
239 2 anlægspåvirkninger
240 2 arbejdsmiljøpåvirkninger
...
429 9 klimapåvirkningsgraden
430 9 miljøpåvirket
431 9 temperaturpåvirkninger
432 9 vindpåvirkningerne
433 10 forureningspåvirkning
434 10 kulturpåvirkede
435 10 kulturpåvirket
436 10 påvirkelig
437 10 påvirkende
...
534 1550 påvirkningerne
535 1699 miljøpåvirkning
536 2405 påvirker
537 3858 påvirkes
538 4130 miljøpåvirkninger
539 6539 påvirket
540 8483 påvirkningen
541 9664 påvirke
542 9876 påvirkninger
543 25630 påvirkning
Here the central noun form “påvirkning” appears 25,630 times in the corpus, while the central verb form “påvirke” appears 9,664 times.
All in all, only a small share of the “påvirk” word forms found in this particular corpus are matched by COR.
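How one might quantify such coverage can be sketched as follows. The function `coverage` is a hypothetical helper that reports the share of distinct forms (types) and the share of occurrences (tokens) found in a register. The counts are taken from the excerpt above, but `toy_register` is a made-up stand-in and not actual COR content, so the printed numbers say nothing about the real coverage.

```python
from collections import Counter


def coverage(corpus_counts, register_forms):
    """Return (matched forms, share of types, share of tokens) covered by the register."""
    register = {form.lower() for form in register_forms}
    matched = sorted(word for word in corpus_counts if word.lower() in register)
    total_types = len(corpus_counts)
    total_tokens = sum(corpus_counts.values())
    type_share = len(matched) / total_types if total_types else 0.0
    token_share = (sum(corpus_counts[w] for w in matched) / total_tokens
                   if total_tokens else 0.0)
    return matched, type_share, token_share


# Four corpus forms with their counts from the excerpt above
corpus_counts = Counter({
    "påvirkning": 25630,
    "miljøpåvirkning": 1699,
    "vindpåvirk": 1,
    "vilpåvirke": 1,
})

# Assumption: a toy register for illustration only, not the real COR word list
toy_register = {"påvirkning", "upåvirket"}

matched, type_share, token_share = coverage(corpus_counts, toy_register)
print(matched, f"{type_share:.2f}", f"{token_share:.2f}")
```

Reporting both shares is a design choice: a register can miss most distinct forms (many of them rare compounds or extraction errors) while still covering the bulk of the running text.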
The Danish wordnet, DanNet, has even fewer words matching “påvirk”. With a UTF-8 DanNet word file:
grep påvirk words-utf8.rdf
Only 3 words are reported:
<wn20schema:lexicalForm>påvirke</wn20schema:lexicalForm>
<wn20schema:lexicalForm>upåvirkelig</wn20schema:lexicalForm>
<wn20schema:lexicalForm>påvirkningsmulighed</wn20schema:lexicalForm>
Extracting and counting variations of a word with a subword in a corpus
With a one-liner, one can count how many times each variation of a word containing a given subword occurs. Here the corpus is in the file “sentences.txt”, the search is for words containing “påvirk”, and Python does the matching.
cat sentences.txt | python -c "import re; p = re.compile(r'(\w*påvirk\w*)'); [print(m) for line in open(0).readlines() for m in p.findall(line.lower())]" | sort | uniq -c | sort -n
In the corpus I have from the DREAMS project, a part of the result is
979 støjpåvirkningen
988 miljøpåvirkningerne
1389 støjpåvirkning
1550 påvirkningerne
1699 miljøpåvirkning
2405 påvirker
3858 påvirkes
4130 miljøpåvirkninger
6539 påvirket
8483 påvirkningen
9664 påvirke
9876 påvirkninger
25630 påvirkning
grep has an issue with \w and locale that I have not been able to resolve. This does not count correctly:
grep -Pio "\w*påvirk\w*" sentences.txt | sort | uniq -c | sort -n
A word such as “miljøpåvirkning” is not matched. A number of other Linux tools do not necessarily work the way one would expect, see also case conversion.
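What an ASCII-only \w would do to such words can be illustrated in Python, where the `re.ASCII` flag restricts \w to [a-zA-Z0-9_]. This is only an illustration of the symptom. Whether the same thing happens inside grep -P is an assumption I have not verified.

```python
import re

text = "miljøpåvirkning og støjpåvirkning"

# Default for str patterns in Python 3: \w matches Unicode letters such as ø and å
unicode_pattern = re.compile(r"\w*påvirk\w*")

# With re.ASCII, \w no longer matches ø or å, so the prefix is cut off
ascii_pattern = re.compile(r"\w*påvirk\w*", re.ASCII)

unicode_matches = unicode_pattern.findall(text)
ascii_matches = ascii_pattern.findall(text)

print(unicode_matches)  # ['miljøpåvirkning', 'støjpåvirkning']
print(ascii_matches)    # ['påvirkning', 'jpåvirkning']
```

With the ASCII-only class, the compound is truncated at the last non-ASCII letter before the stem, so its count is wrongly added to a shorter form.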
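The Python one-liner can be converted to a script:
#!/usr/bin/python
import re, sys

# The subword to search for is given as the first command-line argument
if len(sys.argv) < 2:
    print("Missing word to search for")
    sys.exit(1)

pattern = re.compile(r'(\w*' + sys.argv[1] + r'\w*)')

# Read the corpus from standard input and print each lowercased match
for line in open(0).readlines():
    for match in pattern.findall(line.lower()):
        print(match)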
It can then be used with, e.g.:
cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n
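The sort | uniq -c | sort -n part of the pipeline can also be done in Python with `collections.Counter`. The sketch below is one possible variant, not the script above: `count_variants` is a hypothetical name, the sample sentences are made up, and the subword is passed through `re.escape`, so it is treated as literal text rather than as a regular expression.

```python
import re
from collections import Counter


def count_variants(lines, subword):
    """Count lowercased words that contain subword across an iterable of lines."""
    pattern = re.compile(r"\w*" + re.escape(subword) + r"\w*")
    counts = Counter()
    for line in lines:
        counts.update(pattern.findall(line.lower()))
    return counts


# Made-up sample sentences standing in for sentences.txt
sample_lines = [
    "Projektet kan påvirke miljøet.",
    "Påvirkning fra støj og miljøpåvirkning.",
    "Ingen væsentlig påvirkning.",
]

counts = count_variants(sample_lines, "påvirk")

# Print in ascending order of count, like sort -n
for word, n in sorted(counts.items(), key=lambda item: (item[1], item[0])):
    print(f"{n:7} {word}")
```

For a real corpus, the function can be given an open file handle, e.g. `count_variants(open("sentences.txt", encoding="utf-8"), "påvirk")`.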
Part-of-speech tags in Det Centrale Ordregister
The Danish word registry Det Centrale Ordregister was launched at the end of May 2022. A file with the resource, called ro2021-0.9.cor, is available from the homepage. The part-of-speech tags used in the resource can be counted with this Python script:
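#!/usr/bin/python
from collections import Counter

# Collect the main part-of-speech tag (the part before the first ".")
# from the fourth tab-separated column of each line
pos = []
for line in open("ro2021-0.9.cor"):
    parts = line.split('\t')
    pos.append(parts[3].split(".")[0])

counts = Counter(pos)
for word, count in counts.most_common():
    print(f"{count:6} {word}")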
The result is
339523 sb
 92900 adj
 79533 vb
  1388 prop
   904 adv
   559 fork
   269 kolon
   238 talord
   196 flerord
   147 udråbsord
   101 pron
    96 præp
    64 konj
    59 præfiks
    36 lydord
     2 fsubj
     1 infinitivens
     1 art
“fsubj” is “formelt subjekt”, a special Danish word type, see, e.g., “Der – som formelt subjekt i dansk” (Scholia). I have never heard of “kolon” before. Words such as alstyrende, altforbarmende, amok, anno, anstigende, arilds and attentiden are labeled as “kolon”. “infinitivens” is the word “at” (“to” for the infinitive in English).
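To see which words carry an unfamiliar tag such as “kolon”, the forms can be grouped by tag. The sketch below assumes the same layout as the scripts above: the tag in the fourth tab-separated column and the word form in the fifth. The function name `forms_by_pos` is hypothetical, and the sample lines are placeholders that only reuse tags and words mentioned in this post.

```python
from collections import defaultdict


def forms_by_pos(lines, pos_col=3, form_col=4):
    """Group word forms by main part-of-speech tag (the part before the first '.')."""
    groups = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) <= max(pos_col, form_col):
            continue  # skip short or malformed lines
        main_tag = parts[pos_col].split(".")[0]
        groups[main_tag].add(parts[form_col])
    return groups


# Placeholder lines in the assumed layout ("-" marks unknown columns)
sample = [
    "-\t-\t-\tkolon\tamok\n",
    "-\t-\t-\tkolon\tanno\n",
    "-\t-\t-\tinfinitivens\tat\n",
]

groups = forms_by_pos(sample)
print(sorted(groups["kolon"]))
```

With the real file, `forms_by_pos(open("ro2021-0.9.cor", encoding="utf-8"))["kolon"]` would list the forms under that tag, provided the column assumption holds.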