Month: June 2022

Coverage of Det Centrale Ordregister for technical reports

Posted on June 22, 2022

How well does Det Centrale Ordregister (COR), the Danish national word register, cover the words in a corpus of technical reports? Words with the stem “påvirk” are interesting for our DREAMS project. In the project, we process Danish environmental impact assessment reports, and “påvirk” is the stem corresponding to the English word “impact”.

“påvirk” words from the COR database can be extracted with

grep "påvirk" ro2021-0.9.cor | awk -F'\t' '{print $1, $5}'
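The same extraction can be sketched in Python. The function name cor_forms and the sample lines are made up for illustration; the only assumptions about the file layout are the ones implied by the awk command above, namely tab-separated fields with the identifier in field 1 and the word form in field 5. Like grep, the sketch matches the stem anywhere on the line.

```python
def cor_forms(lines, stem):
    """Return (identifier, form) pairs for lines containing stem."""
    pairs = []
    for line in lines:
        if stem not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Skip lines that do not have the assumed five fields
            continue
        pairs.append((fields[0], fields[4]))
    return pairs


# Made-up sample lines: only fields 1 and 5 are meaningful here,
# the "-" placeholders stand in for fields not used by this sketch
sample = [
    "COR.56543.110.01\t-\t-\t-\tg-påvirkning\n",
    "COR.00000.000.00\t-\t-\t-\thus\n",
]

for identifier, form in cor_forms(sample, "påvirk"):
    print(identifier, form)

# With the real file one would presumably pass an open file instead:
# with open("ro2021-0.9.cor", encoding="utf-8") as cor_file:
#     print(len(cor_forms(cor_file, "påvirk")))
```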

One finds 86 words (forms) matching “påvirk”, with examples:

      1 COR.56543.110.01 g-påvirkning
      2 COR.56543.111.01 g-påvirkningen
      3 COR.56543.112.01 g-påvirkninger
      4 COR.56543.113.01 g-påvirkningerne
      5 COR.56543.114.01 g-påvirknings
      6 COR.56543.115.01 g-påvirkningens
      7 COR.56543.116.01 g-påvirkningers
    ...
     81 COR.22506.311.01 upåvirkeligst
     82 COR.21653.300.01 upåvirket
     83 COR.21653.301.01 upåvirket
     84 COR.21653.302.01 upåvirkede
     85 COR.21653.303.01 upåvirkede
     86 COR.21653.309.01 upåvirket

Some oddities are “letpåvirkeligere” and “upåvirkeligst”. A Google search returns practically no examples of such words on the Internet. The sole example found is “…i en endnu letpåvirkeligere alder…”.

There are a few compounds: g-påvirkning, LSD-påvirket, narkotikapåvirket, and spirituspåvirket.

As explained in the post “Extracting and counting variations of a word with a subword in a corpus”, words from the DREAMS project corpus with the stem “påvirk” can be extracted with

cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n

There are 543 words (forms) with “påvirk”, including spelling errors and/or PDF extraction errors, for instance, “detteafsnitbeskriveshvilketrafikpåvirkninger” and “påvirknng”. There are many compounds. An excerpt is:

    230       1 vibrationspåvirknin
    231       1 vilpåvirke
    232       1 vindmiljøpåvirkningen
    233       1 vindmøllerspåvirkningaf
    234       1 vindpåvirk
    235       1 vindpåvirkningsområde
    236       1 vurderingafpåvirkning
    237       1 ændretvandpåvirkning
    238       2 ammoniakpåvirkninger
    239       2 anlægspåvirkninger
    240       2 arbejdsmiljøpåvirkninger
    ...
    429       9 klimapåvirkningsgraden
    430       9 miljøpåvirket
    431       9 temperaturpåvirkninger
    432       9 vindpåvirkningerne
    433      10 forureningspåvirkning
    434      10 kulturpåvirkede
    435      10 kulturpåvirket
    436      10 påvirkelig
    437      10 påvirkende
    ...
    534    1550 påvirkningerne
    535    1699 miljøpåvirkning
    536    2405 påvirker
    537    3858 påvirkes
    538    4130 miljøpåvirkninger
    539    6539 påvirket
    540    8483 påvirkningen
    541    9664 påvirke
    542    9876 påvirkninger
    543   25630 påvirkning

Here the central noun form “påvirkning” appears 25,630 times in the corpus, while the central verb form “påvirke” appears 9,664 times.

All in all, very few of the word forms found in this particular corpus are matched by COR for this particular stem: COR lists 86 forms, while the corpus contains 543.
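One way to put a number on the coverage is to compute it both per distinct form (types) and per occurrence (tokens). The sketch below is only an illustration: the function name coverage is made up, the counts are taken from the excerpt above, and the set of register forms is an assumption for the example, not something checked against COR.

```python
def coverage(corpus_counts, register_forms):
    """Return (type coverage, token coverage) of a register over a corpus."""
    if not corpus_counts:
        return 0.0, 0.0
    matched = [word for word in corpus_counts if word in register_forms]
    type_coverage = len(matched) / len(corpus_counts)
    token_coverage = (
        sum(corpus_counts[word] for word in matched)
        / sum(corpus_counts.values())
    )
    return type_coverage, token_coverage


# Counts copied from the corpus excerpt
corpus_counts = {
    "påvirkning": 25630,
    "påvirke": 9664,
    "miljøpåvirkning": 1699,
    "vilpåvirke": 1,
}

# ASSUMPTION: illustrative membership only, not verified against COR
register_forms = {"påvirkning", "påvirke"}

type_coverage, token_coverage = coverage(corpus_counts, register_forms)
print(f"types: {type_coverage:.2f}, tokens: {token_coverage:.2f}")
```

Because a few frequent forms account for most occurrences, token coverage can be high even when type coverage is low.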

The Danish wordnet, DanNet, has even fewer words matching “påvirk”. With a UTF-8 DanNet word file:

grep påvirk words-utf8.rdf

Only 3 words are reported:

    <wn20schema:lexicalForm>påvirke</wn20schema:lexicalForm>
    <wn20schema:lexicalForm>upåvirkelig</wn20schema:lexicalForm>
    <wn20schema:lexicalForm>påvirkningsmulighed</wn20schema:lexicalForm>
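To get the bare words rather than the surrounding markup, the lexicalForm elements can be picked out with a regular expression. This is a rough sketch that assumes each form sits on one line in exactly the shape shown above; the function name lexical_forms is made up, and a proper RDF parser would be more robust.

```python
import re

# Matches one lexicalForm element as it appears in the grep output
LEXICAL_FORM = re.compile(
    r"<wn20schema:lexicalForm>([^<]*)</wn20schema:lexicalForm>"
)


def lexical_forms(lines, stem):
    """Return the lexical forms containing stem, in file order."""
    forms = []
    for line in lines:
        for form in LEXICAL_FORM.findall(line):
            if stem in form:
                forms.append(form)
    return forms


# The three lines reported by grep, plus one made-up non-matching line
sample = [
    "    <wn20schema:lexicalForm>påvirke</wn20schema:lexicalForm>",
    "    <wn20schema:lexicalForm>upåvirkelig</wn20schema:lexicalForm>",
    "    <wn20schema:lexicalForm>påvirkningsmulighed</wn20schema:lexicalForm>",
    "    <wn20schema:lexicalForm>hus</wn20schema:lexicalForm>",
]

for form in lexical_forms(sample, "påvirk"):
    print(form)
```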

This entry was posted in language, technical and tagged Det Centrale Ordregister, DREAMS.

Extracting and counting variations of a word with a subword in a corpus

Posted on June 9, 2022 Updated on June 13, 2022

With a one-liner, one can count how many times each variation of a word containing a given subword occurs. Here the corpus is in the file “sentences.txt”, the search is for words containing “påvirk”, and Python does the matching.

cat sentences.txt | python -c "import re; p = re.compile(r'(\w*påvirk\w*)'); [print(m) for line in open(0).readlines() for m in p.findall(line.lower())]" | sort | uniq -c | sort -n

In the corpus I have from the DREAMS project, a part of the result is

    979 støjpåvirkningen
    988 miljøpåvirkningerne
   1389 støjpåvirkning
   1550 påvirkningerne
   1699 miljøpåvirkning
   2405 påvirker
   3858 påvirkes
   4130 miljøpåvirkninger
   6539 påvirket
   8483 påvirkningen
   9664 påvirke
   9876 påvirkninger
  25630 påvirkning
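The sorting and counting can also be done inside Python with collections.Counter instead of sort and uniq. The following is a minimal sketch under the same matching rule as the one-liner; the function name count_variations and the three sample sentences are made up for illustration.

```python
import re
from collections import Counter


def count_variations(lines, subword):
    """Count lower-cased words that contain subword."""
    # re.escape makes the subword match literally
    pattern = re.compile(r"\w*" + re.escape(subword.lower()) + r"\w*")
    counts = Counter()
    for line in lines:
        counts.update(pattern.findall(line.lower()))
    return counts


# Made-up sample sentences standing in for sentences.txt
sample = [
    "Projektet kan påvirke miljøet.",
    "Påvirkningen er lille, og støjpåvirkningen er begrænset.",
    "Anlægget vil ikke påvirke området.",
]

counts = count_variations(sample, "påvirk")

# Print in ascending order of count, like "sort -n"
for word, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{count:7} {word}")
```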

grep has an issue with \w and the locale that I have not been able to resolve. This does not count correctly:

grep -Pio "\w*påvirk\w*" sentences.txt | sort | uniq -c | sort -n

A word such as “miljøpåvirkning” is not matched. A number of other Linux tools do not necessarily work the way one would expect; see also case conversion.
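In Python 3, \w in a string pattern is Unicode-aware by default, which is why the Python version handles “ø” and “å”. The sketch below contrasts that default with the re.ASCII flag, where \w only covers [a-zA-Z0-9_]. It reproduces a similar symptom inside Python; it is not a diagnosis of what grep actually does.

```python
import re

word = "miljøpåvirkning"

# Default for str patterns: \w matches Unicode letters such as ø
unicode_pattern = re.compile(r"\w*påvirk\w*")

# With re.ASCII, \w no longer matches ø, so the match cannot extend
# leftwards past it
ascii_pattern = re.compile(r"\w*påvirk\w*", re.ASCII)

unicode_match = unicode_pattern.findall(word)
ascii_match = ascii_pattern.findall(word)

print(unicode_match)
print(ascii_match)
```

With ASCII-only \w, the compound is cut at the “ø” and would be counted as plain “påvirkning”, which would skew the counts.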

The Python one-liner can be converted to a script

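#!/usr/bin/python

import re
import sys

if len(sys.argv) < 2:
    print("Missing word to search for", file=sys.stderr)
    sys.exit(1)

# Escape the argument so it is matched literally, and lower-case it
# because each line is lower-cased before matching
subword = re.escape(sys.argv[1].lower())
pattern = re.compile(r'\w*' + subword + r'\w*')

# Read the corpus from standard input, one line at a time
for line in sys.stdin:
    for match in pattern.findall(line.lower()):
        print(match)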

Saved as an executable file named extract-word on the PATH, it can then be used with, e.g.:

cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n

This entry was posted in language, programming, Python and tagged python.

Part-of-speech tags in Det Centrale Ordregister

Posted on June 8, 2022

The Danish word registry Det Centrale Ordregister was launched at the end of May 2022. A file with the resource, called ro2021-0.9.cor, is available from the homepage. The part-of-speech tags used in the resource can be counted with this Python script:

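#!/usr/bin/python

from collections import Counter

pos = []

# The fourth tab-separated field is taken to hold the grammatical tag;
# the part before the first "." is used as the part-of-speech.
# The file is assumed to be UTF-8 encoded.
with open("ro2021-0.9.cor", encoding="utf-8") as cor_file:
    for line in cor_file:
        parts = line.rstrip("\n").split('\t')
        pos.append(parts[3].split(".")[0])

counts = Counter(pos)

# Print the tags from most to least frequent
for tag, count in counts.most_common():
    print(f"{count:6} {tag}")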

The result is

339523 sb
 92900 adj
 79533 vb
  1388 prop
   904 adv
   559 fork
   269 kolon
   238 talord
   196 flerord
   147 udråbsord
   101 pron
    96 præp
    64 konj
    59 præfiks
    36 lydord
     2 fsubj
     1 infinitivens
     1 art

“fsubj” is “formelt subjekt”, a special Danish word type; see, e.g., “Der – som formelt subjekt i dansk” (Scholia). I have never heard of “kolon” before. Words such as alstyrende, altforbarmende, amok, anno, anstigende, arilds and attentiden are labeled as “kolon”. “infinitivens” is the word “at” (“to” for the infinitive in English).
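To see which words carry an unfamiliar tag such as “kolon”, the forms can be grouped by tag. This is a sketch with assumptions: the function name forms_by_tag is made up, the tag is read from field 4 as in the script above, and the word form is assumed to be in field 5, as the awk command in the coverage post suggests. The sample lines are invented, including the “sb.xx” tag suffix.

```python
from collections import defaultdict


def forms_by_tag(lines):
    """Group word forms by the first component of their tag."""
    groups = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Skip lines without the assumed five fields
            continue
        tag = fields[3].split(".")[0]
        groups[tag].add(fields[4])
    return groups


# Invented sample lines: only fields 4 (tag) and 5 (form) matter here
sample = [
    "id1\t-\t-\tkolon\tamok\n",
    "id2\t-\t-\tkolon\tanno\n",
    "id3\t-\t-\tsb.xx\thus\n",
]

groups = forms_by_tag(sample)
print(sorted(groups["kolon"]))
```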

This entry was posted in language, Python and tagged Det Centrale Ordregister, postagging.


Finn Årup Nielsen's blog

– research, science, technology, music, personal opinions, etc.
