ChatGPT, programming exams and hidden data
ChatGPT presents a formidable challenge for teachers giving programming exams with open Internet access: ChatGPT usually solves introductory programming exams all too easily. One possible way to circumvent a vanilla ChatGPT attack on a programming exam may be to hide some of the data. In my discussion with Tue Herlau (Scholia), he came up with the idea of hiding data from the prompt of ChatGPT. In Python, we can hide the data in a pickle file:
import pickle

data = {
    'year': [2000, 2010, 2011, 2012],
    'harvest': [312, 123, 542, 422],
}
with open('data.pickle', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
Now we can give the student the following instructions as a programming exam:
Given the data in the pickle file ‘data.pickle’, compute the mean of the harvest for the years after 2005.
The student needs to examine the pickle file before writing the program. ChatGPT will generate some relevant code, but its handling of the data structure was wrong in this session:
When the code is executed, the error is “TypeError: string indices must be integers”.
Unfortunately, it seems to be relatively easy to get ChatGPT to generate the correct code with a two-step approach. Taking part of the solution, we can examine the content of the data:
import pickle

# Load the pickle file
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

print(data)
That will give us “{'year': [2000, 2010, 2011, 2012], 'harvest': [312, 123, 542, 422]}”
Now we can extend the prompt:
Given the data in the pickle file ‘data.pickle’, compute the mean of the harvest for the years after 2005.
The data in the file is {'year': [2000, 2010, 2011, 2012], 'harvest': [312, 123, 542, 422]}
This gives working code with the correct result: 362.3333.
The conclusion is that hiding data makes it slightly more difficult to use ChatGPT for a programming exam solution, but at least in the above simple case data hiding is not enough.
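For comparison, a minimal solution to the exam question could look like the sketch below. It recreates the pickle file first so the example is self-contained, and uses the statistics module for the mean:

```python
import pickle
from statistics import mean

# Recreate the exam data file so the sketch is self-contained
data = {
    'year': [2000, 2010, 2011, 2012],
    'harvest': [312, 123, 542, 422],
}
with open('data.pickle', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

# Load the data and compute the mean harvest for the years after 2005
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

result = mean(harvest for year, harvest in zip(data['year'], data['harvest'])
              if year > 2005)
print(round(result, 4))  # 362.3333
```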
Extracting and counting variations of a word with a subword in a corpus
With a one-liner, one can count the number of times variations of a subword occur. Here with a corpus in the file “sentences.txt”, searching for words containing “påvirk”, and using Python for the matching:
cat sentences.txt | python -c "import re; p = re.compile(r'(\w*påvirk\w*)'); [print(m) for line in open(0).readlines() for m in p.findall(line.lower())]" | sort | uniq -c | sort -n
In the corpus I have from the DREAMS project, a part of the result is
   979 støjpåvirkningen
   988 miljøpåvirkningerne
  1389 støjpåvirkning
  1550 påvirkningerne
  1699 miljøpåvirkning
  2405 påvirker
  3858 påvirkes
  4130 miljøpåvirkninger
  6539 påvirket
  8483 påvirkningen
  9664 påvirke
  9876 påvirkninger
 25630 påvirkning
grep has an issue with \w and locales that I have not been able to resolve. This does not count correctly:
grep -Pio "\w*påvirk\w*" sentences.txt | sort | uniq -c | sort -n
A word such as “miljøpåvirkning” is not matched. A number of other Linux tools do not necessarily work the way one would expect; see also case conversion.
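The reason the Python approach works is that Python 3's re module matches Unicode word characters with \w by default, so Danish letters such as ø and å are handled correctly. A small illustration:

```python
import re

# Python 3's re module matches Unicode word characters with \w by default,
# so Danish letters such as ø and å are handled correctly.
pattern = re.compile(r'(\w*påvirk\w*)')
text = "Miljøpåvirkningen og støjpåvirkning kan påvirke os."
matches = pattern.findall(text.lower())
print(matches)  # ['miljøpåvirkningen', 'støjpåvirkning', 'påvirke']
```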
The Python one-liner can be converted to a script
#!/usr/bin/python

import re
import sys

if len(sys.argv) < 2:
    print("Missing word to search for")
    exit(1)

pattern = re.compile(r'(\w*' + sys.argv[1] + r'\w*)')
for line in open(0).readlines():
    for match in pattern.findall(line.lower()):
        print(match)
Then it can be used with, e.g.,:
cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n
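If one prefers to stay in Python, the sort | uniq -c | sort -n part of the pipeline can be replaced with a collections.Counter. A self-contained sketch with a tiny inline sample text (not the DREAMS corpus):

```python
import re
from collections import Counter

# A tiny inline sample text standing in for sentences.txt
text = "Støjpåvirkning og miljøpåvirkning. Støjpåvirkning påvirker os."

pattern = re.compile(r'(\w*påvirk\w*)')
counts = Counter(pattern.findall(text.lower()))
for word, count in counts.most_common():
    print(f"{count:6} {word}")
```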
Part-of-speech tags in Det Centrale Ordregister
The Danish word registry Det Centrale Ordregister was launched at the end of May 2022. A file with the resource, called ro2021-0.9.cor, is available from the homepage. The part-of-speech tags used in the resource can be counted with this Python script:
#!/usr/bin/python

from collections import Counter

pos = []
for line in open("ro2021-0.9.cor"):
    parts = line.split('\t')
    pos.append(parts[3].split(".")[0])

counts = Counter(pos)
for word, count in counts.most_common():
    print(f"{count:6} {word}")
The result is
339523 sb
 92900 adj
 79533 vb
  1388 prop
   904 adv
   559 fork
   269 kolon
   238 talord
   196 flerord
   147 udråbsord
   101 pron
    96 præp
    64 konj
    59 præfiks
    36 lydord
     2 fsubj
     1 infinitivens
     1 art
“fsubj” is “formelt subjekt”, a special Danish word type, see, e.g., “Der – som formelt subjekt i dansk” (Scholia). “kolon” I have never heard about before. Words such as alstyrende, altforbarmende, amok, anno, anstigende, arilds and attentiden are labeled as “kolon”. “infinitivens” is the word “at” (“to” for the infinitive in English).
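To illustrate what the script assumes about the file format, here is a made-up line (the real ro2021-0.9.cor lines may have more or different fields; only the tab-separated layout with a dotted part-of-speech tag in the fourth field is assumed):

```python
# Made-up line illustrating the assumed format: tab-separated fields with
# a dotted part-of-speech tag ("sb.neu.sg" here) in the fourth field.
line = "COR.12345\thus\thus\tsb.neu.sg\n"

parts = line.split('\t')
pos = parts[3].split('.')[0]
print(pos)  # sb
```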
Verbs in Danish Dependency Treebank
The Danish Dependency Treebank (DDT) v. 1.0 was made by Matthias Trautner Kromann (Scholia) at Copenhagen Business School from 2002 to 2004 according to its README file. It is based on texts from the PAROLE corpus collected by Ole Norling-Christensen (Scholia) from 1983 to 1992. DDT’s XML files are available with the data.
Quick and dirty grepping in DDT on one of the XML files with
grep --text 'msd="V' ddt-1.0.tag | wc
reports 15,597 verbs in the dataset.
A bit more counting with
grep --text 'msd="V' ddt-1.0.tag | python3 -c "print('\n'.join(line.split('\"')[1] for line in open(0, encoding='iso-8859-1').readlines()))" | sort | uniq -c | sort -nr | wc
shows 1,551 unique verb lemmas. With the words written to a file with
… | uniq | sort > ddt-verb-lemmas.txt
a sample of these are:
åbne, accelerere, acceptere, administrere, adskille, advare, ændre, ærgre, ætse, afblæse, afbryde, afdække, affærdige, affinde
How many of these lemmas are in Wikidata, which currently has over 2,900 Danish verbs?
import requests

ddt_verb_lemmas = set(open('ddt-verb-lemmas.txt').read().split())

query = """
SELECT ?lemma {
  ?lexeme dct:language wd:Q9035 ;
          wikibase:lemma ?lemma ;
          wikibase:lexicalCategory wd:Q24905 .
}
"""
url = "https://query.wikidata.org/sparql"
response = requests.get(url, params={'query': query, 'format': 'json'})
data = response.json()['results']['bindings']
wikidata_verb_lemmas = set(datum['lemma']['value'] for datum in data)

len(ddt_verb_lemmas - wikidata_verb_lemmas)
gives 269 missing verbs in Wikidata:
affærdige, barrikadere, bedyre, bekomme, belejre, belægge, belæsse, bemale, bemyndige, berettige, berømme, bestjæle, bestorme, betræde, beære, bilægge, bivåne, blusse, blødgøre, brodere, budgettere, bugne, defilere, demoralisere, deportere, destillere, destruere, detronisere, diktere, doble, drapere, dvæle, ekskludere, ekstraindkalde, fabrikslukke, fartbegrænse, fastansætte, fejl-operere, fetere, filosofere, fin-sortere, flade, flagre, flakke, flakse, flamme, flintre, flirte, focusere, foragte, forankre, foranledige, forbeholde, fordufte, forføre, forkalke, forlige, forlise, forlove, formilde, forpligtige, forrykke, forskyde, fortrække, fortvivle, forudsige, fradømme, fralægge, frankere, fraråde, fraskrive, frasortere, fratræde, fremholde, frikende, friske, frustrere, fuldføre, fusionere, fyge, fænge, geare, gennembanke, gennemopleve, gennemprøve, genoptrykke, gestikulere, gispe, gjalde, glatte, gløde, gravere, grunde, gyse, hage, havne, hefte, hegle, henholde, henkaste, hentyde, henvide, humpe, huse, hverve, hvine, hvisle, hyre, hæge, hærde, iblande, illudere, improvisere, indgribe, indhylle, indhøste, indkassere, indkvartere, indoktrinere, indordne, indpakke, indskyde, indsmugle, indstemme, indstifte, indvie, indvælge, jævnføre, kanonere, kante, kikke, kime, kimse, knirke, knuge, kolportere, kommandere, kompromittere, konkretisere, kriminalisere, krone, kue, kuldsejle, kvie, legalisere, lirke, lue, lædere, lække, læsetræne, læspe, løje, løsne, løsrive, medfølge, mestre, mishandle, modsætte, mæske, mønstre, mørkelægge, narre, neddæmpe, nedsable, nedsænke, nedværdige, nidstirre, niveaudele, nyinvestere, nødlande, oparbejde, opdigte, opfange, opfølge, ophidse, oplagre, oplive, opstøve, opsøge, optræne, opøve, overbeglo, overfylde, overhælde, overrende, overrumple, overskrue, overvurdere, passivisere, pervertere, plette, plombere, praktisere, prissætte, profitere, programmere, pryde, præferere, prøvekøre, puffe, pulverisere, påbyde, påklage, pønse, rafle, rappe, 
rasere, ratificere, rivalisere, rumstere, røbe, røge, sagsøge, sammenstille, sammenstykke, sammentømre, sample, samstemme, scanne, ses, signere, simre, skeje, skippe, sladre, slentre, slynge, sløve, smashe, smugle, snage, spekulere, spraye, stakke, stationere, sukke, surmule, svimle, sønderlemme, tackle, tangere, taxie, tilbede, tildanne, tilsende, tilstå, time, tippe, tjekkere, toppe, trisse, trone, trygle, tv-annoncere, ty, tøffe, udfritte, udglatte, udlære, udmærke, udskære, udstyre, udtørre, ulovliggøre, vandkæmme, virkeliggøre, våge, vænne, værdsætte
Some of these are homonyms with other words: flade, flamme, friske, glatte, grunde, hage, havne, huse, krone, mestre, mønstre, time, toppe.
Three have alternative lemma forms: indvi/indvie, kigge/kikke, tackle/takle. These are already in Wikidata, so actually only 266 verbs are missing in Wikidata.
One is an independent deponent verb: ses
There are at least two errors in DDT: tjekkere, henvide. They occur in the sentences “Der er også radikale tjekkere, som må ændring holdning.” and “Det så vældig godt ud, der var også udleveret henved 100 fakler.” So 264 missing verbs in Wikidata.
Danish verbs in Wikidata and DanNet
Besides finding Danish verbs by looking through Danish corpora, see, e.g., Finding Danish verbs in Europarl with Python, you can also find Danish verbs from lexical resources.
Danish verbs are available in DanNet, the Danish wordnet. These can be extracted from the RDF XML files, here with a Python script relying on rdflib and a SPARQL query:
from rdflib import Graph

filenames = "words part_of_speech".split()
g = Graph()
for filename in filenames:
    g.parse(filename + '.rdf')

query = """
SELECT ?word ?representation {
  ?word dn_schema:partOfSpeech "Verb" .
  ?word wn20schema:lexicalForm ?representation .
}"""

result = g.query(
    query,
    initNs={
        'dn': 'http://www.wordnet.dk/owl/instance/2009/03/instances/',
        'dn_schema': 'http://www.wordnet.dk/owl/instance/2009/03/schema/',
        'wn20schema': 'http://www.w3.org/2006/03/wn/wn20/schema/',
    })

n = 0
for row in sorted(result, key=lambda row: row[1]):
    if ' ' not in row[1]:
        n += 1
        print("{:11} {}".format(str(row[0])[58:], row[1]))

print("Total number of single-word verbs: {}".format(n))
The full list returns 5581 entries. A number of these verbs include the reflexive “sig” (e.g., “forvente sig”, “foræde sig”) or adverbs (e.g., “abe efter”, “arbejde over”). Excluding these (' ' not in row[1]) one ends up with 3994 single words. Among those words are a number I do not recall having run into before: pizzikere, afhjelme, indsukre, …
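The exclusion of multiword entries amounts to a simple membership test on the space character, as in this small sketch with made-up lemmas:

```python
# Made-up lemmas: two multiword expressions and two single words
lemmas = ["forvente sig", "abe efter", "løbe", "synge"]

single_words = [lemma for lemma in lemmas if ' ' not in lemma]
print(single_words)  # ['løbe', 'synge']
```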
In Wikidata, the DanNet word identifier may be entered using the P6140 property.
Not all Danish verbs are in DanNet. At the beginning of January 2022, there were 2862 Danish verbs in Wikidata. This number ranks Danish as number 11 in terms of the number of verbs entered; see Ordia’s statistics, where Indonesian is ahead with 12,781 verbs followed by Estonian’s 7,932. In Wikidata, I have marked verbs not available in DanNet with Wikidata’s novalue, so they can be queried with the Wikidata Query Service and this SPARQL query:
SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 ;
          a wdno:P6140 .
}
In January 2022, there were around 580 Danish verbs in Wikidata that were marked as not being in DanNet, e.g., dibse, indebære, erfare, misforstå and opgradere. Such words are fairly common, not like DanNet’s “afhjelme”, – a word that can cause confusion according to a text at the Danish parliament website.
Not all of Wikidata’s Danish verbs indicate whether the word is in DanNet or not. The Wikidata Query Service may tell us about these words:
SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 .
  FILTER NOT EXISTS { ?lexeme p:P6140 [] }
}
As of 6 January 2022, there were 387 Danish verbs where the DanNet status was not indicated. These instances may be because I forgot to type it in, made a typo, or found the case difficult. For “difficult”, consider the word “sigte”. It has three DanNet verb homograph lexemes (aim, charge, sieve). The Wikidata lexeme L207570 had mixed these lexemes through links to the lexemes tilsigte and sigtelse, – as far as I can tell.
GitHub statistics for DTU Compute researchers
DTU Compute is a department at the Technical University of Denmark. How many followers do DTU Compute researchers have? This Python code attempts to answer that question.
Code
This is based on the code at my previous blogpost Female GitHubbers.
import re
from time import sleep

import requests
import pandas as pd

query = """
SELECT
  ?researcher ?researcherLabel ?researcherDescription
  (SAMPLE(?github_) AS ?github)
WITH {
  SELECT DISTINCT ?researcher WHERE {
    ?researcher ( wdt:P108 | wdt:P463 | wdt:P1416 ) / wdt:P361* wd:Q23048689 .
  }
} AS %researchers
WHERE {
  INCLUDE %researchers
  ?researcher wdt:P2037 ?github_ .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,nl,no,ru,sv,zh" . }
}
GROUP BY ?researcher ?researcherLabel ?researcherDescription
"""

response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])

URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    print(url)
    try:
        response = requests.get(url,
                                headers={'Accept': 'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, followers))
    sleep(5)

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(['followers'], ascending=False)[columns].to_html(index=False))
Results
As is apparent from the code, the statistics are based on the labelling in Wikidata, which is likely incomplete: people not occurring on the list might not be on GitHub, may not be recorded in Wikidata as being on GitHub, or may not be recorded as being associated with DTU Compute.
The two top listings, Anders Boesen Lindbo Larsen and Rasmus Berg Palm, are no longer at DTU Compute, as far as I can tell.
First experiments with the T0 Hugging Face language model
The T0 models were released in October 2021 and are available via Hugging Face, see bigscience/T0pp; they are described in the paper Multitask Prompted Training Enables Zero-Shot Task Generalization (Scholia). The researchers behind the model claim “The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.”
The small language model, T0_3B, contains 3 billion parameters and fills up 11 gigabytes of disk space at ~/.cache/huggingface/transformers/a80e28…
After setting up protobuf, torch and transformers, the model can be autodownloaded and tests can be run. On the Hugging Face webpage, there are a few lines of Python code with a sentiment analysis example, here converted to use the small model and edited:
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy", return_tensors="pt"))[0]))
<pad> Positive</s>
It is unclear to me how well these large pre-trained language models speak other languages than English. My knowledge of prompt engineering is also limited. So the below examples are my first-shot naive attempts:
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Hvem er statsminister i Danmark?", return_tensors="pt"))[0]))
<pad> <unk>ystein Svensson</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("English: Copenhagen is the capital. Danish:", return_tensors="pt"))[0]))
<pad> Copenhagen is the capital of Denmark.</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("English: Copenhagen is the capital. Translate to French:", return_tensors="pt"))[0]))
<pad> Copenhagen is the capital of Denmark.</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("What city is the second largest in Denmark?", return_tensors="pt"))[0]))
<pad> Copenhagen</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote Der er et yndigt land?", return_tensors="pt"))[0]))
<pad> Theodore Roosevelt</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote the song 'She loves you'?", return_tensors="pt"))[0]))
<pad> John Lennon and Paul McCartney</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote 'Der er et yndigt land'?", return_tensors="pt"))[0]))
<pad> <unk>sgeir <unk>sgeirsson</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Question: Who wrote 'Der er et yndigt land'? Answer:", return_tensors="pt"))[0]))
<pad> Henrik Ibsen</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Question: What city is the second largest in Denmark? Answer:", return_tensors="pt"))[0]))
<pad> Copenhagen</s>
Each of these answers takes over 20 seconds to complete on the initial CPU-based system I used.
On the non-Danish question “Who wrote the song ‘She loves you’?” the model gets the answer right, but for the Danish questions it fails.
For the question “Who wrote Der er et yndigt land?”, i.e., who wrote the Danish national anthem, T0_3B answers incorrectly Theodore Roosevelt or Henrik Ibsen depending on the prompt, while the Google search engine returns “Adam Oehlenschläger” for me.
The question can be converted to SPARQL for submission to the Wikidata Query Service:
SELECT ?who {
  ?work rdfs:label 'Der er et yndigt land'@da ;
        ( wdt:P50 | wdt:P676 | wdt:P86 | wdt:P58 | wdt:P2679 | wdt:P2680 ) / rdfs:label ?who .
  FILTER (LANG(?who) = 'en')
}
The result there is:
Adam Oehlenschläger
Morten Arnfred
Jørgen Ljungdalh
Hans Ernst Krøyer
Oehlenschläger is the author of the text, Krøyer the composer, and Arnfred and Ljungdalh are the screenwriters of a Danish film with the same title as the anthem.
Simon Razniewski (Scholia), Gerhard Weikum (Scholia) and colleagues have recently published their DL4KG 2021 paper Language Models As or For Knowledge Bases (Scholia), where they contemplate the limitations and advantages of language models vs. knowledge bases/graphs. They had access to the GPT-3 language model:
Example: GPT-3 does not have tangible knowledge that Alan Turing was born in London; it merely assigns this a high confidence of 83%. Yann LeCun, on the other hand, is given medium confidence in being a citizen of France and Canada (67% and 26%), but he actually has French and USA citizenship, not Canadian. The LM assigns USA a very low score. The Wikidata KB, on the other hand, only states his French citizenship, not USA. Wikidata is incomplete, but it does not contain any errors.
– Language Models As or For Knowledge Bases, page 2
Danish speech recognition April 2021
Danish speech recognition is a developing field. There is a general concern about the lack of good open annotated speech data to train from, and within the sprogteknologi.dk project there is an effort to help establish more data. The Awesome Danish list records few spoken language corpora. Mozilla’s Common Voice has still not been set up for Danish. NST is conveniently licensed under CC0 and the 22 kHz speech corpus is 6.7 GB. An example sentence is “Da jeg havde travlt og blev utålmodig ,<komma> tilbød jeg at betale for frikadellerne .<punktum>”. The punctuation is read aloud. In Python, the raw 16-bit audio files may be read with

import numpy as np
A = np.fromfile('ST1/080601/003_PSA/DA_PSA01.001', dtype='>i2')

– as far as I can determine.
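The headerless big-endian 16-bit reading can be illustrated on a small synthetic file (the filename here is made up; only the dtype='>i2' round trip is demonstrated):

```python
import numpy as np

# The NST raw files appear to be headerless 16-bit big-endian PCM,
# hence dtype='>i2'. Demonstrated on a small synthetic file;
# 'nst_demo.raw' is a made-up name.
samples = np.array([0, 1000, -1000, 32767], dtype='>i2')
samples.tofile('nst_demo.raw')

A = np.fromfile('nst_demo.raw', dtype='>i2')
print(A.tolist())  # [0, 1000, -1000, 32767]
```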
The speech recognition tool Danspeech is a work of Lars Kai Hansen’s students, Martin Carsten Nielsen and Rasmus Arpe Fogh Jensen, and is available as a Python package. There is an associated demo repository. I was able – with a bit of hassle – to install danspeech and danspeechdemo in a Python 3.7 virtual environment with clones from the GitHub repositories, – not the danspeech version on the cheeseshop (though a fix may now be in place). The demo brings up a webpage where sound can be recorded and transcribed.
My first attempt was a sentence from a psychiatry book with a good number of unusual words: “En tilstand efter indtagelse af et psykoaktivt stof, som medfører forstyrrelser af bevidsthedsniveau, kognitive funktioner, perception, affekt, adfærd eller andre psykofysiologiske funktioner og reaktioner.” It is not clear to me which kind of configuration is the best for such a sentence, and my pronunciation may not be the best. The transcription below makes an error for the form of “psykoaktive/t”, “affect/effect” and the unusual word “psykofysiologiske/psykosocial giske”, where “giske” is not a Danish word. And punctuation is missing.
Google’s speech-to-text is another tool for Danish speech recognition. The service is freemium, with a demo on its webpage. The transcription of the same sentence as above – but perhaps spoken slightly differently – is shown. I was not able to get it to show the full sentence.
Female GitHubbers
In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There aren’t many. Currently just 27.
The Python code below gets the SPARQL results into a Python Pandas DataFrame, queries the GitHub API for the follower count and adds the information to a dataframe column. Then we can rank the female GitHub users according to follower count and format the result in an HTML table.
Code
import re
from time import sleep

import requests
import pandas as pd

query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("https://github.com/", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""

response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])

URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    try:
        response = requests.get(url,
                                headers={'Accept': 'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, followers))
    sleep(5)

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(['followers'], ascending=False)[columns].to_html(index=False))
Results
The top one is Jennifer Bryan, a Vancouver statistician that I do not know much about, but she seems to be involved in RStudio.
Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat has apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wikiworld: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.
Jennifer Bryan and Vanessa Sochat are almost “all-greeners”. Sochat has just a single non-green day.
I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.
Danish stopword lists
Python’s NLTK package has some support for Danish and there is a small list of 94 stopwords. They are available with
>>> import nltk
>>> nltk.corpus.stopwords.words('danish')
MIT-licensed spaCy is another NLP Python package. The support for Danish is still limited, but it has a stopword list. With version 2+ of spaCy, they are available from
>>> from spacy.lang.da.stop_words import STOP_WORDS
spaCy 2.0.3 has 219 words in that list.
MIT-licensed “stopwords-iso” has a list of 170 words (October 2016 version). They are available from the GitHub repo at https://github.com/stopwords-iso.
The Snowball stemmer has 94 words at http://snowball.tartarus.org/algorithms/danish/stop.txt.
In R, the GPL-3-licensed tm package uses the Snowball stemmer stopword list. The 94 words are available with:
> install.packages("tm")
> library(tm)
> stopwords(kind="da")
The NLTK stopwords are also the same as the Snowball stopwords. It can be checked with:
import re

import nltk
import requests

url = "http://snowball.tartarus.org/algorithms/danish/stop.txt"
snowball_stopwords = re.findall('^(\w+)', requests.get(url).text,
                                flags=re.MULTILINE | re.UNICODE)
nltk_stopwords = nltk.corpus.stopwords.words('danish')
snowball_stopwords == nltk_stopwords
A search with an Internet search engine on “Danish stopwords” reveals several other pointers to lists.
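Comparing two such lists boils down to set operations, as in this sketch with made-up miniature lists (not the real NLTK or spaCy lists):

```python
# Two made-up miniature stopword lists
nltk_like = {'og', 'i', 'jeg', 'det', 'at'}
spacy_like = {'og', 'i', 'jeg', 'det', 'at', 'otte', 'hvornår'}

only_in_spacy = sorted(spacy_like - nltk_like)
print(only_in_spacy)  # ['hvornår', 'otte']
```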