ChatGPT, programming exams and hidden data

ChatGPT presents a formidable challenge for teachers giving programming exams with open Internet access: ChatGPT usually solves introductory programming exams too easily. One possible way to circumvent a vanilla ChatGPT attack on a programming exam may be to hide some of the data. In my discussion with Tue Herlau (Scholia), he came up with the idea of hiding data from the ChatGPT prompt. In Python, we can hide the data in a pickle file:

import pickle

data = {
    'year': [2000, 2010, 2011, 2012],
    'harvest': [312, 123, 542, 422],
}

with open('data.pickle', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

Now we can give the student the following instructions as a programming exam:

Given the data in the pickle file ‘data.pickle’, compute the mean of the harvest for the years after 2005.

The student needs to examine the pickle file before writing the program. ChatGPT will generate some relevant code, but its handling of the data structure is wrong in this session:

Executing the code results in the error “TypeError: string indices must be integers”.

Unfortunately, it seems to be relatively easy to get ChatGPT to generate the correct code with a two-step approach. Taking part of the solution, we can examine the content of the data:

import pickle

# Load the pickle file
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

print(data)

That will give us "{'year': [2000, 2010, 2011, 2012], 'harvest': [312, 123, 542, 422]}"

Now we can extend the prompt:

Given the data in the pickle file ‘data.pickle’, compute the mean of the harvest for the years after 2005.
The data in the file is {'year': [2000, 2010, 2011, 2012], 'harvest': [312, 123, 542, 422]}

This yields working code that gives the correct result: 362.3333.
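
For reference, a correct solution could look like the following sketch (it re-creates the pickle from above so it is self-contained; the use of statistics.mean is my own choice, not necessarily what ChatGPT emits):

```python
import pickle
from statistics import mean

# Re-create the pickle from above so this sketch is self-contained
data = {
    'year': [2000, 2010, 2011, 2012],
    'harvest': [312, 123, 542, 422],
}
with open('data.pickle', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

# Load the hidden data and compute the mean harvest for years after 2005
with open('data.pickle', 'rb') as f:
    loaded = pickle.load(f)

harvests = [h for y, h in zip(loaded['year'], loaded['harvest']) if y > 2005]
print(round(mean(harvests), 4))  # 362.3333
```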

The conclusion is that hiding data makes it slightly more difficult to use ChatGPT to solve a programming exam, but at least in the simple case above, data hiding is not enough.

Extracting and counting variations of a word with a subword in a corpus

With a one-liner, one can count the number of times variations of a subword occur. Here with a corpus in the file “sentences.txt”, searching for words containing “påvirk”, and using Python for the matching:

cat sentences.txt | python -c "import re; p = re.compile(r'(\w*påvirk\w*)'); [print(m) for line in open(0).readlines() for m in p.findall(line.lower())]" | sort | uniq -c | sort -n

In the corpus I have from the DREAMS project, a part of the result is

    979 støjpåvirkningen
    988 miljøpåvirkningerne
   1389 støjpåvirkning
   1550 påvirkningerne
   1699 miljøpåvirkning
   2405 påvirker
   3858 påvirkes
   4130 miljøpåvirkninger
   6539 påvirket
   8483 påvirkningen
   9664 påvirke
   9876 påvirkninger
  25630 påvirkning

grep has an issue with \w and locales that I have not been able to resolve. This does not count correctly:

grep -Pio "\w*påvirk\w*" sentences.txt | sort | uniq -c | sort -n

A word such as “miljøpåvirkning” is not matched. A number of other Linux tools do not necessarily work the way one would expect; see also case conversion.

The Python one-liner can be converted to a script:

#!/usr/bin/python

import re, sys

if len(sys.argv) < 2:
    print("Missing word to search for")
    exit(1)

# Escape the argument so regex metacharacters in it are taken literally
pattern = re.compile(r'(\w*' + re.escape(sys.argv[1]) + r'\w*)')

for line in open(0).readlines():
    for match in pattern.findall(line.lower()):
        print(match)

Then it can be used with, e.g.,:

cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n
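
The counting steps of the shell pipeline (sort | uniq -c | sort -n) can also be sketched in pure Python with collections.Counter; the two example lines below are made up for illustration, standing in for the DREAMS corpus in “sentences.txt”:

```python
import re
from collections import Counter

# A tiny stand-in corpus; the post uses "sentences.txt" from the DREAMS project
lines = [
    "Støjen kan påvirke og har påvirket miljøpåvirkningen.",
    "Miljøpåvirkningen blev påvirket af støj.",
]

pattern = re.compile(r'\w*påvirk\w*')
counts = Counter(match
                 for line in lines
                 for match in pattern.findall(line.lower()))

# Equivalent of the sort | uniq -c | sort -n steps
for word, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{count:7} {word}")
```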

Part-of-speech tags in Det Centrale Ordregister

The Danish word registry Det Centrale Ordregister was launched at the end of May 2022. A file with the resource, called ro2021-0.9.cor, is available from the homepage. The part-of-speech tags used in the resource can be counted with this Python script:

#!/usr/bin/python

from collections import Counter

pos = []

for line in open("ro2021-0.9.cor"):
    parts = line.split('\t')
    # The part-of-speech tag is the part before the period
    # in the fourth tab-separated column
    pos.append(parts[3].split(".")[0])

counts = Counter(pos)

for word, count in counts.most_common():
    print(f"{count:6} {word}")

The result is

339523 sb
 92900 adj
 79533 vb
  1388 prop
   904 adv
   559 fork
   269 kolon
   238 talord
   196 flerord
   147 udråbsord
   101 pron
    96 præp
    64 konj
    59 præfiks
    36 lydord
     2 fsubj
     1 infinitivens
     1 art

“fsubj” is “formelt subjekt” (formal subject), a special Danish word type; see, e.g., “Der – som formelt subjekt i dansk” (Scholia). “kolon” I have never heard about before: words such as alstyrende, altforbarmende, amok, anno, anstigende, arilds and attentiden are labeled as “kolon”. “infinitivens” is the word “at” (“to” for the infinitive in English).

Verbs in Danish Dependency Treebank

The Danish Dependency Treebank (DDT) v. 1.0 was made by Matthias Trautner Kromann (Scholia) at Copenhagen Business School from 2002 to 2004 according to its README file. It is based on texts from the PAROLE corpus collected by Ole Norling-Christensen (Scholia) from 1983 to 1992. DDT’s XML files are available with the data.

Quick and dirty grepping in DDT on one of the XML files with

grep --text 'msd="V' ddt-1.0.tag | wc

reports 15,597 verbs in the dataset.

A bit more counting with

grep --text 'msd="V' ddt-1.0.tag | python3 -c "print('\n'.join(line.split('\"')[1] for line in open(0, encoding='iso-8859-1').readlines()))" | sort | uniq -c | sort -nr | wc

shows 1,551 unique verb lemmas. With the words written to a file with

… | uniq | sort > ddt-verb-lemmas.txt

a sample of these are:

åbne, accelerere, acceptere, administrere, adskille, advare, ændre, ærgre, ætse, afblæse, afbryde, afdække, affærdige, affinde
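
The grep-based lemma extraction can also be sketched in pure Python. The stand-in lines below only mimic what the pipeline above implies about the .tag format, namely that the first quoted attribute on a verb line is the lemma (an assumption on my part; the real file is ddt-1.0.tag):

```python
from collections import Counter

# Stand-in lines mimicking the implied DDT .tag format (an assumption):
# the first quoted attribute on a line with msd="V… is the lemma
lines = [
    '<W lemma="løbe" msd="VA:pres">løber</W>',
    '<W lemma="løbe" msd="VA:past">løb</W>',
    '<W lemma="åbne" msd="VA:pres">åbner</W>',
]

# Equivalent of: grep 'msd="V' | take first quoted value | sort | uniq -c
lemmas = Counter(line.split('"')[1] for line in lines if 'msd="V' in line)
print(lemmas.most_common())  # [('løbe', 2), ('åbne', 1)]
```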

How many of these lemmas are in Wikidata, which currently has over 2,900 Danish verbs?

import requests

ddt_verb_lemmas = set(open('ddt-verb-lemmas.txt').read().split())

query = """
SELECT ?lemma {
  ?lexeme dct:language wd:Q9035 ;
          wikibase:lemma ?lemma ;
          wikibase:lexicalCategory wd:Q24905 .
}
"""

url = "https://query.wikidata.org/sparql"
response = requests.get(url, params={'query': query, 'format': 'json'})
bindings = response.json()['results']['bindings']
wikidata_verb_lemmas = set(datum['lemma']['value'] for datum in bindings)

len(ddt_verb_lemmas - wikidata_verb_lemmas)

gives 269 missing verbs in Wikidata:

affærdige, barrikadere, bedyre, bekomme, belejre, belægge, belæsse, bemale, bemyndige, berettige, berømme, bestjæle, bestorme, betræde, beære, bilægge, bivåne, blusse, blødgøre, brodere, budgettere, bugne, defilere, demoralisere, deportere, destillere, destruere, detronisere, diktere, doble, drapere, dvæle, ekskludere, ekstraindkalde, fabrikslukke, fartbegrænse, fastansætte, fejl-operere, fetere, filosofere, fin-sortere, flade, flagre, flakke, flakse, flamme, flintre, flirte, focusere, foragte, forankre, foranledige, forbeholde, fordufte, forføre, forkalke, forlige, forlise, forlove, formilde, forpligtige, forrykke, forskyde, fortrække, fortvivle, forudsige, fradømme, fralægge, frankere, fraråde, fraskrive, frasortere, fratræde, fremholde, frikende, friske, frustrere, fuldføre, fusionere, fyge, fænge, geare, gennembanke, gennemopleve, gennemprøve, genoptrykke, gestikulere, gispe, gjalde, glatte, gløde, gravere, grunde, gyse, hage, havne, hefte, hegle, henholde, henkaste, hentyde, henvide, humpe, huse, hverve, hvine, hvisle, hyre, hæge, hærde, iblande, illudere, improvisere, indgribe, indhylle, indhøste, indkassere, indkvartere, indoktrinere, indordne, indpakke, indskyde, indsmugle, indstemme, indstifte, indvie, indvælge, jævnføre, kanonere, kante, kikke, kime, kimse, knirke, knuge, kolportere, kommandere, kompromittere, konkretisere, kriminalisere, krone, kue, kuldsejle, kvie, legalisere, lirke, lue, lædere, lække, læsetræne, læspe, løje, løsne, løsrive, medfølge, mestre, mishandle, modsætte, mæske, mønstre, mørkelægge, narre, neddæmpe, nedsable, nedsænke, nedværdige, nidstirre, niveaudele, nyinvestere, nødlande, oparbejde, opdigte, opfange, opfølge, ophidse, oplagre, oplive, opstøve, opsøge, optræne, opøve, overbeglo, overfylde, overhælde, overrende, overrumple, overskrue, overvurdere, passivisere, pervertere, plette, plombere, praktisere, prissætte, profitere, programmere, pryde, præferere, prøvekøre, puffe, pulverisere, påbyde, påklage, pønse, rafle, rappe, 
rasere, ratificere, rivalisere, rumstere, røbe, røge, sagsøge, sammenstille, sammenstykke, sammentømre, sample, samstemme, scanne, ses, signere, simre, skeje, skippe, sladre, slentre, slynge, sløve, smashe, smugle, snage, spekulere, spraye, stakke, stationere, sukke, surmule, svimle, sønderlemme, tackle, tangere, taxie, tilbede, tildanne, tilsende, tilstå, time, tippe, tjekkere, toppe, trisse, trone, trygle, tv-annoncere, ty, tøffe, udfritte, udglatte, udlære, udmærke, udskære, udstyre, udtørre, ulovliggøre, vandkæmme, virkeliggøre, våge, vænne, værdsætte

Some of these are homonyms with other words: flade, flamme, friske, glatte, grunde, hage, havne, huse, krone, mestre, mønstre, time, toppe.

Three have alternative lemma forms: indvi/indvie, kigge/kikke, tackle/takle. These are already in Wikidata, so actually only 266 verbs are missing in Wikidata.

One is an independent deponent verb: ses.

There are at least two errors in DDT: tjekkere, henvide. They occur in the sentences “Der er også radikale tjekkere, som må ændring holdning.” and “Det så vældig godt ud, der var også udleveret henved 100 fakler.” So 264 missing verbs in Wikidata.

Danish verbs in Wikidata and DanNet

Besides finding Danish verbs by looking through Danish corpora, see, e.g., Finding Danish verbs in Europarl with Python, you can also find Danish verbs from lexical resources.

Danish verbs are available in DanNet, the Danish wordnet. They can be extracted from the RDF XML files, here with a Python script relying on rdflib and a SPARQL query:

from rdflib import Graph

filenames = "words part_of_speech".split()

g = Graph()
for filename in filenames:
    g.parse(filename + '.rdf')

query = """
SELECT ?word ?representation {
  ?word dn_schema:partOfSpeech "Verb" .
  ?word wn20schema:lexicalForm ?representation .
}""" 

result = g.query(
    query,
    initNs={
        'dn': 'http://www.wordnet.dk/owl/instance/2009/03/instances/',
        'dn_schema': 'http://www.wordnet.dk/owl/instance/2009/03/schema/',
        'wn20schema': 'http://www.w3.org/2006/03/wn/wn20/schema/',
    })

n = 0
for row in sorted(result, key=lambda row: row[1]):
    if ' ' not in row[1]:
        n += 1
        print("{:11} {}".format(str(row[0])[58:], row[1]))

print("Total number of single-word verbs: {}".format(n))

The full list returns 5,581 entries. A number of these verbs include the reflexive “sig” (e.g., “forvente sig”, “foræde sig”) or adverbs (e.g., “abe efter”, “arbejde over”). Excluding these (the ' ' not in row[1] condition), one ends up with 3,994 single words. Among those words are a number I do not recall having run into before: pizzikere, afhjelme, indsukre, …

In Wikidata, the DanNet word identifier may be entered using the P6140 property.

Not all Danish verbs are in DanNet. In the beginning of January 2022, there were 2,862 Danish verbs in Wikidata. This number ranks Danish as number 11 in terms of the number of verbs entered; see Ordia’s statistics, where Indonesian is ahead with 12,781 verbs, followed by Estonian with 7,932. In Wikidata, I have marked verbs not available in DanNet with Wikidata’s novalue, so they can be queried with the Wikidata Query Service and this SPARQL query:

SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 ;
          a wdno:P6140 .
}

In January 2022, there were around 580 Danish verbs in Wikidata marked as not being in DanNet, e.g., dibse, indebære, erfare, misforstå and opgradere. Such words are fairly common, unlike DanNet’s “afhjelme”, a word that can cause confusion according to a text at the Danish parliament website.

Not all of Wikidata’s Danish verbs indicate whether the word is in DanNet or not. The Wikidata Query Service can tell us about these words:

SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 .
  FILTER NOT EXISTS { ?lexeme p:P6140 [] }
}

As of 6 January 2022, there were 387 Danish verbs where the DanNet status was not indicated. This may be because I forgot to enter it, because of a typo, or because the case was difficult. For “difficult”, consider the word “sigte”: it has three DanNet verb homograph lexemes (aim, charge, sieve). The Wikidata lexeme L207570 had mixed these lexemes through links to the lexemes tilsigte and sigtelse, as far as I can tell.

GitHub statistics for DTU Compute researchers

DTU Compute is a department at the Technical University of Denmark. How many followers do DTU Compute researchers have on GitHub? This Python code attempts to answer that question.

Code

This is based on the code at my previous blogpost Female GitHubbers.

import re
import requests
import pandas as pd
from time import sleep

query = """
SELECT
  ?researcher ?researcherLabel ?researcherDescription
  (SAMPLE(?github_) AS ?github)
WITH {
  SELECT DISTINCT ?researcher WHERE {
    ?researcher ( wdt:P108 | wdt:P463 | wdt:P1416 ) / wdt:P361* wd:Q23048689 . 
  } 
} AS %researchers
WHERE {
  INCLUDE %researchers
  ?researcher wdt:P2037 ?github_ .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,nl,no,ru,sv,zh" . } 
}
GROUP BY ?researcher ?researcherLabel ?researcherDescription 
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])


URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    print(url)
    try:
        response = requests.get(url,
                                headers={'Accept':'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        # Fall back to zero followers on request or parsing errors
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))
    sleep(5)


researchers['followers'] = followers
columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']

print(researchers.sort_values(['followers'], ascending=False)[columns].to_html(index=False))

Results

As is apparent from the code, the statistics are based on the labelling in Wikidata, which is likely incomplete: people not occurring on the list might not be on GitHub, might not be recorded in Wikidata as being on GitHub, or might not be recorded as being associated with DTU Compute.

followers  github.value         researcherLabel.value        researcher.value
      555  andersbll            Anders Boesen Lindbo Larsen  http://www.wikidata.org/entity/Q28829121
      537  rasmusbergpalm       Rasmus Berg Palm             http://www.wikidata.org/entity/Q42189239
      179  fnielsen             Finn Årup Nielsen            http://www.wikidata.org/entity/Q20980928
       79  SkafteNicki          Nicki Skafte Detlefsen       http://www.wikidata.org/entity/Q57080372
       57  larsmaaloee          Lars Maaløe                  http://www.wikidata.org/entity/Q29016760
       35  h0pbeat              Arkadiusz Stopczynski        http://www.wikidata.org/entity/Q28045211
       33  suneman              Sune Lehmann                 http://www.wikidata.org/entity/Q24390693
       26  pmorenoz             Pablo Moreno-Muñoz           http://www.wikidata.org/entity/Q90363825
       24  janba                Jakob Andreas Bærentzen      http://www.wikidata.org/entity/Q25939218
       19  apengsigkarup        Allan P. Engsig-Karup        http://www.wikidata.org/entity/Q24449285
       16  baekgaard            Per Bækgaard                 http://www.wikidata.org/entity/Q28045348
       12  Bonnevie             Rasmus Bonnevie              http://www.wikidata.org/entity/Q28681282
        8  bardram              Jakob E. Bardram             http://www.wikidata.org/entity/Q24389216
        5  sfvnielsen           Søren Føns Vind Nielsen      http://www.wikidata.org/entity/Q28948833
        5  STherese             Sofie Therese Hansen         http://www.wikidata.org/entity/Q28477916
        4  MichaelRiis          Michael Riis Andersen        http://www.wikidata.org/entity/Q30169795
        3  bjarkemoensted       Bjarke Mønsted               http://www.wikidata.org/entity/Q56120499
        1  KristofferAlbers     Kristoffer Jon Albers        http://www.wikidata.org/entity/Q28845839
        1  michaelkaipetersen   Michael Kai Petersen         http://www.wikidata.org/entity/Q24573646
        0  andrea-cuttone       Andrea Cuttone               http://www.wikidata.org/entity/Q28045195
        0  jakobeglarsen        Jakob Eg Larsen              http://www.wikidata.org/entity/Q25931571
        0  North-Guard          Jeppe Nørregaard             http://www.wikidata.org/entity/Q44738835
        0  laura-rieger         Laura Rieger                 http://www.wikidata.org/entity/Q48975801
        0  letizia-marchegiani  Letizia Marchegiani          http://www.wikidata.org/entity/Q56363348
        0  ekkart               Ekkart Kindler               http://www.wikidata.org/entity/Q25939354

The top two on the list, Anders Boesen Lindbo Larsen and Rasmus Berg Palm, are no longer at DTU Compute, as far as I can tell.

First experiments with the T0 Hugging Face language model

The T0 models were released in October 2021 and are available via Hugging Face (see bigscience/T0pp); they are described in the paper Multitask Prompted Training Enables Zero-Shot Task Generalization (Scholia). The researchers behind the model claim: “The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.”

The small language model, T0_3B, contains 3 billion parameters and fills up 11 gigabytes of disk space at ~/.cache/huggingface/transformers/a80e28…

After setting up protobuf, torch and transformers, the model can be downloaded automatically and tests can be run. On the Hugging Face webpage, there are a few lines of Python code with a sentiment analysis example, here converted to use the small model and edited:

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy", return_tensors="pt"))[0]))
<pad> Positive</s>

It is unclear to me how well these large pre-trained language models handle languages other than English. My knowledge of prompt engineering is also limited, so the examples below are my naive first-shot attempts:

>>> print(tokenizer.decode(model.generate(tokenizer.encode("Hvem er statsminister i Danmark?", return_tensors="pt"))[0]))
<pad> <unk>ystein Svensson</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("English: Copenhagen is the capital. Danish:", return_tensors="pt"))[0]))
<pad> Copenhagen is the capital of Denmark.</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("English: Copenhagen is the capital. Translate to French:", return_tensors="pt"))[0]))
<pad> Copenhagen is the capital of Denmark.</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("What city is the second largest in Denmark?", return_tensors="pt"))[0]))
<pad> Copenhagen</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote Der er et yndigt land?", return_tensors="pt"))[0]))
<pad> Theodore Roosevelt</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote the song 'She loves you'?", return_tensors="pt"))[0]))
<pad> John Lennon and Paul McCartney</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote 'Der er et yndigt land'?", return_tensors="pt"))[0]))
<pad> <unk>sgeir <unk>sgeirsson</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Question: Who wrote 'Der er et yndigt land'? Answer:", return_tensors="pt"))[0]))
<pad> Henrik Ibsen</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Question: What city is the second largest in Denmark? Answer:", return_tensors="pt"))[0]))
<pad> Copenhagen</s>

Each of these answers takes over 20 seconds to complete on the initial CPU-based system I used.

On the non-Danish question “Who wrote the song ‘She loves you’?” the model gets the answer right, but for the Danish questions it fails.

For the question “Who wrote Der er et yndigt land?”, i.e., who wrote the Danish national anthem, T0_3B incorrectly answers Theodore Roosevelt or Henrik Ibsen, depending on the prompt, while the Google search engine returns “Adam Oehlenschläger” for me.

The question can be converted to SPARQL for submission to the Wikidata Query Service:

SELECT ?who {
  ?work rdfs:label 'Der er et yndigt land'@da ;   
        ( wdt:P50 | wdt:P676 | wdt:P86 | wdt:P58 | wdt:P2679 | wdt:P2680 ) / rdfs:label ?who .
  FILTER (LANG(?who) = 'en')
}

The result there is

Adam Oehlenschläger
Morten Arnfred
Jørgen Ljungdalh
Hans Ernst Krøyer

Oehlenschläger is the author of the text, Krøyer the composer, and Arnfred and Ljungdalh are the screenwriters of a Danish film with the same title as the anthem.

Simon Razniewski (Scholia), Gerhard Weikum (Scholia) and colleagues have recently published their DL4KG 2021 paper Language Models As or For Knowledge Bases (Scholia), where they contemplate the limitations and advantages of language models versus knowledge bases/graphs. They have had access to the GPT-3 language model:

Example: GPT-3 does not have tangible knowledge that Alan Turing was born in London; it merely assigns this a high confidence of 83%. Yann LeCun, on the other hand, is given medium confidence in being a citizen of France and Canada (67% and 26%), but he actually has French and USA citizenship, not Canadian. The LM assigns USA a very low score. The Wikidata KB, on the other hand, only states his French citizenship, not USA. Wikidata is incomplete, but it does not contain any errors.

Language Models As or For Knowledge Bases, page 2

Danish speech recognition April 2021

Danish speech recognition is a developing field. There is a general concern about the lack of good open annotated speech data to train from, and within the sprogteknologi.dk project there is an effort to help establish more data. The Awesome Danish list records few spoken language corpora. Mozilla’s Common Voice has still not been set up for Danish. NST is conveniently licensed under CC0, and the 22 kHz speech corpus is 6.7 GB. An example sentence is “Da jeg havde travlt og blev utålmodig ,<komma> tilbød jeg at betale for frikadellerne .<punktum>”. The punctuation is read aloud. In Python, the raw 16-bit audio files may be read with A = np.fromfile('ST1/080601/003_PSA/DA_PSA01.001', dtype='>i2'), as far as I can determine.
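
A minimal, self-contained sketch of reading such a raw big-endian 16-bit file with NumPy (the file here is synthetic; the real corpus uses paths like the one mentioned above):

```python
import numpy as np

# Write a tiny synthetic raw file with three big-endian 16-bit samples
np.array([0, 1000, -1000], dtype='>i2').tofile('sample.raw')

# Read it back the way the NST audio files are read: '>i2' is big-endian int16
A = np.fromfile('sample.raw', dtype='>i2')
print(A.tolist())  # [0, 1000, -1000]
```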

The speech recognition tool Danspeech is the work of Lars Kai Hansen‘s students, Martin Carsten Nielsen and Rasmus Arpe Fogh Jensen, and is available as a Python package. There is an associated demo repository. I was able, with a bit of hassle, to install danspeech and danspeechdemo in a Python 3.7 virtual environment with clones from the GitHub repositories, not the danspeech version on the cheeseshop (though a fix may now be in place). The demo brings up a webpage where sound can be recorded and transcribed.

My first attempt was a sentence from a psychiatry book with a good number of unusual words: “En tilstand efter indtagelse af et psykoaktivt stof, som medfører forstyrrelser af bevidsthedsniveau, kognitive funktioner, perception, affekt, adfærd eller andre psykofysiologiske funktioner og reaktioner.” It is not clear to me which kind of configuration is best for such a sentence, and my pronunciation may not be the best. The transcription below makes errors for the form of “psykoaktive/t”, “affect/effect” and the unusual word “psykofysiologiske/psykosocial giske”, where “giske” is not a Danish word. And punctuation is missing.

Google’s speech-to-text is another tool for Danish speech recognition. The service is freemium, with a demo on its webpage. The transcription of the same sentence as above, though perhaps spoken slightly differently, is shown. I was not able to get it to show the full sentence.

Female GitHubbers

In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There are not that many: currently just 27.

The Python code below reads the SPARQL results into a Python Pandas DataFrame, queries the GitHub API for follower counts and adds the information as a dataframe column. Then we can rank the female GitHub users according to follower count and format the results in an HTML table.

Code

import re
import requests
import pandas as pd
from time import sleep

query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("https://github.com/", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])

URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    try:
        response = requests.get(url,
                                headers={'Accept':'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        # Fall back to zero followers on request or parsing errors
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))
    sleep(5)

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(['followers'], ascending=False)[columns].to_html(index=False))

Results

The top one is Jennifer Bryan, a Vancouver statistician whom I do not know much about, but she seems to be involved in RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat has apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wiki world: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners”. Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers  github.value         researcherLabel.value       researcher.value
     1675  jennybc              Jennifer Bryan              http://www.wikidata.org/entity/Q40579104
     1299  jesstess             Jessica McKellar            http://www.wikidata.org/entity/Q19667922
      475  triketora            Tracy Chou                  http://www.wikidata.org/entity/Q24238925
      347  olgabot              Olga B. Botvinnik           http://www.wikidata.org/entity/Q44163048
      124  vsoch                Vanessa V. Sochat           http://www.wikidata.org/entity/Q30133235
       84  brainwane            Sumana Harihareswara        http://www.wikidata.org/entity/Q18912181
       75  lydiapintscher       Lydia Pintscher             http://www.wikidata.org/entity/Q18016466
       56  agbeltran            Alejandra González-Beltrán  http://www.wikidata.org/entity/Q27824575
       22  frimelle             Lucie-Aimée Kaffee          http://www.wikidata.org/entity/Q37860261
       21  isabelleaugenstein   Isabelle Augenstein         http://www.wikidata.org/entity/Q30338957
       20  cnap                 Courtney Napoles            http://www.wikidata.org/entity/Q42797251
       15  tudorache            Tania Tudorache             http://www.wikidata.org/entity/Q29053249
       13  vedina               Nina Jeliazkova             http://www.wikidata.org/entity/Q27061849
       11  mkutmon              Martina Summer-Kutmon       http://www.wikidata.org/entity/Q27987764
        7  caoyler              Catalina Wilmers            http://www.wikidata.org/entity/Q38915853
        7  esterpantaleo        Ester Pantaleo              http://www.wikidata.org/entity/Q28949490
        6  NuriaQueralt         Núria Queralt Rosinach      http://www.wikidata.org/entity/Q29644228
        2  rongwangnu           Rong Wang                   http://www.wikidata.org/entity/Q35178434
        2  lschiff              Lisa Schiff                 http://www.wikidata.org/entity/Q38916007
        1  SigridK              Sigrid Klerke               http://www.wikidata.org/entity/Q28152723
        1  amrapalijz           Amrapali Zaveri             http://www.wikidata.org/entity/Q34315853
        1  mesbahs              Sepideh Mesbah              http://www.wikidata.org/entity/Q30098458
        1  ChristineChichester  Christine Chichester        http://www.wikidata.org/entity/Q19845665
        1  BinaryStars          Shima Dastgheib             http://www.wikidata.org/entity/Q42091042
        1  mollymking           Molly M. King               http://www.wikidata.org/entity/Q40705344
        0  jannahastings        Janna Hastings              http://www.wikidata.org/entity/Q27902110
        0  nmjakobsen           Nina Munkholt Jakobsen      http://www.wikidata.org/entity/Q38674430

Danish stopword lists

Python’s NLTK package has some support for Danish and there is a small list of 94 stopwords. They are available with

>>> import nltk
>>> nltk.corpus.stopwords.words('danish')

MIT-licensed spaCy is another NLP Python package. The support for Danish is still limited, but it has a stopword list. With version 2+ of spaCy, it is available from:

>>> from spacy.lang.da.stop_words import STOP_WORDS

spaCy 2.0.3 has 219 words in that list.

MIT-licensed “stopwords-iso” has a list of 170 words (October 2016 version). They are available from the GitHub repo at https://github.com/stopwords-iso.

The Snowball stemmer has 94 words at http://snowball.tartarus.org/algorithms/danish/stop.txt.

In R, the GPL-3-licensed tm package uses the Snowball stemmer stopword list. The 94 words are available with:

> install.packages("tm")
> library(tm)
> stopwords(kind="da")

The NLTK stopwords are also the same as the Snowball stopwords. This can be checked with:

import re
import nltk
import requests

url = "http://snowball.tartarus.org/algorithms/danish/stop.txt"
snowball_stopwords = re.findall(r'^(\w+)', requests.get(url).text,
                                flags=re.MULTILINE | re.UNICODE)
nltk_stopwords = nltk.corpus.stopwords.words('danish')
snowball_stopwords == nltk_stopwords

A search with an Internet search engine on “Danish stopwords” reveals several other pointers to lists.