ChatGPT, programming exams and hidden data

ChatGPT presents a formidable challenge for teachers giving programming exams with open Internet access: ChatGPT usually solves introductory programming exams too easily. One possible way to circumvent a vanilla ChatGPT attack on a programming exam may be to hide some of the data. In my discussion with Tue Herlau (Scholia), he came up with the idea of hiding data from the ChatGPT prompt. In Python, we can hide the data in a pickle file:

import pickle

data = {
    'year': [2000, 2010, 2011, 2012],
    'harvest': [312, 123, 542, 422],
}

with open('data.pickle', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

Now we can give the student the following instructions as a programming exam:

Given the data in the pickle file ‘data.pickle’, compute the mean of the harvest for the years after 2005.

The student needs to examine the pickle file before writing the program. ChatGPT will generate some relevant code, but its handling of the data structure is wrong in this session:

Executing the code results in the error “TypeError: string indices must be integers”.

Unfortunately, it seems to be relatively easy to get ChatGPT to generate the correct code with a two-step approach. Taking part of the solution, we can examine the content of the data:

import pickle

# Load the pickle file
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

print(data)

That will give us "{'year': [2000, 2010, 2011, 2012], 'harvest': [312, 123, 542, 422]}"

Now we can extend the prompt:

Given the data in the pickle file ‘data.pickle’, compute the mean of the harvest for the years after 2005.
The data in the file is {'year': [2000, 2010, 2011, 2012], 'harvest': [312, 123, 542, 422]}

This yields working code that gives the correct result: 362.3333.
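
For reference, a correct solution could look like the following sketch (it re-creates the pickle from above so it is self-contained; the use of statistics.mean is my own choice, not necessarily what ChatGPT emits):

```python
import pickle
from statistics import mean

# Re-create the pickle from above so this sketch is self-contained
data = {
    'year': [2000, 2010, 2011, 2012],
    'harvest': [312, 123, 542, 422],
}
with open('data.pickle', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

# Load the hidden data and compute the mean harvest for years after 2005
with open('data.pickle', 'rb') as f:
    loaded = pickle.load(f)

harvests = [h for y, h in zip(loaded['year'], loaded['harvest']) if y > 2005]
print(round(mean(harvests), 4))  # 362.3333
```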

The conclusion is that hiding data makes it slightly more difficult to use ChatGPT to solve a programming exam, but at least in the simple case above, data hiding is not enough.

Extracting and counting variations of a word with a subword in a corpus

With a one-liner, one can count the number of times variations of a subword occur. Here with a corpus in the file “sentences.txt”, searching for words containing “påvirk”, and using Python for the matching:

cat sentences.txt | python -c "import re; p = re.compile(r'(\w*påvirk\w*)'); [print(m) for line in open(0).readlines() for m in p.findall(line.lower())]" | sort | uniq -c | sort -n

In the corpus I have from the DREAMS project, a part of the result is

    979 støjpåvirkningen
    988 miljøpåvirkningerne
   1389 støjpåvirkning
   1550 påvirkningerne
   1699 miljøpåvirkning
   2405 påvirker
   3858 påvirkes
   4130 miljøpåvirkninger
   6539 påvirket
   8483 påvirkningen
   9664 påvirke
   9876 påvirkninger
  25630 påvirkning

grep has an issue with \w and locales that I have not been able to resolve. This does not count correctly:

grep -Pio "\w*påvirk\w*" sentences.txt | sort | uniq -c | sort -n

A word such as “miljøpåvirkning” is not matched. A number of other Linux tools do not necessarily work the way one would expect; see also case conversion.

The Python one-liner can be converted to a script:

#!/usr/bin/python

import re, sys

if len(sys.argv) < 2:
    print("Missing word to search for")
    exit(1)

# Escape the argument so regex metacharacters in it are taken literally
pattern = re.compile(r'(\w*' + re.escape(sys.argv[1]) + r'\w*)')

for line in open(0).readlines():
    for match in pattern.findall(line.lower()):
        print(match)

Then it can be used with, e.g.,:

cat sentences.txt | extract-word påvirk | sort | uniq -c | sort -n
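
The counting steps of the shell pipeline (sort | uniq -c | sort -n) can also be sketched in pure Python with collections.Counter; the two example lines below are made up for illustration, standing in for the DREAMS corpus in “sentences.txt”:

```python
import re
from collections import Counter

# A tiny stand-in corpus; the post uses "sentences.txt" from the DREAMS project
lines = [
    "Støjen kan påvirke og har påvirket miljøpåvirkningen.",
    "Miljøpåvirkningen blev påvirket af støj.",
]

pattern = re.compile(r'\w*påvirk\w*')
counts = Counter(match
                 for line in lines
                 for match in pattern.findall(line.lower()))

# Equivalent of the sort | uniq -c | sort -n steps
for word, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{count:7} {word}")
```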

Part-of-speech tags in Det Centrale Ordregister

The Danish word registry Det Centrale Ordregister was launched at the end of May 2022. A file with the resource, called ro2021-0.9.cor, is available from the homepage. The part-of-speech tags used in the resource can be counted with this Python script:

#!/usr/bin/python

from collections import Counter

pos = []

for line in open("ro2021-0.9.cor"):
    parts = line.split('\t')
    # The part-of-speech tag is the part before the period
    # in the fourth tab-separated column
    pos.append(parts[3].split(".")[0])

counts = Counter(pos)

for word, count in counts.most_common():
    print(f"{count:6} {word}")

The result is

339523 sb
 92900 adj
 79533 vb
  1388 prop
   904 adv
   559 fork
   269 kolon
   238 talord
   196 flerord
   147 udråbsord
   101 pron
    96 præp
    64 konj
    59 præfiks
    36 lydord
     2 fsubj
     1 infinitivens
     1 art

“fsubj” is “formelt subjekt” (formal subject), a special Danish word type; see, e.g., “Der – som formelt subjekt i dansk” (Scholia). “kolon” I have never heard about before: words such as alstyrende, altforbarmende, amok, anno, anstigende, arilds and attentiden are labeled as “kolon”. “infinitivens” is the word “at” (“to” for the infinitive in English).

Verbs in Danish Dependency Treebank

The Danish Dependency Treebank (DDT) v. 1.0 was made by Matthias Trautner Kromann (Scholia) at Copenhagen Business School from 2002 to 2004 according to its README file. It is based on texts from the PAROLE corpus collected by Ole Norling-Christensen (Scholia) from 1983 to 1992. DDT’s XML files are available with the data.

Quick and dirty grepping in DDT on one of the XML files with

grep --text 'msd="V' ddt-1.0.tag | wc

reports 15,597 verbs in the dataset.

A bit more counting with

grep --text 'msd="V' ddt-1.0.tag | python3 -c "print('\n'.join(line.split('\"')[1] for line in open(0, encoding='iso-8859-1').readlines()))" | sort | uniq -c | sort -nr | wc

shows 1,551 unique verb lemmas. With the words written to a file with

… | uniq | sort > ddt-verb-lemmas.txt

a sample of these are:

åbne, accelerere, acceptere, administrere, adskille, advare, ændre, ærgre, ætse, afblæse, afbryde, afdække, affærdige, affinde
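
The grep-based lemma extraction can also be sketched in pure Python. The stand-in lines below only mimic what the pipeline above implies about the .tag format, namely that the first quoted attribute on a verb line is the lemma (an assumption on my part; the real file is ddt-1.0.tag):

```python
from collections import Counter

# Stand-in lines mimicking the implied DDT .tag format (an assumption):
# the first quoted attribute on a line with msd="V… is the lemma
lines = [
    '<W lemma="løbe" msd="VA:pres">løber</W>',
    '<W lemma="løbe" msd="VA:past">løb</W>',
    '<W lemma="åbne" msd="VA:pres">åbner</W>',
]

# Equivalent of: grep 'msd="V' | take first quoted value | sort | uniq -c
lemmas = Counter(line.split('"')[1] for line in lines if 'msd="V' in line)
print(lemmas.most_common())  # [('løbe', 2), ('åbne', 1)]
```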

How many of these lemmas are in Wikidata, which currently has over 2,900 Danish verbs?

import requests

ddt_verb_lemmas = set(open('ddt-verb-lemmas.txt').read().split())

query = """
SELECT ?lemma {
  ?lexeme dct:language wd:Q9035 ;
          wikibase:lemma ?lemma ;
          wikibase:lexicalCategory wd:Q24905 .
}
"""

url = "https://query.wikidata.org/sparql"
response = requests.get(url, params={'query': query, 'format': 'json'})
bindings = response.json()['results']['bindings']
wikidata_verb_lemmas = set(datum['lemma']['value'] for datum in bindings)

len(ddt_verb_lemmas - wikidata_verb_lemmas)

gives 269 missing verbs in Wikidata:

affærdige, barrikadere, bedyre, bekomme, belejre, belægge, belæsse, bemale, bemyndige, berettige, berømme, bestjæle, bestorme, betræde, beære, bilægge, bivåne, blusse, blødgøre, brodere, budgettere, bugne, defilere, demoralisere, deportere, destillere, destruere, detronisere, diktere, doble, drapere, dvæle, ekskludere, ekstraindkalde, fabrikslukke, fartbegrænse, fastansætte, fejl-operere, fetere, filosofere, fin-sortere, flade, flagre, flakke, flakse, flamme, flintre, flirte, focusere, foragte, forankre, foranledige, forbeholde, fordufte, forføre, forkalke, forlige, forlise, forlove, formilde, forpligtige, forrykke, forskyde, fortrække, fortvivle, forudsige, fradømme, fralægge, frankere, fraråde, fraskrive, frasortere, fratræde, fremholde, frikende, friske, frustrere, fuldføre, fusionere, fyge, fænge, geare, gennembanke, gennemopleve, gennemprøve, genoptrykke, gestikulere, gispe, gjalde, glatte, gløde, gravere, grunde, gyse, hage, havne, hefte, hegle, henholde, henkaste, hentyde, henvide, humpe, huse, hverve, hvine, hvisle, hyre, hæge, hærde, iblande, illudere, improvisere, indgribe, indhylle, indhøste, indkassere, indkvartere, indoktrinere, indordne, indpakke, indskyde, indsmugle, indstemme, indstifte, indvie, indvælge, jævnføre, kanonere, kante, kikke, kime, kimse, knirke, knuge, kolportere, kommandere, kompromittere, konkretisere, kriminalisere, krone, kue, kuldsejle, kvie, legalisere, lirke, lue, lædere, lække, læsetræne, læspe, løje, løsne, løsrive, medfølge, mestre, mishandle, modsætte, mæske, mønstre, mørkelægge, narre, neddæmpe, nedsable, nedsænke, nedværdige, nidstirre, niveaudele, nyinvestere, nødlande, oparbejde, opdigte, opfange, opfølge, ophidse, oplagre, oplive, opstøve, opsøge, optræne, opøve, overbeglo, overfylde, overhælde, overrende, overrumple, overskrue, overvurdere, passivisere, pervertere, plette, plombere, praktisere, prissætte, profitere, programmere, pryde, præferere, prøvekøre, puffe, pulverisere, påbyde, påklage, pønse, rafle, rappe, 
rasere, ratificere, rivalisere, rumstere, røbe, røge, sagsøge, sammenstille, sammenstykke, sammentømre, sample, samstemme, scanne, ses, signere, simre, skeje, skippe, sladre, slentre, slynge, sløve, smashe, smugle, snage, spekulere, spraye, stakke, stationere, sukke, surmule, svimle, sønderlemme, tackle, tangere, taxie, tilbede, tildanne, tilsende, tilstå, time, tippe, tjekkere, toppe, trisse, trone, trygle, tv-annoncere, ty, tøffe, udfritte, udglatte, udlære, udmærke, udskære, udstyre, udtørre, ulovliggøre, vandkæmme, virkeliggøre, våge, vænne, værdsætte

Some of these are homonyms with other words: flade, flamme, friske, glatte, grunde, hage, havne, huse, krone, mestre, mønstre, time, toppe.

Three have alternative lemma forms: indvi/indvie, kigge/kikke, tackle/takle. These are already in Wikidata, so actually only 266 verbs are missing in Wikidata.

One is an independent deponent verb: ses.

There are at least two errors in DDT: tjekkere, henvide. They occur in the sentences “Der er også radikale tjekkere, som må ændring holdning.” and “Det så vældig godt ud, der var også udleveret henved 100 fakler.” So 264 missing verbs in Wikidata.

Danish verbs in Wikidata and DanNet

Besides finding Danish verbs by looking through Danish corpora, see, e.g., Finding Danish verbs in Europarl with Python, you can also find Danish verbs from lexical resources.

Danish verbs are available in DanNet, the Danish wordnet. They can be extracted from the RDF XML files, here with a Python script relying on rdflib and a SPARQL query:

from rdflib import Graph

filenames = "words part_of_speech".split()

g = Graph()
for filename in filenames:
    g.parse(filename + '.rdf')

query = """
SELECT ?word ?representation {
  ?word dn_schema:partOfSpeech "Verb" .
  ?word wn20schema:lexicalForm ?representation .
}""" 

result = g.query(
    query,
    initNs={
        'dn': 'http://www.wordnet.dk/owl/instance/2009/03/instances/',
        'dn_schema': 'http://www.wordnet.dk/owl/instance/2009/03/schema/',
        'wn20schema': 'http://www.w3.org/2006/03/wn/wn20/schema/',
    })

n = 0
for row in sorted(result, key=lambda row: row[1]):
    if ' ' not in row[1]:
        n += 1
        print("{:11} {}".format(str(row[0])[58:], row[1]))

print("Total number of single-word verbs: {}".format(n))

The full list returns 5,581 entries. A number of these verbs include the reflexive “sig” (e.g., “forvente sig”, “foræde sig”) or adverbs (e.g., “abe efter”, “arbejde over”). Excluding these (the ' ' not in row[1] condition), one ends up with 3,994 single words. Among those words are a number I do not recall having run into before: pizzikere, afhjelme, indsukre, …

In Wikidata, the DanNet word identifier may be entered using the P6140 property.

Not all Danish verbs are in DanNet. In the beginning of January 2022, there were 2,862 Danish verbs in Wikidata. This number ranks Danish as number 11 in terms of the number of verbs entered; see Ordia’s statistics, where Indonesian is ahead with 12,781 verbs, followed by Estonian with 7,932. In Wikidata, I have marked verbs not available in DanNet with Wikidata’s novalue, so they can be queried with the Wikidata Query Service and this SPARQL query:

SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 ;
          a wdno:P6140 .
}

In January 2022, there were around 580 Danish verbs in Wikidata marked as not being in DanNet, e.g., dibse, indebære, erfare, misforstå and opgradere. Such words are fairly common, unlike DanNet’s “afhjelme”, a word that can cause confusion according to a text at the Danish parliament website.

Not all of Wikidata’s Danish verbs indicate whether the word is in DanNet or not. The Wikidata Query Service can tell us about these words:

SELECT ?lexeme ?lemma {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language wd:Q9035 ;
          wikibase:lexicalCategory wd:Q24905 .
  FILTER NOT EXISTS { ?lexeme p:P6140 [] }
}

As of 6 January 2022, there were 387 Danish verbs where the DanNet status was not indicated. This may be because I forgot to enter it, because of a typo, or because the case was difficult. For “difficult”, consider the word “sigte”: it has three DanNet verb homograph lexemes (aim, charge, sieve). The Wikidata lexeme L207570 had mixed these lexemes through links to the lexemes tilsigte and sigtelse, as far as I can tell.

GitHub statistics for DTU Compute researchers

DTU Compute is a department at the Technical University of Denmark. How many followers do DTU Compute researchers have on GitHub? This Python code attempts to answer that question.

Code

This is based on the code at my previous blogpost Female GitHubbers.

import re
import requests
import pandas as pd
from time import sleep

query = """
SELECT
  ?researcher ?researcherLabel ?researcherDescription
  (SAMPLE(?github_) AS ?github)
WITH {
  SELECT DISTINCT ?researcher WHERE {
    ?researcher ( wdt:P108 | wdt:P463 | wdt:P1416 ) / wdt:P361* wd:Q23048689 . 
  } 
} AS %researchers
WHERE {
  INCLUDE %researchers
  ?researcher wdt:P2037 ?github_ .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,nl,no,ru,sv,zh" . } 
}
GROUP BY ?researcher ?researcherLabel ?researcherDescription 
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])


URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    print(url)
    try:
        response = requests.get(url,
                                headers={'Accept':'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        # Fall back to zero followers on request or parsing errors
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))
    sleep(5)


researchers['followers'] = followers
columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']

print(researchers.sort_values(['followers'], ascending=False)[columns].to_html(index=False))

Results

As is apparent from the code, the statistics are based on the labelling in Wikidata, which is likely incomplete: people not occurring on the list might not be on GitHub, might not be recorded in Wikidata as being on GitHub, or might not be recorded as being associated with DTU Compute.

followers  github.value         researcherLabel.value        researcher.value
      555  andersbll            Anders Boesen Lindbo Larsen  http://www.wikidata.org/entity/Q28829121
      537  rasmusbergpalm       Rasmus Berg Palm             http://www.wikidata.org/entity/Q42189239
      179  fnielsen             Finn Årup Nielsen            http://www.wikidata.org/entity/Q20980928
       79  SkafteNicki          Nicki Skafte Detlefsen       http://www.wikidata.org/entity/Q57080372
       57  larsmaaloee          Lars Maaløe                  http://www.wikidata.org/entity/Q29016760
       35  h0pbeat              Arkadiusz Stopczynski        http://www.wikidata.org/entity/Q28045211
       33  suneman              Sune Lehmann                 http://www.wikidata.org/entity/Q24390693
       26  pmorenoz             Pablo Moreno-Muñoz           http://www.wikidata.org/entity/Q90363825
       24  janba                Jakob Andreas Bærentzen      http://www.wikidata.org/entity/Q25939218
       19  apengsigkarup        Allan P. Engsig-Karup        http://www.wikidata.org/entity/Q24449285
       16  baekgaard            Per Bækgaard                 http://www.wikidata.org/entity/Q28045348
       12  Bonnevie             Rasmus Bonnevie              http://www.wikidata.org/entity/Q28681282
        8  bardram              Jakob E. Bardram             http://www.wikidata.org/entity/Q24389216
        5  sfvnielsen           Søren Føns Vind Nielsen      http://www.wikidata.org/entity/Q28948833
        5  STherese             Sofie Therese Hansen         http://www.wikidata.org/entity/Q28477916
        4  MichaelRiis          Michael Riis Andersen        http://www.wikidata.org/entity/Q30169795
        3  bjarkemoensted       Bjarke Mønsted               http://www.wikidata.org/entity/Q56120499
        1  KristofferAlbers     Kristoffer Jon Albers        http://www.wikidata.org/entity/Q28845839
        1  michaelkaipetersen   Michael Kai Petersen         http://www.wikidata.org/entity/Q24573646
        0  andrea-cuttone       Andrea Cuttone               http://www.wikidata.org/entity/Q28045195
        0  jakobeglarsen        Jakob Eg Larsen              http://www.wikidata.org/entity/Q25931571
        0  North-Guard          Jeppe Nørregaard             http://www.wikidata.org/entity/Q44738835
        0  laura-rieger         Laura Rieger                 http://www.wikidata.org/entity/Q48975801
        0  letizia-marchegiani  Letizia Marchegiani          http://www.wikidata.org/entity/Q56363348
        0  ekkart               Ekkart Kindler               http://www.wikidata.org/entity/Q25939354

The top two on the list, Anders Boesen Lindbo Larsen and Rasmus Berg Palm, are no longer at DTU Compute, as far as I can tell.

First experiments with the T0 Hugging Face language model

The T0 models were released in October 2021 and are available via Hugging Face (see bigscience/T0pp); they are described in the paper Multitask Prompted Training Enables Zero-Shot Task Generalization (Scholia). The researchers behind the model claim: “The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.”

The small language model, T0_3B, contains 3 billion parameters and fills up 11 gigabytes of disk space at ~/.cache/huggingface/transformers/a80e28…

After setting up protobuf, torch and transformers, the model can be downloaded automatically and tests can be run. On the Hugging Face webpage, there are a few lines of Python code with a sentiment analysis example, here converted to use the small model and edited:

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy", return_tensors="pt"))[0]))
<pad> Positive</s>

It is unclear to me how well these large pre-trained language models handle languages other than English. My knowledge of prompt engineering is also limited, so the examples below are my naive first-shot attempts:

>>> print(tokenizer.decode(model.generate(tokenizer.encode("Hvem er statsminister i Danmark?", return_tensors="pt"))[0]))
<pad> <unk>ystein Svensson</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("English: Copenhagen is the capital. Danish:", return_tensors="pt"))[0]))
<pad> Copenhagen is the capital of Denmark.</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("English: Copenhagen is the capital. Translate to French:", return_tensors="pt"))[0]))
<pad> Copenhagen is the capital of Denmark.</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("What city is the second largest in Denmark?", return_tensors="pt"))[0]))
<pad> Copenhagen</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote Der er et yndigt land?", return_tensors="pt"))[0]))
<pad> Theodore Roosevelt</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote the song 'She loves you'?", return_tensors="pt"))[0]))
<pad> John Lennon and Paul McCartney</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote 'Der er et yndigt land'?", return_tensors="pt"))[0]))
<pad> <unk>sgeir <unk>sgeirsson</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Question: Who wrote 'Der er et yndigt land'? Answer:", return_tensors="pt"))[0]))
<pad> Henrik Ibsen</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Question: What city is the second largest in Denmark? Answer:", return_tensors="pt"))[0]))
<pad> Copenhagen</s>

Each of these answers takes over 20 seconds to complete on the initial CPU-based system I used.

On the non-Danish question “Who wrote the song ‘She loves you’?” the model gets the answer right, but for the Danish questions it fails.

For the question “Who wrote Der er et yndigt land?”, i.e., who wrote the Danish national anthem, T0_3B incorrectly answers Theodore Roosevelt or Henrik Ibsen, depending on the prompt, while the Google search engine returns “Adam Oehlenschläger” for me.

The question can be converted to SPARQL for submission to the Wikidata Query Service:

SELECT ?who {
  ?work rdfs:label 'Der er et yndigt land'@da ;   
        ( wdt:P50 | wdt:P676 | wdt:P86 | wdt:P58 | wdt:P2679 | wdt:P2680 ) / rdfs:label ?who .
  FILTER (LANG(?who) = 'en')
}

The result there is

Adam Oehlenschläger
Morten Arnfred
Jørgen Ljungdalh
Hans Ernst Krøyer

Oehlenschläger is the author of the text, Krøyer the composer, and Arnfred and Ljungdalh are the screenwriters of a Danish film with the same title as the anthem.

Simon Razniewski (Scholia), Gerhard Weikum (Scholia) and colleagues have recently published their DL4KG 2021 paper Language Models As or For Knowledge Bases (Scholia), where they contemplate the limitations and advantages of language models versus knowledge bases/graphs. They have had access to the GPT-3 language model:

Example: GPT-3 does not have tangible knowledge that Alan Turing was born in London; it merely assigns this a high confidence of 83%. Yann LeCun, on the other hand, is given medium confidence in being a citizen of France and Canada (67% and 26%), but he actually has French and USA citizenship, not Canadian. The LM assigns USA a very low score. The Wikidata KB, on the other hand, only states his French citizenship, not USA. Wikidata is incomplete, but it does not contain any errors.

Language Models As or For Knowledge Bases, page 2

Danish speech recognition April 2021

Danish speech recognition is a developing field. There is a general concern about the lack of good open annotated speech data to train from, and within the sprogteknologi.dk project there is an effort to help establish more data. The Awesome Danish list records few spoken language corpora. Mozilla’s Common Voice has still not been set up for Danish. NST is conveniently licensed under CC0, and the 22 kHz speech corpus is 6.7 GB. An example sentence is “Da jeg havde travlt og blev utålmodig ,<komma> tilbød jeg at betale for frikadellerne .<punktum>”. The punctuation is read aloud. In Python, the raw 16-bit audio files may be read with A = np.fromfile('ST1/080601/003_PSA/DA_PSA01.001', dtype='>i2'), as far as I can determine.
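
A minimal, self-contained sketch of reading such a raw big-endian 16-bit file with NumPy (the file here is synthetic; the real corpus uses paths like the one mentioned above):

```python
import numpy as np

# Write a tiny synthetic raw file with three big-endian 16-bit samples
np.array([0, 1000, -1000], dtype='>i2').tofile('sample.raw')

# Read it back the way the NST audio files are read: '>i2' is big-endian int16
A = np.fromfile('sample.raw', dtype='>i2')
print(A.tolist())  # [0, 1000, -1000]
```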

The speech recognition tool Danspeech is the work of Lars Kai Hansen‘s students, Martin Carsten Nielsen and Rasmus Arpe Fogh Jensen, and is available as a Python package. There is an associated demo repository. I was able, with a bit of hassle, to install danspeech and danspeechdemo in a Python 3.7 virtual environment with clones from the GitHub repositories, not the danspeech version on the cheeseshop (though a fix may now be in place). The demo brings up a webpage where sound can be recorded and transcribed.

My first attempt was a sentence from a psychiatry book with a good number of unusual words: “En tilstand efter indtagelse af et psykoaktivt stof, som medfører forstyrrelser af bevidsthedsniveau, kognitive funktioner, perception, affekt, adfærd eller andre psykofysiologiske funktioner og reaktioner.” It is not clear to me which kind of configuration is best for such a sentence, and my pronunciation may not be the best. The transcription below makes errors for the form of “psykoaktive/t”, “affect/effect” and the unusual word “psykofysiologiske/psykosocial giske”, where “giske” is not a Danish word. And punctuation is missing.

Google’s speech-to-text is another tool for Danish speech recognition. The service is freemium, with a demo on its webpage. The transcription of the same sentence as above, though perhaps spoken slightly differently, is shown. I was not able to get it to show the full sentence.

Female GitHubbers

In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There are not that many: currently just 27.

The Python code below reads the SPARQL results into a Python Pandas DataFrame, queries the GitHub API for follower counts and adds the information as a dataframe column. Then we can rank the female GitHub users according to follower count and format the results in an HTML table.

Code

import re
import requests
import pandas as pd
from time import sleep

query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("https://github.com/", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])

URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    try:
        response = requests.get(url,
                                headers={'Accept':'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        # Fall back to zero followers on request or parsing errors
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))
    sleep(5)

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(['followers'], ascending=False)[columns].to_html(index=False))

Results

The top one is Jennifer Bryan, a Vancouver statistician whom I do not know much about, but she seems to be involved in RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat has apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wiki world: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners”. Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers  github.value         researcherLabel.value       researcher.value
     1675  jennybc              Jennifer Bryan              http://www.wikidata.org/entity/Q40579104
     1299  jesstess             Jessica McKellar            http://www.wikidata.org/entity/Q19667922
      475  triketora            Tracy Chou                  http://www.wikidata.org/entity/Q24238925
      347  olgabot              Olga B. Botvinnik           http://www.wikidata.org/entity/Q44163048
      124  vsoch                Vanessa V. Sochat           http://www.wikidata.org/entity/Q30133235
       84  brainwane            Sumana Harihareswara        http://www.wikidata.org/entity/Q18912181
       75  lydiapintscher       Lydia Pintscher             http://www.wikidata.org/entity/Q18016466
       56  agbeltran            Alejandra González-Beltrán  http://www.wikidata.org/entity/Q27824575
       22  frimelle             Lucie-Aimée Kaffee          http://www.wikidata.org/entity/Q37860261
       21  isabelleaugenstein   Isabelle Augenstein         http://www.wikidata.org/entity/Q30338957
       20  cnap                 Courtney Napoles            http://www.wikidata.org/entity/Q42797251
       15  tudorache            Tania Tudorache             http://www.wikidata.org/entity/Q29053249
       13  vedina               Nina Jeliazkova             http://www.wikidata.org/entity/Q27061849
       11  mkutmon              Martina Summer-Kutmon       http://www.wikidata.org/entity/Q27987764
        7  caoyler              Catalina Wilmers            http://www.wikidata.org/entity/Q38915853
        7  esterpantaleo        Ester Pantaleo              http://www.wikidata.org/entity/Q28949490
        6  NuriaQueralt         Núria Queralt Rosinach      http://www.wikidata.org/entity/Q29644228
        2  rongwangnu           Rong Wang                   http://www.wikidata.org/entity/Q35178434
        2  lschiff              Lisa Schiff                 http://www.wikidata.org/entity/Q38916007
        1  SigridK              Sigrid Klerke               http://www.wikidata.org/entity/Q28152723
        1  amrapalijz           Amrapali Zaveri             http://www.wikidata.org/entity/Q34315853
        1  mesbahs              Sepideh Mesbah              http://www.wikidata.org/entity/Q30098458
        1  ChristineChichester  Christine Chichester        http://www.wikidata.org/entity/Q19845665
        1  BinaryStars          Shima Dastgheib             http://www.wikidata.org/entity/Q42091042
        1  mollymking           Molly M. King               http://www.wikidata.org/entity/Q40705344
        0  jannahastings        Janna Hastings              http://www.wikidata.org/entity/Q27902110
        0  nmjakobsen           Nina Munkholt Jakobsen      http://www.wikidata.org/entity/Q38674430

Danish stopword lists

Python’s NLTK package has some support for Danish and there is a small list of 94 stopwords. They are available with

>>> import nltk
>>> nltk.corpus.stopwords.words('danish')

MIT-licensed spaCy is another NLP Python package. The support for Danish is still limited, but it has a stopword list. With version 2+ of spaCy, it is available from:

>>> from spacy.lang.da.stop_words import STOP_WORDS

spaCy 2.0.3 has 219 words in that list.

MIT-licensed “stopwords-iso” has a list of 170 words (October 2016 version). They are available from the GitHub repo at https://github.com/stopwords-iso.

The Snowball stemmer has 94 words at http://snowball.tartarus.org/algorithms/danish/stop.txt.

In R, the GPL-3-licensed tm package uses the Snowball stemmer stopword list. The 94 words are available with:

> install.packages("tm")
> library(tm)
> stopwords(kind="da")

The NLTK stopwords are also the same as the Snowball stopwords. This can be checked with:

import re
import nltk
import requests

url = "http://snowball.tartarus.org/algorithms/danish/stop.txt"
snowball_stopwords = re.findall(r'^(\w+)', requests.get(url).text,
                                flags=re.MULTILINE | re.UNICODE)
nltk_stopwords = nltk.corpus.stopwords.words('danish')
snowball_stopwords == nltk_stopwords

A search with an Internet search engine on “Danish stopwords” reveals several other pointers to lists.