Female GitHubbers

Posted on Updated on

In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There aren’t many. Currently just 27.

The Python code below gets the SPARQL results into a Python Pandas DataFrame, queries the GitHub API for follower counts and adds the information to a dataframe column. Then we can rank the female GitHub users according to follower count and format the result as an HTML table.



import re
import requests
import pandas as pd

query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("https://github.com/", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""

response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])

URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    if re.match('^[a-zA-Z0-9]+$', github):
        url = URL + github
        response = requests.get(url)
        user_followers = response.json()['followers']
    else:
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(by='followers', ascending=False)[
    columns].to_html(index=False))
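The `^[a-zA-Z0-9]+$` guard above only queries the API for plain alphanumeric user names. GitHub user names may also contain hyphens, so a slightly broader validation pattern could be used; the pattern below is a sketch of my reading of GitHub’s rules (letters, digits and hyphens, but no hyphen at either end), not an official specification:

```python
import re

# Assumed rule: letters, digits and hyphens, but no leading/trailing hyphen.
pattern = re.compile(r'^[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$')

for name in ['jennybc', 'lydia-pintscher', '-bad-', 'evil/../path']:
    print(name, bool(pattern.match(name)))
```

This would keep hyphenated names such as “lydia-pintscher” in the loop while still rejecting anything that could mangle the constructed URL.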


The top one is Jennifer Bryan, a Vancouver statistician whom I do not know much about, but she seems to be involved in RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat has apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wiki world: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

I was surprised to see that Isis Agora Lovecruft is not there, but there is no Wikidata item representing her. She would have been number three.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners”. Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers github.value researcherLabel.value researcher.value
1675 jennybc Jennifer Bryan
1299 jesstess Jessica McKellar
475 triketora Tracy Chou
347 olgabot Olga B. Botvinnik
124 vsoch Vanessa V. Sochat
84 brainwane Sumana Harihareswara
75 lydiapintscher Lydia Pintscher
56 agbeltran Alejandra González-Beltrán
22 frimelle Lucie-Aimée Kaffee
21 isabelleaugenstein Isabelle Augenstein
20 cnap Courtney Napoles
15 tudorache Tania Tudorache
13 vedina Nina Jeliazkova
11 mkutmon Martina Summer-Kutmon
7 caoyler Catalina Wilmers
7 esterpantaleo Ester Pantaleo
6 NuriaQueralt Núria Queralt Rosinach
2 rongwangnu Rong Wang
2 lschiff Lisa Schiff
1 SigridK Sigrid Klerke
1 amrapalijz Amrapali Zaveri
1 mesbahs Sepideh Mesbah
1 ChristineChichester Christine Chichester
1 BinaryStars Shima Dastgheib
1 mollymking Molly M. King
0 jannahastings Janna Hastings
0 nmjakobsen Nina Munkholt Jakobsen

Danish stopword lists


Python’s NLTK package has some support for Danish and there is a small list of 94 stopwords. They are available with

>>> import nltk
>>> nltk.corpus.stopwords.words('danish')

MIT-licensed spaCy is another NLP Python package. Its support for Danish is still limited, but it has a stopword list. With version 2+ of spaCy, the words are available from

>>> from spacy.lang.da.stop_words import STOP_WORDS

spaCy 2.0.3 has 219 words in that list.

MIT-licensed “stopwords-iso” has a Danish list of 170 words (October 2016 version). They are available from the project’s GitHub repository.

The Snowball stemmer also has a Danish stopword list with 94 words.

In R, the GPL-3-licensed tm package uses the Snowball stemmer stopword list. The 94 words are available with:

> install.packages("tm")
> library(tm)
> stopwords(kind="da")
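Whichever list one picks, the typical use is the same. Below is a minimal sketch of stopword filtering, with a few words from the Danish lists hardcoded so the example does not depend on downloaded corpora:

```python
# A small illustrative subset of the Danish stopword lists discussed here.
DANISH_STOPWORDS = {'og', 'i', 'jeg', 'det', 'at', 'en', 'den', 'til', 'er', 'som'}

def remove_stopwords(tokens):
    """Return the tokens that are not stopwords (case-insensitive)."""
    return [token for token in tokens if token.lower() not in DANISH_STOPWORDS]

tokens = 'jeg bor i en by og det er godt'.split()
print(remove_stopwords(tokens))  # ['bor', 'by', 'godt']
```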

The NLTK stopwords are the same as the Snowball stopwords. This can be checked with:

import re
import nltk
import requests

# Snowball's Danish stopword list; words are the first token on each line.
url = "http://snowball.tartarus.org/algorithms/danish/stop.txt"
snowball_stopwords = re.findall(r'^(\w+)', requests.get(url).text,
                                flags=re.MULTILINE | re.UNICODE)
nltk_stopwords = nltk.corpus.stopwords.words('danish')
print(snowball_stopwords == nltk_stopwords)

A search with an Internet search engine on “Danish stopwords” reveals several other pointers to lists.



Page rank of scientific papers with citations in Wikidata – so far


A citation property (P2860) has just been created a few hours ago, and as of writing it has still not been deleted. It means we can describe citation networks, e.g., among scientific papers.

So far we have added a few citations, mostly from papers about Zika. And now we can plot the citation network or compute network measures such as PageRank.

Below is a Python program tying everything together with SPARQL, Pandas and NetworkX:

import networkx as nx
import sparql  # the "sparql-client" package
from pandas import DataFrame

statement = """
select ?source ?sourceLabel ?target ?targetLabel where {
  ?source wdt:P2860 ?target .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
"""

service = sparql.Service('https://query.wikidata.org/sparql')
response = service.query(statement)
df = DataFrame(response.fetchall(),
               columns=['source', 'sourceLabel', 'target', 'targetLabel'])

df.sourceLabel = df.sourceLabel.astype(str)
df.targetLabel = df.targetLabel.astype(str)

g = nx.DiGraph()
g.add_edges_from((row.sourceLabel, row.targetLabel)
                 for n, row in df.iterrows())

pr = nx.pagerank(g)
sorted_pageranks = sorted(((rank, title)
                           for title, rank in pr.items()),
                          reverse=True)

for rank, title in sorted_pageranks[:10]:
    print("{:.4} {}".format(rank, title[:40]))

The result:

0.02647 Genetic and serologic properties of Zika
0.02479 READemption-a tool for the computational
0.02479 Intrauterine West Nile virus: ocular and
0.02479 Internet encyclopaedias go head to head
0.02479 A juvenile early hominin skeleton from D
0.01798 Quantitative real-time PCR detection of 
0.01755 Zika virus. I. Isolations and serologica
0.01755 Genetic characterization of Zika virus s
0.0175 Potential sexual transmission of Zika vi
0.01745 Zika virus in Gabon (Central Africa)--20
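As a sanity check of the approach, PageRank on a toy citation graph behaves as expected (the paper names here are hypothetical placeholders):

```python
import networkx as nx

# Toy citation network: A and B both cite C; C cites D.
g = nx.DiGraph()
g.add_edges_from([('A', 'C'), ('B', 'C'), ('C', 'D')])

pr = nx.pagerank(g)
# The much-cited C, and the D that C cites, outrank the uncited A and B.
for title, rank in sorted(pr.items(), key=lambda item: -item[1]):
    print("{:.4} {}".format(rank, title))
```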

Mixed indexing with integer index in Pandas DataFrame


Indexing in Python’s Pandas can at times be tricky, in particular mixed indexing (.ix) with an integer index.

I ran into the issue when I wanted to index with an integer into a DataFrame representing EEG data in one of its methods.
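The kind of trickiness involved can be sketched as follows (the data and column name are made up for illustration; note also that .ix has since been deprecated in favour of the unambiguous .loc and .iloc):

```python
import pandas as pd

# A DataFrame with a non-sequential integer index, as epoched EEG data might have.
df = pd.DataFrame({'amplitude': [0.1, 0.5, 0.9]}, index=[10, 20, 30])

# Label-based indexing interprets 10 as an index *label* ...
print(df.loc[10, 'amplitude'])   # 0.1

# ... while positional indexing interprets 0 as a row *position*.
print(df.iloc[0, 0])             # 0.1

# With .ix, an integer could mean either, which is exactly where it got tricky.
```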

Virtual machine for Python with Vagrant


I may have managed to set up a virtual machine with Vagrant, partially following the instructions from the Vagrant homepage.

$ sudo aptitude purge vagrant virtualbox virtualbox-dkms virtualbox-qt
$ locate Vagrantfile
$ rm -r ~/.vagrant.d/
$ rm -r ~/virtual
$ rm ~/Vagrantfile 

$ sudo aptitude install vagrant
$ vagrant init hashicorp/precise32
$ vagrant up
There was a problem with the configuration of Vagrant. The error message(s)
are printed below:

* The box 'hashicorp/precise32' could not be found.

$ vagrant box add precise32
$ ls -l .vagrant.d/boxes/precise32
total 288340
-rw------- 1 fnielsen fnielsen 295237632 Oct  3 15:58 box-disk1.vmdk
-rw------- 1 fnielsen fnielsen     14103 Oct  3 15:58 box.ovf
-rw-r--r-- 1 fnielsen fnielsen       505 Oct  3 15:58 Vagrantfile

$ vagrant up
There was a problem with the configuration of Vagrant. The error message(s)
are printed below:

* The box 'hashicorp/precise32' could not be found.

$ vagrant box remove precise32
$ vagrant box add precise
$ rm Vagrantfile
$ vagrant init precise
$ vagrant up
$ vagrant ssh 
$ uname -a
Linux vagrant-ubuntu-precise-32 3.2.0-69-virtual #103-Ubuntu SMP Tue Sep 2 05:28:41 UTC 2014 i686 i686 i386 GNU/Linux
$ whoami

$ sudo aptitude install python-pip
$ sudo pip install numpy
$ sudo aptitude install python-dev
$ sudo pip install numpy
$ python
>>> import numpy
>>> f = open('Hello, virtual world.txt', 'w')
>>> f.write('Hello, virtual world')
>>> f.close()
>>> exit()
$ strings ~/VirtualBox\ VMs/fnielsen_1412345235/box-disk1.vmdk | grep 'Hello, virtual world.txt'
Hello, virtual world.txt
Hello, virtual world.txt

Somewhere in between I erased old VirtualBox files in the “VirtualBox VMs” directory: “rm -r test_1406195091/” and “rm -r pythoner/”.

Musing over Muse


This account details the process of getting a Muse talking:

In Ubuntu’s ‘Bluetooth New Device Setup’ I see the following after initiating the pairing by pressing the Muse device button for 6 seconds:

Device: 00-06-66-68-9f-ae
Type: Unknown.

I.e., no name and it continues to show ‘Searching for devices…’ with the ‘continue’ button disabled.

With sudo hcidump I get (among the results)

> HCI Event: Extended Inquiry Result (0x2f) plen 255
 bdaddr 00:06:66:68:9F:AE mode 2 clkoffset 0x5b13 class 0x240704 rssi -40
 Unknown type 0x4d with 8 bytes data
 Unknown type 0x00 with 6 bytes data
 Unknown type 0x04 with 9 bytes data

Ubuntu Forum has “11.04 Bluetooth Scanning Endlessly and Not Finding my Phone” where an answer suggests “modprobe btusb sco rfcomm bnep l2cap”. I have btusb, rfcomm and bnep, but not sco and l2cap. Inspired by another web page we can do:

$ hcitool scan
00:06:66:68:9F:AE Muse
$ sdptool records 00:06:66:68:9F:AE 
Service Name: RN-iAP
Service RecHandle: 0x10000
Service Class ID List:
 "Serial Port" (0x1101)
Protocol Descriptor List:
 "L2CAP" (0x0100)
 "RFCOMM" (0x0003)
 Channel: 1
 "" (0x1200)
Service Name: Wireless iAP
Service RecHandle: 0x10001
Service Class ID List:
 UUID 128: 00000000-deca-fade-deca-deafdecacaff
Protocol Descriptor List:
 "L2CAP" (0x0100)
 "RFCOMM" (0x0003)
 Channel: 2
Language Base Attr List:
 code_ISO639: 0x656e
 encoding: 0x6a
 base_offset: 0x100
$ sudo hcitool info 00:06:66:68:9F:AE
Requesting information ...
 BD Address: 00:06:66:68:9F:AE
 Device Name: Muse
 LMP Version: 3.0 (0x5) LMP Subversion: 0x1a31
 Manufacturer: Cambridge Silicon Radio (10)
 Features page 0: 0xff 0xff 0x8f 0xfe 0x9b 0xff 0x59 0x83
 <3-slot packets> <5-slot packets> <encryption> <slot offset> 
 <timing accuracy> <role switch> <hold mode> <sniff mode> 
 <park state> <RSSI> <channel quality> <SCO link> <HV2 packets> 
 <HV3 packets> <u-law log> <A-law log> <CVSD> <paging scheme> 
 <power control> <transparent SCO> <broadcast encrypt> 
 <EDR ACL 2 Mbps> <EDR ACL 3 Mbps> <enhanced iscan> 
 <interlaced iscan> <interlaced pscan> <inquiry with RSSI> 
 <extended SCO> <EV4 packets> <EV5 packets> <AFH cap. slave> 
 <AFH class. slave> <3-slot EDR ACL> <5-slot EDR ACL> 
 <sniff subrating> <pause encryption> <AFH cap. master> 
 <AFH class. master> <EDR eSCO 2 Mbps> <EDR eSCO 3 Mbps> 
 <3-slot EDR eSCO> <extended inquiry> <simple pairing> 
 <encapsulated PDU> <non-flush flag> <LSTO> <inquiry TX power> 
 <extended features> 
 Features page 1: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
$ hcitool name 00:06:66:68:9F:AE
$ sudo hcitool cc 00:06:66:68:9F:AE

The ‘Bluetooth New Device Setup’ now manages to get through: It claims that “Paired” is “Yes”, but “Type” is still “Unknown”. The address is set correctly.

With the 32-bit Bluetooth library installed, muse-io from the SDK can now start:

$ sudo aptitude install libbluetooth3:i386 
$ ./muse-io --preset 14 --device 00:06:66:68:9F:AE --osc osc.udp://localhost:5000

oscdump does not directly work due to Matt hard-coding a directory name.

But still apparently no data in MuseLab, and muse-player does not work out of the box.

$ ./muse-player
ImportError: /home/fnielsen/projects/Muse/ wrong ELF class: ELFCLASS32

Moving the libraries provided by the SDK and trying again.

$ mkdir attick
$ mv libl* attick/
$ ./muse-player 
 from google.protobuf.internal import enum_type_wrapper
ImportError: cannot import name enum_type_wrapper

The Ubuntu 12.04 version of protobuf apparently does not work. google.__version__ is not set and there is no version number in the code! “dpkg -l python-protobuf” reports 2.4.1-1ubuntu2. “sudo aptitude remove python-protobuf” seems shaky because there is a range of dependencies that look important, though they only seem to be related to Ubuntu One. pip install protobuf gets into trouble because of version dependencies, so within a virtualenv environment we can instead do:

$ pip install protobuf
$ pip install pyliblo

This may require:

$ sudo aptitude install liblo-dev

Executing muse-player directly in the virtualenv produces an error because of the hardcoded Python path (/usr/bin/env python should have been used). Then there is a dependency on SciPy, so NumPy and SciPy should be installed in the virtualenv:

$ pip install numpy scipy

The bad news is that muse-io requires a 32-bit version of libliblo while my muse-player through Python requires a 64-bit one. The solution seems to be to move muse-io to a directory independent of the Python files and in that directory also put the SDK-provided liblo library files.

$ ./muse-io --preset 14 --device 00:06:66:68:9F:AE --osc osc.udp://localhost:5000
$ python ~/projects/Muse/muse-player -l udp://localhost:5000

These commands produce a continuous output like:

Playback Time: 12.3s : Sending Data 1410548303.53 /muse/acc fff 222.66 976.56 50.78
Playback Time: 12.4s : Sending Data 1410548303.55 /muse/acc fff 222.66 976.56 54.69
Playback Time: 12.4s : Sending Data 1410548303.57 /muse/acc fff 222.66 980.47 54.69

Zipf plot for word counts in Brown corpus



There are various ways of plotting the distribution of highly skewed (heavy-tailed) data, e.g., with a histogram with logarithmically-spaced bins on a log-log plot, or by generating a Zipf-like plot (rank-frequency plot) like the above. This figure uses token count data from the Brown corpus as made available in the NLTK package.

For fitting the Zipf curve, a simple SciPy-based approach is suggested on Stack Overflow by “Evert”. More complicated power-law fitting is implemented in the Python package powerlaw, described in “Powerlaw: a Python package for analysis of heavy-tailed distributions”, which is based on the Clauset paper.
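The rank-frequency data behind such a Zipf plot can be computed in a few lines; the sketch below uses a small hardcoded token list in place of the Brown corpus tokens:

```python
from collections import Counter

# Toy token list standing in for the Brown corpus tokens.
tokens = 'the of the and the to a the of in the and'.split() * 10

counts = Counter(tokens)
# Rank 1 is the most frequent word; frequencies sorted in descending order.
frequencies = sorted(counts.values(), reverse=True)
ranks = range(1, len(frequencies) + 1)

for rank, frequency in zip(ranks, frequencies):
    print(rank, frequency)

# Plotting ranks against frequencies on log-log axes, e.g., with
# matplotlib's loglog(ranks, frequencies, '.'), gives the Zipf-like plot.
```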