Python

Female GitHubbers

Posted on Updated on

In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There are not that many: currently just 27.

The Python code below gets the SPARQL results into a Pandas DataFrame, queries the GitHub API for the follower count of each user and adds that information as a DataFrame column. Then we can rank the female GitHub users according to follower count and format the results as an HTML table.

Code

 

import re
from time import sleep

import pandas as pd
import requests

query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("https://github.com/", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])

URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    # Skip user names with unexpected characters
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    try:
        response = requests.get(url,
                                headers={'Accept': 'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))
    sleep(5)  # Be gentle on the GitHub API

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(by='followers', ascending=False)[columns]
      .to_html(index=False))

Results

The top one is Jennifer Bryan, a Vancouver statistician whom I do not know much about, but she seems to be involved with RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wiki world: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

I was surprised to see that Isis Agora Lovecruft is not there, but there is no Wikidata item representing her. She would have been number three.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners”, i.e., their GitHub contribution calendars are almost completely filled. Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers github.value researcherLabel.value researcher.value
1675 jennybc Jennifer Bryan http://www.wikidata.org/entity/Q40579104
1299 jesstess Jessica McKellar http://www.wikidata.org/entity/Q19667922
475 triketora Tracy Chou http://www.wikidata.org/entity/Q24238925
347 olgabot Olga B. Botvinnik http://www.wikidata.org/entity/Q44163048
124 vsoch Vanessa V. Sochat http://www.wikidata.org/entity/Q30133235
84 brainwane Sumana Harihareswara http://www.wikidata.org/entity/Q18912181
75 lydiapintscher Lydia Pintscher http://www.wikidata.org/entity/Q18016466
56 agbeltran Alejandra González-Beltrán http://www.wikidata.org/entity/Q27824575
22 frimelle Lucie-Aimée Kaffee http://www.wikidata.org/entity/Q37860261
21 isabelleaugenstein Isabelle Augenstein http://www.wikidata.org/entity/Q30338957
20 cnap Courtney Napoles http://www.wikidata.org/entity/Q42797251
15 tudorache Tania Tudorache http://www.wikidata.org/entity/Q29053249
13 vedina Nina Jeliazkova http://www.wikidata.org/entity/Q27061849
11 mkutmon Martina Summer-Kutmon http://www.wikidata.org/entity/Q27987764
7 caoyler Catalina Wilmers http://www.wikidata.org/entity/Q38915853
7 esterpantaleo Ester Pantaleo http://www.wikidata.org/entity/Q28949490
6 NuriaQueralt Núria Queralt Rosinach http://www.wikidata.org/entity/Q29644228
2 rongwangnu Rong Wang http://www.wikidata.org/entity/Q35178434
2 lschiff Lisa Schiff http://www.wikidata.org/entity/Q38916007
1 SigridK Sigrid Klerke http://www.wikidata.org/entity/Q28152723
1 amrapalijz Amrapali Zaveri http://www.wikidata.org/entity/Q34315853
1 mesbahs Sepideh Mesbah http://www.wikidata.org/entity/Q30098458
1 ChristineChichester Christine Chichester http://www.wikidata.org/entity/Q19845665
1 BinaryStars Shima Dastgheib http://www.wikidata.org/entity/Q42091042
1 mollymking Molly M. King http://www.wikidata.org/entity/Q40705344
0 jannahastings Janna Hastings http://www.wikidata.org/entity/Q27902110
0 nmjakobsen Nina Munkholt Jakobsen http://www.wikidata.org/entity/Q38674430

Danish stopword lists

Posted on Updated on

Python’s NLTK package has some support for Danish and there is a small list of 94 stopwords. They are available with

>>> import nltk
>>> nltk.corpus.stopwords.words('danish')

MIT-licensed spaCy is another NLP Python package. The support for Danish is still limited, but it has a stopword list. With version 2+ of spaCy, the words are available from

>>> from spacy.lang.da.stop_words import STOP_WORDS

spaCy 2.0.3 has 219 words in that list.

MIT-licensed “stopwords-iso” has a list of 170 words (October 2016 version). They are available from the GitHub repo at https://github.com/stopwords-iso.
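If one wants the stopwords-iso list from Python, it can be fetched directly from GitHub, e.g., as sketched below. Note that I assume here that the Danish list is available as a plain-text file at the usual raw.githubusercontent.com path of the stopwords-iso/stopwords-da repository; check the repository for the actual file layout.

import requests

url = ("https://raw.githubusercontent.com/stopwords-iso/"
       "stopwords-da/master/stopwords-da.txt")
stopwords_iso = requests.get(url).text.split()
print(len(stopwords_iso))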

The Snowball stemmer has 94 words at http://snowball.tartarus.org/algorithms/danish/stop.txt.

In R, the GPL-3-licensed tm package uses the Snowball stemmer stopword list. The 94 words are available with:

> install.packages("tm")
> library(tm)
> stopwords(kind="da")

The NLTK stopwords are the same as the Snowball stopwords. This can be checked with:

import re
import nltk
import requests

url = "http://snowball.tartarus.org/algorithms/danish/stop.txt"
snowball_stopwords = re.findall(r'^(\w+)', requests.get(url).text,
                                flags=re.MULTILINE | re.UNICODE)
nltk_stopwords = nltk.corpus.stopwords.words('danish')
print(snowball_stopwords == nltk_stopwords)

A search with an Internet search engine on “Danish stopwords” reveals several other pointers to lists.

 

 

Page rank of scientific papers with citation in Wikidata – so far

Posted on Updated on

A citation property has just been created a few hours ago and, as of writing, has still not been deleted. It means we can describe citation networks, e.g., among scientific papers.

So far we have added a few citations, mostly from papers about Zika. Now we can plot the citation network or compute network measures such as PageRank.

Below is a Python program tying it all together with the sparql, pandas and NetworkX packages:

import networkx as nx
import sparql
from pandas import DataFrame

statement = """
select ?source ?sourceLabel ?target ?targetLabel where {
  ?source wdt:P2860 ?target .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
"""

service = sparql.Service('https://query.wikidata.org/sparql')
response = service.query(statement)
df = DataFrame(response.fetchall(),
               columns=response.variables)

df.sourceLabel = df.sourceLabel.astype(str)
df.targetLabel = df.targetLabel.astype(str)

g = nx.DiGraph()
g.add_edges_from((row.sourceLabel, row.targetLabel)
                 for n, row in df.iterrows())

pr = nx.pagerank(g)
sorted_pageranks = sorted((rank, title)
                          for title, rank in pr.items())[::-1]

for rank, title in sorted_pageranks[:10]:
    print("{:.4} {}".format(rank, title[:40]))

The result:

0.02647 Genetic and serologic properties of Zika
0.02479 READemption-a tool for the computational
0.02479 Intrauterine West Nile virus: ocular and
0.02479 Internet encyclopaedias go head to head
0.02479 A juvenile early hominin skeleton from D
0.01798 Quantitative real-time PCR detection of 
0.01755 Zika virus. I. Isolations and serologica
0.01755 Genetic characterization of Zika virus s
0.0175 Potential sexual transmission of Zika vi
0.01745 Zika virus in Gabon (Central Africa)--20
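For the plotting part, a quick sketch with NetworkX's matplotlib-based drawing could look like the following. The layout and styling are arbitrary choices of mine, continuing with the graph g built above.

import matplotlib.pyplot as plt
import networkx as nx

positions = nx.spring_layout(g)
nx.draw_networkx(g, positions, node_size=20, with_labels=False, arrows=True)
plt.axis('off')
plt.show()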

Mixed indexing with integer index in Pandas DataFrame

Posted on Updated on

Indexing in Python’s Pandas can at times be tricky. Here is an example of mixed indexing (.ix) with an integer index:

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 's'], [3, 4, 't'], [4, 5, 'u']],
...                   index=[1, 0, 1], columns=['a', 'b', 'c'])
>>> df.a                              # Correct type
1    1
0    3
1    4
Name: a, dtype: int64
>>> df.loc[0, ['a', 'b']]             # Wrong indexing
a    3
b    4
Name: 0, dtype: object
>>> df.ix[0, ['a', 'b']]              # Wrong indexing
a    3
b    4
Name: 0, dtype: object
>>> df.iloc[0, :][['a', 'b']]         # Correct indexing, wrong type
a    1
b    2
Name: 1, dtype: object
>>> df.loc[:, ['a', 'b']].iloc[0, :]  # Correct indexing and type, but long
a    1
b    2
Name: 1, dtype: int64
>>> df.ix[df.index[0], ['a', 'b']]    # Ok
a    1
b    2


I ran into the issue when I wanted to index with an integer in a DataFrame representing EEG data in one of its methods.

Virtual machine for Python with Vagrant

Posted on Updated on

I may have managed to set up a virtual machine with Vagrant, partially following instructions from the Vagrant homepage.

$ sudo aptitude purge vagrant virtualbox virtualbox-dkms virtualbox-qt
$ locate Vagrantfile
$ rm -r ~/.vagrant.d/
$ rm -r ~/virtual
$ rm ~/Vagrantfile 

$ sudo aptitude install vagrant
$ vagrant init hashicorp/precise32
$ vagrant up
There was a problem with the configuration of Vagrant. The error message(s)
are printed below:

vm:
* The box 'hashicorp/precise32' could not be found.

$ vagrant box add precise32 http://files.vagrantup.com/precise32.box
$ ls -l .vagrant.d/boxes/precise32
total 288340
-rw------- 1 fnielsen fnielsen 295237632 Oct  3 15:58 box-disk1.vmdk
-rw------- 1 fnielsen fnielsen     14103 Oct  3 15:58 box.ovf
-rw-r--r-- 1 fnielsen fnielsen       505 Oct  3 15:58 Vagrantfile

$ vagrant up
There was a problem with the configuration of Vagrant. The error message(s)
are printed below:

vm:
* The box 'hashicorp/precise32' could not be found.

$ vagrant box remove precise32
$ vagrant box add precise http://cloud-images.ubuntu.com/vagrant/precise/current/precise-server-cloudimg-i386-vagrant-disk1.box
$ rm Vagrantfile
$ vagrant init precise
$ vagrant up
$ vagrant ssh 
$ uname -a
Linux vagrant-ubuntu-precise-32 3.2.0-69-virtual #103-Ubuntu SMP Tue Sep 2 05:28:41 UTC 2014 i686 i686 i386 GNU/Linux
$ whoami
vagrant


$ sudo aptitude install python-pip
$ sudo pip install numpy
$ sudo aptitude install python-dev
$ sudo pip install numpy
$ python
>>> import numpy
>>> f = open('Hello, virtual world.txt', 'w')
>>> f.write('Hello, virtual world')
>>> f.close()
>>> exit()
$ strings ~/VirtualBox\ VMs/fnielsen_1412345235/box-disk1.vmdk | grep 'Hello, virtual world.txt'
Hello, virtual world.txt
Hello, virtual world.txt

Somewhere in between I erased old VirtualBox files in the “VirtualBox VMs” directory: “rm -r test_1406195091/” and “rm -r pythoner/”.

Musing over Muse

Posted on

This account details the process of getting a Muse talking:

In Ubuntu’s ‘Bluetooth New Device Setup’, after initiating the pairing by pressing the Muse device button for 6 seconds, I see:

Device: 00-06-66-68-9f-ae
Type: Unknown.

I.e., no name and it continues to show ‘Searching for devices…’ with the ‘continue’ button disabled.

With sudo hcidump I get (among the results):

> HCI Event: Extended Inquiry Result (0x2f) plen 255
 bdaddr 00:06:66:68:9F:AE mode 2 clkoffset 0x5b13 class 0x240704 rssi -40
 Unknown type 0x4d with 8 bytes data
 Unknown type 0x00 with 6 bytes data
 Unknown type 0x04 with 9 bytes data

Ubuntu Forum has the thread “11.04 Bluetooth Scanning Endlessly and Not Finding my Phone” (http://ubuntuforums.org/showthread.php?t=1824387), where an answer suggests “modprobe btusb sco rfcomm bnep l2cap”. I have btusb, rfcomm and bnep, but not sco and l2cap. Inspired by another web page, http://www.thinkwiki.org/wiki/How_to_setup_Bluetooth, we can do:

$ hcitool scan
00:06:66:68:9F:AE Muse
$ sdptool records 00:06:66:68:9F:AE 
Service Name: RN-iAP
Service RecHandle: 0x10000
Service Class ID List:
 "Serial Port" (0x1101)
Protocol Descriptor List:
 "L2CAP" (0x0100)
 "RFCOMM" (0x0003)
 Channel: 1
 "" (0x1200)
Service Name: Wireless iAP
Service RecHandle: 0x10001
Service Class ID List:
 UUID 128: 00000000-deca-fade-deca-deafdecacaff
Protocol Descriptor List:
 "L2CAP" (0x0100)
 "RFCOMM" (0x0003)
 Channel: 2
Language Base Attr List:
 code_ISO639: 0x656e
 encoding: 0x6a
 base_offset: 0x100
$ sudo hcitool info 00:06:66:68:9F:AE
Requesting information ...
 BD Address: 00:06:66:68:9F:AE
 Device Name: Muse
 LMP Version: 3.0 (0x5) LMP Subversion: 0x1a31
 Manufacturer: Cambridge Silicon Radio (10)
 Features page 0: 0xff 0xff 0x8f 0xfe 0x9b 0xff 0x59 0x83
 <3-slot packets> <5-slot packets> <encryption> <slot offset> 
 <timing accuracy> <role switch> <hold mode> <sniff mode> 
 <park state> <RSSI> <channel quality> <SCO link> <HV2 packets> 
 <HV3 packets> <u-law log> <A-law log> <CVSD> <paging scheme> 
 <power control> <transparent SCO> <broadcast encrypt> 
 <EDR ACL 2 Mbps> <EDR ACL 3 Mbps> <enhanced iscan> 
 <interlaced iscan> <interlaced pscan> <inquiry with RSSI> 
 <extended SCO> <EV4 packets> <EV5 packets> <AFH cap. slave> 
 <AFH class. slave> <3-slot EDR ACL> <5-slot EDR ACL> 
 <sniff subrating> <pause encryption> <AFH cap. master> 
 <AFH class. master> <EDR eSCO 2 Mbps> <EDR eSCO 3 Mbps> 
 <3-slot EDR eSCO> <extended inquiry> <simple pairing> 
 <encapsulated PDU> <non-flush flag> <LSTO> <inquiry TX power> 
 <extended features> 
 Features page 1: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
$ hcitool name 00:06:66:68:9F:AE
Muse
$ sudo hcitool cc 00:06:66:68:9F:AE

The ‘Bluetooth New Device Setup’ now manages to get through: It claims that “Paired” is “Yes”, but “Type” is still “Unknown”. The address is set correctly.

With the 32-bit Bluetooth library installed, the muse-io program in the SDK can now start:

$ sudo aptitude install libbluetooth3:i386 
$ ./muse-io --preset 14 --device 00:06:66:68:9F:AE --osc osc.udp://localhost:5000

oscdump does not directly work due to Matt hard-coding a directory name.

But still apparently no data in MuseLab. muse-player does not work out of the box.

$ ./muse-player
ImportError: /home/fnielsen/projects/Muse/liblo.so: wrong ELF class: ELFCLASS32

Moving the libraries provided by the SDK out of the way and trying again:

$ mkdir attick
$ mv libl* attick/
$ ./muse-player 
 from google.protobuf.internal import enum_type_wrapper
ImportError: cannot import name enum_type_wrapper

The Ubuntu 12.04 version of protobuf apparently does not work. google.__version__ is not set and there is no version number in the code! “dpkg -l python-protobuf” reports 2.4.1-1ubuntu2. “sudo aptitude remove python-protobuf” seems shaky because there is a range of dependencies that look important, though they only seem to be related to Ubuntu One. pip install protobuf gets into trouble because of version dependencies, so within a virtualenv environment we can do:

$ pip install protobuf
$ pip install pyliblo

This may require:

$ sudo aptitude install liblo-dev

Executing muse-player directly in the virtualenv produces an error because of the hard-coded Python path (/usr/bin/env python should have been used). Then there is a dependency on SciPy, so NumPy and SciPy should be installed in the virtualenv:

$ pip install numpy scipy

The bad news is that muse-io requires a 32-bit version of liblo while my muse-player run through Python requires the 64-bit version. The solution seems to be to move muse-io to a directory independent of the Python files and in that directory also put the SDK-provided liblo library files.

$ ./muse-io --preset 14 --device 00:06:66:68:9F:AE --osc osc.udp://localhost:5000
$ python ~/projects/Muse/muse-player -l udp://localhost:5000

These commands produce a continuous output like:

...
Playback Time: 12.3s : Sending Data 1410548303.53 /muse/acc fff 222.66 976.56 50.78
Playback Time: 12.4s : Sending Data 1410548303.55 /muse/acc fff 222.66 976.56 54.69
Playback Time: 12.4s : Sending Data 1410548303.57 /muse/acc fff 222.66 980.47 54.69
...
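As an alternative way to peek at the OSC stream from muse-io, a small listener using the pyliblo package installed above could look roughly like the sketch below. This is my own sketch, not part of the Muse SDK; it assumes muse-io sends to port 5000 as in the command above and that nothing else, such as muse-player, is bound to that port.

import liblo

def acc_callback(path, args):
    # /muse/acc carries three floats (the 'fff' typespec)
    x, y, z = args
    print("{} {:.2f} {:.2f} {:.2f}".format(path, x, y, z))

server = liblo.Server(5000)
server.add_method("/muse/acc", 'fff', acc_callback)

while True:
    server.recv(100)  # poll with a 100 ms timeout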

Zipf plot for word counts in Brown corpus

Posted on

[Figure: Zipf plot (log-log rank-frequency plot) of token counts in the Brown corpus]

There are various ways of plotting the distribution of highly skewed (heavy-tailed) data, e.g., with a histogram with logarithmically-spaced bins on a log-log plot, or by generating a Zipf-like plot (rank-frequency plot) like the above. This figure uses token count data from the Brown corpus as made available in the NLTK package.

For fitting the Zipf curve, a simple SciPy-based approach is suggested on Stack Overflow by “Evert”. More sophisticated power-law fitting is implemented in the Python package powerlaw, described in “Powerlaw: a Python package for analysis of heavy-tailed distributions”, which is based on the Clauset paper.

from collections import Counter
from nltk.corpus import brown
from pylab import *

# The data: lowercased token counts from the Brown corpus
tokens_with_count = Counter(word.lower() for word in brown.words())
counts = array(list(tokens_with_count.values()))
tokens = list(tokens_with_count.keys())

# A Zipf plot
ranks = arange(1, len(counts) + 1)
indices = argsort(counts)
frequencies = counts[indices]
loglog(ranks, frequencies, marker=".")
title("Zipf plot for Brown corpus tokens")
xlabel("Frequency rank of token")
ylabel("Absolute frequency of token")
grid(True)

# Annotate a handful of logarithmically spaced points with their tokens
for n in logspace(0.5, log10(len(counts) - 1), 20).astype(int):
    text(ranks[n], frequencies[n], " " + tokens[indices[n]],
         verticalalignment="bottom",
         horizontalalignment="left")
show()

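For reference, fitting with the powerlaw package mentioned above could look roughly like this sketch of mine; the discrete=True setting and the lognormal comparison are choices I make here, not something from the original analysis.

from collections import Counter
from nltk.corpus import brown
import powerlaw

# Token counts from the Brown corpus, as in the Zipf plot above
counts = list(Counter(word.lower() for word in brown.words()).values())

# Fit a discrete power law and report the estimated parameters
fit = powerlaw.Fit(counts, discrete=True)
print("alpha = {:.2f}, xmin = {}".format(fit.power_law.alpha, fit.power_law.xmin))

# Compare the power-law fit against a lognormal alternative
R, p = fit.distribution_compare('power_law', 'lognormal')
print("Loglikelihood ratio = {:.2f}, p = {:.3f}".format(R, p))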

You can’t fool Python

Posted on

There is this funny thing with Python that allows you to have static variables in functions by putting a mutable object as the default argument.
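As a quick illustration of that trick (my example, not from the original post):

def counter(store=[]):
    # The list is created once, when the function is defined,
    # so it keeps its contents between calls like a static variable.
    store.append(len(store) + 1)
    return len(store)

print(counter())  # 1
print(counter())  # 2
print(counter())  # 3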

In Ruby, default arguments are evaluated each time the function is called (I am told), so you can make recursive calls with two Ruby functions calling each other through their default arguments:

def func1(f=nil)
  print("In func1a\n")
  return f
end

def func2(f=func1)
  print("In func2a\n")
  return f()
end

def func1(f=func2)
  print("In func1b\n")
  return f()
end

func1()


Ruby complains that the stack level becomes too deep.

In Python, the default argument is evaluated once when the function is defined, so the result of calling one of the Python functions will be different from that of calling one of the Ruby functions.

def func1(f1=None):
    print("In func1a")
    return f1

def func2(f2=func1):
    print("In func2a")
    return f2()

def func1(f1=func2):
    print("In func1b")
    return f1()

def func2(f2=func1):
    print("In func2b")
    return f2()

func1()

def func3():
    print("In func3a")
    return None

def func4():
    print("In func4a")
    return func3()

def func3():
    print("In func4b")
    return func4()

# func3()  # Yes, it is still possible to call the functions recursively


Graph spectra with NetworkX

Posted on Updated on

[Figure: graph spectrum plot]

I was looking for a value describing how clustered a network is. I thought that somewhere in the graph spectrum would be a good place to start and that the Python package NetworkX would have some useful methods. However, I couldn’t immediately see any good methods in NetworkX. Then Morten Mørup mentioned something about community detection and modularity and I got sidetracked, but now I am back again at the graph spectrum.

The second smallest eigenvalue of the Laplacian matrix of the graph seems to represent reasonably well what I was looking for. Apparently that eigenvalue is called the algebraic connectivity.

NetworkX has a number of graph generators, and for small test cases the algebraic connectivity seems to give an OK value for how clustered the network is, or rather how non-clustered it is.
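A minimal sketch of the kind of check I mean is shown below; the example graphs are my own picks from NetworkX's generators, not from any particular analysis. The algebraic connectivity is the second smallest Laplacian eigenvalue, and it comes out small for a "clustered" graph such as a barbell graph and larger for a densely connected graph.

import networkx as nx
import numpy as np

def algebraic_connectivity(graph):
    # Second smallest eigenvalue of the graph Laplacian
    eigenvalues = np.sort(nx.laplacian_spectrum(graph))
    return eigenvalues[1]

# A few small test graphs from NetworkX's generators
graphs = {
    'complete graph K10': nx.complete_graph(10),
    'barbell graph (two K5s joined by a path)': nx.barbell_graph(5, 2),
    'path graph P10': nx.path_graph(10),
}

for name, graph in graphs.items():
    print("{:<45} {:.4f}".format(name, algebraic_connectivity(graph)))

Newer NetworkX versions also provide an algebraic_connectivity function that computes this value directly.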

Entertained by scandalous deceiving melancholy, hurrah!

Posted on Updated on

[Figure: scatter plot]

In my effort to beat the SentiStrength text sentiment analysis algorithm by Mike Thelwall I came up with a low-hanging-fruit killer approach, or so I thought. Using the standard movie review data set of Bo Pang available in NLTK (used in research papers as a benchmark data set), I would train an NLTK classifier, compare it with my valence-labeled word list AFINN, and readjust the weights of the words a little.

What I found, however, was that for a great number of words the sentiment valence in my AFINN word list and the classifier probability trained on the movie reviews were in disagreement. A word such as ‘distrustful’ I have as a quite negative word. However, the classifier reports the probability for ‘positive’ to be 0.87, i.e., quite positive. I examined where the word ‘distrustful’ occurred in the movie review data set:

$ egrep -ir "\bdistrustful\b" ~/nltk_data/corpora/movie_reviews/

The word ‘distrustful’ appears 3 times, in all cases in a ‘positive’ movie review. The word is used to describe elements of the narrative or an outside reference rather than the quality of the movie itself. Another word that I have as negative is ‘criticized’. It is used 10 times in the positive movie reviews (and not at all in the negative ones); I find one negation (‘the casting cannot be criticized’), but mostly the word appears in contexts where the reviewer criticizes the critique of others, e.g., ‘many people have criticized fincher’s filming […], but i enjoy and relish in the portrayal’.
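The same check can be done from Python against the NLTK copy of the corpus. This is a quick sketch of mine; it counts the number of reviews per category that contain the word, which need not equal the raw occurrence counts above.

from collections import Counter
from nltk.corpus import movie_reviews

def category_counts(word):
    # Number of reviews in each category that contain the word
    counts = Counter()
    for category in movie_reviews.categories():
        for fileid in movie_reviews.fileids(category):
            if word in movie_reviews.words(fileid):
                counts[category] += 1
    return counts

print(category_counts('distrustful'))
print(category_counts('criticized'))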

The top 15 ‘misaligned’ words using my ad hoc metric are listed here:

 

Diff. Word AFINN Classifier
0.75 hurrah 5 0.25
0.75 motherfucker -5 0.75
0.75 cock -5 0.75
0.68 lol 3 0.12
0.67 distrustful -3 0.87
0.67 anger -3 0.87
0.66 melancholy -2 0.96
0.65 criticized -2 0.95
0.65 bastard -5 0.65
0.65 downside -2 0.95
0.65 frauds -4 0.75
0.65 catastrophic -4 0.75
0.64 biased -2 0.94
0.63 amusements 3 0.17
0.63 worsened -3 0.83

 

It seems that reviewers are interested in movies that have a certain amount of ‘melancholy’, ‘anger’, distrustfulness and (further down the list) scandal, apathy, hoax, struggle, hopelessness and hindrance, whereas smile, amusement, peacefulness and gratefulness are associated with negative reviews. So are movie reviewers unempathetic schadenfreudians entertained by the characters’ misfortune? Hmmm…? It reminds me of journalism, where they say “a good story is a bad story”.

So much for philosophy, back to reality:

The words (such as ‘hurrah’) that have a classifier probability of 0.25 or 0.75 typically each occur only once in the corpus. In this application of the classifier I should perhaps have used a stronger prior probability, so that ‘hurrah’ with 0.25 would end up around the middle of the scale with a probability of 0.5. I haven’t checked whether it is possible to readjust the prior in the NLTK naïve Bayes classifier.

The conclusion on my Thelwallizer is not good. A straightforward application of the classifier on the movie reviews gets you features that look at the summary of the narrative rather than the movie per se, so this simple approach is not particularly helpful for readjusting the weights.

However, there is another way the trained classifier can be used. Examining the most informative features, I can ask whether they exist in my AFINN list. The first few missing words are: slip, ludicrous, fascination, 3000, hudson, thematic, seamless, hatred, accessible, conveys, addresses, annual, incoherent, stupidity, … I cannot use ‘hudson’ in my word list, but words such as ludicrous, seamless and incoherent are surely missing.

(28 January 2012: Look out in the code below! The way the features are constructed for the classifier is troublesome. In NLTK you should not only specify the words that appear in the text with ‘True’; you should normally also explicitly specify the words that do not appear in the text with ‘False’. Not mentioning words in the feature dictionary might be bad depending on the application.)

https://gist.github.com/1410094
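To illustrate the point in the note above, here is a sketch of mine (not the code from the gist; the small vocabulary and the use of prob_classify are just for demonstration) showing that the two ways of building the feature dictionary can give different probabilities:

import nltk
from nltk.corpus import movie_reviews

# A small, fixed vocabulary just for the illustration
vocabulary = ['distrustful', 'melancholy', 'criticized', 'seamless', 'hurrah']

def presence_only_features(words):
    # Only words that occur end up in the feature dictionary
    return {word: True for word in set(words) if word in vocabulary}

def presence_and_absence_features(words):
    # Every vocabulary word is mentioned explicitly, either True or False
    word_set = set(words)
    return {word: (word in word_set) for word in vocabulary}

documents = [(movie_reviews.words(fileid), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

for make_features in (presence_only_features, presence_and_absence_features):
    train_set = [(make_features(words), category)
                 for words, category in documents]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    probability = classifier.prob_classify(
        make_features(['distrustful'])).prob('pos')
    print("{}: P(pos | 'distrustful') = {:.2f}".format(
        make_features.__name__, probability))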