Month: November 2017

Female GitHubbers


In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There aren’t many: currently just 27.

The Python code below gets the SPARQL results into a Python Pandas DataFrame, queries the GitHub API for the follower count of each user, and adds that information as a DataFrame column. Then we can rank the female GitHub users according to follower count and format the results as an HTML table.



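import re
import requests
import pandas as pd

# Female (Q6581072) persons with a GitHub user name (P2037) in Wikidata
query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
# The first argument is the URL of the SPARQL endpoint
response = requests.get("",
                        params={'query': query, 'format': 'json'})
# Flatten the JSON bindings into columns such as 'github.value'
researchers = pd.json_normalize(response.json()['results']['bindings'])

# Base URL of the GitHub user API; the user name is appended to it
URL = ""
followers = []
for github in researchers['github.value']:
    # Skip user names that are not purely alphanumeric
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    try:
        response = requests.get(url)
        user_followers = response.json()['followers']
    except (requests.RequestException, ValueError, KeyError):
        # Fall back to zero if the request or the JSON lookup fails
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(by='followers', ascending=False)[columns].to_html(index=False))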
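
The dotted column names used above, such as github.value, come from flattening the nested JSON that a SPARQL endpoint returns. Here is a minimal standard-library sketch of that flattening and of the ranking step. It uses a hand-written sample response instead of a live query, and the helper name flatten_binding is made up for this illustration. The two follower counts are the ones listed in the table in this post.

```python
# Hand-written sample in the SPARQL JSON results layout (illustrative only)
sample_response = {
    "results": {
        "bindings": [
            {"github": {"type": "literal", "value": "jesstess"},
             "researcherLabel": {"type": "literal", "xml:lang": "en",
                                 "value": "Jessica McKellar"}},
            {"github": {"type": "literal", "value": "jennybc"},
             "researcherLabel": {"type": "literal", "xml:lang": "en",
                                 "value": "Jennifer Bryan"}},
        ]
    }
}


def flatten_binding(binding):
    """Flatten one binding to keys like 'github.value' (hypothetical helper)."""
    return {"{}.{}".format(variable, key): value
            for variable, cell in binding.items()
            for key, value in cell.items()}


rows = [flatten_binding(b) for b in sample_response["results"]["bindings"]]

# Follower counts copied from the table in this post; no API call is made
sample_followers = {"jennybc": 1675, "jesstess": 1299}
for row in rows:
    row["followers"] = sample_followers[row["github.value"]]

# Rank by follower count, highest first
ranked = sorted(rows, key=lambda row: row["followers"], reverse=True)
for row in ranked:
    print(row["followers"], row["github.value"], row["researcherLabel.value"])
```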


The top one is Jennifer Bryan, a Vancouver statistician whom I do not know much about, but she seems to be involved in RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wikiworld: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

I was surprised to see that Isis Agora Lovecruft is not there, but there is no Wikidata item representing her. She would have been number three.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners”. Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers github.value researcherLabel.value researcher.value
1675 jennybc Jennifer Bryan
1299 jesstess Jessica McKellar
475 triketora Tracy Chou
347 olgabot Olga B. Botvinnik
124 vsoch Vanessa V. Sochat
84 brainwane Sumana Harihareswara
75 lydiapintscher Lydia Pintscher
56 agbeltran Alejandra González-Beltrán
22 frimelle Lucie-Aimée Kaffee
21 isabelleaugenstein Isabelle Augenstein
20 cnap Courtney Napoles
15 tudorache Tania Tudorache
13 vedina Nina Jeliazkova
11 mkutmon Martina Summer-Kutmon
7 caoyler Catalina Wilmers
7 esterpantaleo Ester Pantaleo
6 NuriaQueralt Núria Queralt Rosinach
2 rongwangnu Rong Wang
2 lschiff Lisa Schiff
1 SigridK Sigrid Klerke
1 amrapalijz Amrapali Zaveri
1 mesbahs Sepideh Mesbah
1 ChristineChichester Christine Chichester
1 BinaryStars Shima Dastgheib
1 mollymking Molly M. King
0 jannahastings Janna Hastings
0 nmjakobsen Nina Munkholt Jakobsen

Danish stopword lists


Python’s NLTK package has some support for Danish and there is a small list of 94 stopwords. They are available with

>>> import nltk
>>> nltk.corpus.stopwords.words('danish')

MIT-licensed spaCy is another NLP Python package. Its support for Danish is still limited, but it has a stopword list. With version 2+ of spaCy, the words are available from

>>> from spacy.lang.da.stop_words import STOP_WORDS

spaCy 2.0.3 has 219 words in that list.

MIT-licensed “stopwords-iso” has a list of 170 words (October 2016 version). They are available from the GitHub repo at

The Snowball stemmer has 94 words at

In R, the GPL-3-licensed tm package uses the Snowball stemmer stopword list. The 94 words are available with:

> install.packages("tm")
> library(tm)
> stopwords(kind="da")

The NLTK stopwords are the same as the Snowball stopwords. This can be checked with:

import re
import nltk
import requests

url = ""
# Each stopword is the first word on its line in the Snowball file
snowball_stopwords = re.findall(r'^(\w+)', requests.get(url).text,
                                flags=re.MULTILINE | re.UNICODE)
nltk_stopwords = nltk.corpus.stopwords.words('danish')
snowball_stopwords == nltk_stopwords
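
The == comparison above is only true when the two lists hold the same words in the same order. When comparing other stopword lists it may be more informative to see which words differ. Here is a small sketch of such a comparison. The function name compare_word_lists is made up for this illustration, and the two short lists are toy data rather than any of the real stopword lists.

```python
def compare_word_lists(first, second):
    """Report how two word lists differ (hypothetical helper)."""
    first_set, second_set = set(first), set(second)
    return {
        # True only if the words and their order are identical
        "same_order": list(first) == list(second),
        # True if the words are identical, regardless of order
        "same_words": first_set == second_set,
        "only_in_first": sorted(first_set - second_set),
        "only_in_second": sorted(second_set - first_set),
    }


# Toy lists for illustration only, not the real stopword lists
list_a = ["og", "i", "jeg", "det"]
list_b = ["i", "og", "det", "jeg", "at"]

report = compare_word_lists(list_a, list_b)
print(report)
```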

A search with an Internet search engine on “Danish stopwords” reveals several other pointers to lists.



Find titles of all works published by DTU Cognitive Systems in 2017


Find titles of all works published by DTU Cognitive Systems in 2017! How difficult can that be? To identify all titles of works from a research organization? With Wikidata and the Wikidata Query Service (WDQS) at hand, it shouldn’t be that difficult to do. Nevertheless, I ran into a few hitches:

  1. There is what we can call the “Nathan Churchill Problem”: Nathan Churchill was at one point affiliated with our research section, Cognitive Systems, and wrote papers, e.g., together with our Morten Mørup. One paper clearly identifies him as affiliated with our section. Searching the DTU website yields no homepage for him, though. He is now at St. Michael’s Hospital, Toronto, according to a newer paper. So is he no longer affiliated with the Cognitive Systems section? That is somewhat difficult to establish with credible and citable sources. If he is not, then any simple SPARQL query on the WDQS for Cognitive Systems papers will yield his new papers, which shouldn’t be counted as Cognitive Systems section papers. If we could point to a source that indicates that his affiliation with our section has ended, we could add a qualifier to the P1416 property in his Wikidata entry and extend the SPARQL query. What I ended up doing was to explicitly filter out two of Churchill’s publications with the ugly line “FILTER(?work != wd:Q42595201 && ?work != wd:Q36384548)”. The problem is of course not confined to Churchill. For instance, Scholia currently lists new publications by our Søren Hauberg on the Scholia page for DIKU, a department where he was previously affiliated. We discussed the affiliation problem a bit in the Scholia paper, see page 253 (page 17).
  2. Datetime datatype conversion with xsd:dateTime. The filter on date uses this line: “FILTER(?publication_datetime >= "2017-01-01"^^xsd:dateTime)”. Something like “FILTER(?publication_datetime >= xsd:dateTime(2017))” does not work.
  3. Missing data. It is difficult to establish how complete the Wikidata listing is for our section with respect to publications. Scraping Google Scholar, PubMed and our local university database of publications could be a possibility, but this is far from streamlined with the tools I have developed.
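
The date filter from the second point only has a lower bound, so it would also let through works dated after 2017. Here is a small sketch of a helper that builds a filter for one calendar year, using the same typed-literal form that worked above. The function name year_filter is made up for this illustration, and the upper bound is an addition that is not part of the query in this post.

```python
def year_filter(variable, year):
    """Build a SPARQL FILTER limiting `variable` to one calendar year.

    Hypothetical helper: the lower bound mirrors the filter in the post,
    the upper bound is an assumption added for illustration.
    """
    start = '"{}-01-01"^^xsd:dateTime'.format(year)
    end = '"{}-01-01"^^xsd:dateTime'.format(year + 1)
    return "FILTER({v} >= {s} && {v} < {e})".format(v=variable, s=start, e=end)


print(year_filter("?publication_datetime", 2017))
```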

The full query is listed below and the result is available from this link. Currently, 48 results are returned.

SELECT ?workLabel
WITH {
  SELECT
    ?work (MIN(?publication_datetime) AS ?datetime)
  WHERE {
    # Find CogSys work
    ?researcher wdt:P108 | wdt:P463 | wdt:P1416/wdt:P361* wd:Q24283660 .
    ?work wdt:P50 ?researcher .
    ?work wdt:P31 wd:Q13442814 .
    # Nathan Churchill seems no longer to be affiliated!?
    FILTER(?work != wd:Q42595201 && ?work != wd:Q36384548)
    # Filter to year 2017
    ?work wdt:P577 ?publication_datetime .
    FILTER(?publication_datetime >= "2017-01-01"^^xsd:dateTime)
  }
  GROUP BY ?work
} AS %results
WHERE {
  INCLUDE %results
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,ja,nl,ru,zh". }
}


Can you scrape Google Scholar?


With the WikiCite project, the bibliographic information on Wikidata is increasing rapidly, with Wikidata now describing 9.3 million scientific articles and 36.6 million citations. As far as I can determine, most of the work is currently done by James Hare and Daniel Mietchen. Mietchen’s Research Bot has over 11 million edits on Wikidata, while Hare has 15 million edits. For entering data into Wikidata from PubMed you can basically walk your way through the PMIDs, starting with “1”, with the Fatameh tool. Hare’s reference work can take advantage of a web service provided by the National Institutes of Health. For instance, a single URL to this web service will return a JSON-formatted result with citation information. Such a URL is apparently what Hare used to set up P2860 citation information in Wikidata. CrossRef may be another resource.

Beyond these resources, we could potentially use Google Scholar. A former terms of service/EULA of Google Scholar stated that: “You shall not, and shall not allow any third party to: […] (j) modify, adapt, translate, prepare derivative works from, decompile, reverse engineer, disassemble or otherwise attempt to derive source code from any Service or any other Google technology, content, data, routines, algorithms, methods, ideas design, user interface techniques, software, materials, and documentation; […] “crawl”, “spider”, index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof); (m) create or attempt to create a substitute or similar service or product through use of or access to any of the Service or proprietary information related thereto”. Here, “create or attempt to create a substitute or similar service” is a stopping point.

The Google Scholar terms document now seems to have been superseded by the all-embracing Google Terms of Service. This document seems less restrictive: “Don’t misuse our Services” and “You may not use content from our Services unless you obtain permission from its owner or are otherwise permitted by law.” So it may or may not be OK to crawl and/or use/republish the data from Google Scholar. See also a StackExchange question and another StackExchange question.

The Google robots.txt limits automated access with the following relevant lines:

Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholar_share

“/citations?user=” means that bots are allowed to access the user profiles. Google Scholar user identifiers may be recorded in Wikidata by a dedicated property, so you could automatically access Google Scholar user profiles from the information in Wikidata.

So if there is some information you can get from Google Scholar, is it worth it?
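
To see how the robots.txt lines above play out for concrete paths, here is a rough sketch of a rule checker. It assumes the precedence that Google documents for its own crawlers: the longest matching rule wins, and Allow wins a tie. Only the “*” wildcard is handled, and the function names are made up for this illustration, so treat it as a reading aid rather than an authoritative robots.txt parser.

```python
import re

ROBOTS_RULES = """\
Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholar_share
"""


def parse_rules(text):
    """Return a list of (is_allow, pattern) tuples from robots.txt lines."""
    rules = []
    for line in text.splitlines():
        directive, _, pattern = line.partition(":")
        directive = directive.strip().lower()
        pattern = pattern.strip()
        if directive in ("allow", "disallow") and pattern:
            rules.append((directive == "allow", pattern))
    return rules


def pattern_matches(pattern, path):
    """Match a rule against the start of a path; '*' matches any characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


def is_allowed(path, rules):
    """Assumed precedence: longest matching pattern wins, Allow wins ties."""
    best = None
    for allow, pattern in rules:
        if pattern_matches(pattern, path):
            candidate = (len(pattern), allow)
            if best is None or candidate > best:
                best = candidate
    # A path that no rule matches is allowed
    return True if best is None else best[1]


rules = parse_rules(ROBOTS_RULES)
for path in ["/citations?user=gQVuJh8AAAAJ",
             "/citations?user=gQVuJh8AAAAJ&cstart=20",
             "/scholar?q=wikidata"]:
    print(path, is_allowed(path, rules))
```

Under these assumptions the profile page itself is allowed, while the paged profile URL with cstart= is not, since the Disallow rule with the wildcard is longer than the Allow rule for user=.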

The Scholia code now adds a module with some preliminary Google Scholar processing attempts. There is command-line based scraping of a researcher profile. For instance,

python -m scholia.googlescholar get-user-data gQVuJh8AAAAJ

It does not work too well. As far as I can determine, you need to page with JavaScript to get more than the initial 20 results (it would be interesting to examine the Publish or Perish software to see how a larger set of results is obtained). Not all bibliographic metadata is available for each item on the Google Scholar page – as far as I can see: No DOI. No PubMed identifier. The author list may be abbreviated with an ellipsis (‘…’). Matching a Google Scholar item with data already present in Wikidata does not seem that straightforward.

It is worth remembering that Wikidata has the P4028 property to link to Google Scholar articles. There aren’t many items using it yet though: 31. It was suggested by Vladimir Alexiev back in May 2017, but it seems that I am the only one using the property. Bot access to the link target provided by P4028 is – as far as I can see from the robots.txt – not allowed.