How to quickly generate word analogy datasets with Wikidata

Posted on Updated on

One popular task in computational linguistics/natural language processing is the word analogy task: Copenhagen is to Denmark as Berlin is to …?

With queries to Wikidata Query Service (WDQS) it is reasonably easy to generate word analogy datasets in whatever (Wikidata-supported) language you like. For instance, for capitals and countries, a WDQS SPARQL query that returns results in Danish could go like this:

select ?country1Label ?capital1Label
  ?country2Label ?capital2Label
where {
  ?country1 wdt:P36 ?capital1 .
  ?country1 wdt:P463 wd:Q1065 .
  ?country1 wdt:P1082 ?population1 .
  filter (?population1 > 5000000)
  ?country2 wdt:P36 ?capital2 .
  ?country2 wdt:P463 wd:Q1065 .
  ?country2 wdt:P1082 ?population2 .
  filter (?population2 > 5000000)
  filter (?country1 != ?country2)
  service wikibase:label
    { bd:serviceParam wikibase:language "da". }
}
limit 1000

Follow this link to get to the query and press “Run” to get the results. It is possible to download the table in CSV format (see under “Download”). One issue to note is that you get multiple entries for countries with multiple capital cities, e.g., Sydafrika (South Africa) is listed with Pretoria, Kapstaden (Cape Town) and Bloemfontein.
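Once the CSV is downloaded, the country/capital pairs can be combined into analogy quadruples with a few lines of Python. Here is a minimal sketch; the hardcoded pairs merely stand in for the downloaded rows:

```python
from itertools import permutations

def analogy_quadruples(pairs):
    """Turn (country, capital) pairs into analogy quadruples:
    capital1 is to country1 as capital2 is to country2."""
    return [(capital1, country1, capital2, country2)
            for ((country1, capital1),
                 (country2, capital2)) in permutations(pairs, 2)]

# Stand-ins for rows from the downloaded CSV (Danish labels).
pairs = [('Danmark', 'København'),
         ('Tyskland', 'Berlin'),
         ('Frankrig', 'Paris')]
quads = analogy_quadruples(pairs)
print(len(quads))  # 6 ordered quadruples from 3 pairs
```

Note that the multi-capital issue mentioned above would produce several quadruples for, e.g., South Africa, so you may want to deduplicate by country first.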

How much does it cost to buy all my scientific articles?


How much does it cost to buy all my scientific articles?

Disregarding the slight difference in exchange rate between the current Euro and USD, the answer is around 1,200 USD/Euros. That is the amount of money I would have to pay to download all the scientific articles I have been involved in, if I did not have access to a university library with a subscription. I have signed over the copyright of many articles to a long string of publishers (Elsevier, Wiley, IEEE, Springer, etc.), and I no longer control the publication.

I have added a good number of my articles to Wikidata including the price for each article. The SPARQL-based Wikidata Query Service is able to generate a table with the price information, see here. The total sum is also available after a slight modification of the SPARQL query.
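If you instead download the price table as CSV, the total can also be computed locally; a small sketch with pandas, where the column name “price” and the values are just illustrative assumptions:

```python
import io

import pandas as pd

# Stand-in for the CSV downloaded from the Wikidata Query Service;
# the 'price' column name and the values are assumptions.
csv_text = """article,price
Article A,38.0
Article B,36.0
Article C,15.0
"""

df = pd.read_csv(io.StringIO(csv_text))
total = df['price'].sum()
print(total)  # 89.0
```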

The Wikidata Query Service can also generate plots, for instance, of the price per page as a function of the publication date (choose “Graph builder” under “Display”). In the plot below the unit (currency) is a mix of USD and Euro. (There seems to be an issue with the shapes in the legend.)


Something like 3 to 4 USD/Euros per page seems to be what eyeballing an average comes to.

Among the most expensive articles are the ones from the journal Neuroinformatics published by Springer: 43.69 Euros for each article. Wiley articles cost 38 USD and the Elsevier articles around 36 USD. The Association for Computing Machinery sells their articles for only 15 USD. A bargain.

It may be difficult to find the price of the articles. Science claims that “Science research is available free with registration one year after initial publication.” However, I was not able to get to the full text of The Real Power of Artificial Markets on the Science website. On one page you can stumble onto this: “Purchase Access to this Article for 1 day for US$30.00”, and that is what I put into Wikidata. The article is fairly short, so this price makes it the priciest article per page.


I ought to write something discerning about the state of scientific publishing. However, I will instead redirect you to a recent blog post by Tal Yarkoni.

“Overzealous business types”?


The University of Copenhagen and its problematic dismissal of notable scientist Hans Thybo have now landed in an editorial of Nature: “Corporate culture spreads to Scandinavia“. Their concluding claim is that “the threat is the colonization of universities by overzealous business types” (against academic freedom).

Interestingly, though the majority of the university board members are required by law to be from outside the university (not necessarily from business), the university management usually has an academic background. This is also the case for the management around Hans Thybo:

  1. The head of department for Hans Thybo is Claus Beier, see “Hans Thybos institutleder om fyringssagen“. Beier has a PhD and is a professor with a long series of publications on climate change, as can be studied on Google Scholar.
  2. The dean is John Renner Hansen, see “KU spildte ½ million på konsulentundersøgelse af Thybo for misbrug af forskningsmidler“. He is also a researcher and claims to have “Approximately 600 publications in international refereed journals”.
  3. The head of the university is Ralf Hemmingsen, whom I know as a notable researcher in psychiatry.

I am not convinced by the arguments in the Nature editorial, which sets up “business types” against academics. I think that the case should rather be seen against the background of the case with Milena Penkowa and another story around the possible abuse of research funds at the Copenhagen University Hospital, see “Ny sag om fusk med penge til forskning“.

The Wikidata scholarly profile page



Recently Lambert Heller wrote an overview piece on websites for scholarly profile pages: “What will the scholarly profile page of the future look like? Provision of metadata is enabling experimentation“. There he tabulated the features of the various online sites offering scholarly profile pages. These sites include (with links to my entries): ORCID, ResearchGate, Mendeley, Pure and VIVO (I don’t know these two), Google Scholar and Impactstory. One site missing from the comparison is Wikidata. It can produce scholarly profile pages too. The default Wikidata editing interface may not present the data in a nice way (Magnus Manske’s Reasonator does better), but very much of the functionality is there to make a scholarly profile page.

In terms of the features listed by Heller, I will here list the possible utilization of Wikidata:

  1. Portrait picture: The P18 property can record a Wikimedia Commons image related to a researcher. For instance, you can see a nice photo of neuroimaging professor Russ Poldrack.
  2. Researcher’s alternative names: This is possible with the alias functionality in Wikidata. Poldrack is presently recorded with the canonical label “Russell A. Poldrack” and the alternative names “Russell A Poldrack”, “R. A. Poldrack”, “Russ Poldrack” and “R A Poldrack”. It is straightforward to add more variations.
  3. IDs/profiles in other systems: There are absolutely loads of these links in Wikidata. To name a few deep linking possibilities: Twitter, Google Scholar, VIAF, ISNI, ORCID, ResearchGate, GitHub and Scopus. Wikidata is very strong in interlinking databases.
  4. Papers and similar: Papers are represented as items in Wikidata and these items can link to the author via P50. The reverse link is possible with a SPARQL query. Furthermore, on the researcher’s item it is possible to list main works with the appropriate property. Full texts can be linked with the P953 property. PDFs of papers with an appropriately compatible license can be uploaded to Wikimedia Commons and/or included in Wikisource.
  5. Uncommon research product: I am not sure what this is, but the developer of software services is recorded in Wikidata. For instance, for the neuroinformatics database OpenfMRI it is specified that Poldrack is the creator. Backlinks are possible with SPARQL queries.
  6. Grants, third-party funding: Well, there is a sponsor property, but how it should be utilized for researchers is not clear. With the property, you can specify that a paper or research project was funded by an entity. For the paper The Center for Integrated Molecular Brain Imaging (Cimbi) database you can see that it is funded by the Lundbeck Foundation and Rigshospitalet.
  7. Current institution: Yes. The employer and affiliation properties are there for you. You can see an example of an incomplete list of people affiliated with research sections at my department, DTU Compute, here, automagically generated by Magnus Manske’s Listeria tool.
  8. Former employers, education etc.: Yes. There is a property for employer and for affiliation and for education. With qualifiers you can specify the dates of employment.
  9. Self-assigned keywords: Well, as a Wikidata contributor you can create new items and use these items to specify a field of work or to label your paper with a main theme.
  10. Concepts from a controlled vocabulary: Whether Wikidata is a controlled vocabulary is up for discussion. Wikidata items can be linked to controlled vocabularies, e.g., Dewey’s, so there you can get some control. For instance, the concept “engineer” in Wikidata is linked to BNCF, NDL, GND, ROME, LCNAF, BNF and FAST.
  11. Social graph of followers/friends: No, that is really not possible on Wikidata.
  12. Social graph of coauthors: Yes, that is possible. With Jonas Kress’ work on D3-based graph rendering you get on-the-fly graph rendering in the Wikidata Query Service. You can see my coauthor graph here (it is wobbly at the moment; there is some D3 parameter that needs a tweak).
  13. Citation/attention metadata from the platform itself: No, I don’t think so. You can get page view data from somewhere on the Wikimedia sites. You can also count the number of citations on-the-fly, to an author, to a paper, etc.
  14. Citation/attention metadata from other sources: No, not really.
  15. Comprehensive search to match/include own papers: Well, perhaps not. Or perhaps. Magnus Manske’s sourcemd and quickstatements tools allow you to copy-paste a PMID or DOI into a form field and press two buttons to grab bibliographic information from PubMed and a DOI source. One-click full paper upload is not well supported, to my knowledge. Perhaps Daniel Mietchen knows something about this.
  16. Forums, Q&A, etc.: Well, yes and no. You can use the discussion pages on Wikidata, but these pages are perhaps mostly for discussion of editing, rather than the content of the described item. Perhaps Wikiversity could be used.
  17. Deposit own papers: You can upload appropriately licensed papers to Wikimedia Commons or perhaps Wikisource. Then you can link them from Wikidata.
  18. Research administration tools: No.
  19. Reuse of data from outside the service: You better believe it! Although Wikidata is there to be used, a mass download from the Wikidata Query Service can run into timeout problems. To navigate the structure of individual Wikidata items, you need programming skills, at least for the moment. If you are really desperate you can download the Wikidata dump and Blazegraph and try to set up your own SPARQL server.
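The on-the-fly citation counting mentioned in point 13 boils down to counting P2860 edges per cited item; a Python sketch, with hardcoded (citing, cited) pairs standing in for a SPARQL query result:

```python
from collections import Counter

# Hypothetical (citing, cited) pairs, as a query over the
# P2860 ("cites work") property might return them.
citations = [
    ('Paper A', 'Paper C'),
    ('Paper B', 'Paper C'),
    ('Paper B', 'Paper D'),
]

# Count how often each item appears as the cited work.
counts = Counter(cited for _, cited in citations)
print(counts.most_common(1))  # [('Paper C', 2)]
```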


So what can we use Wikicite for?



Wikicite is a term for the combination of bibliographic information and Wikidata. While Wikipedia often records books of some notability, it rarely records bibliographic information of less notability, i.e., individual scientific articles and books for which little third-party information (reviews, literary analyses, etc.) exists. This is not the case with Wikidata. Wikidata is now beginning to record lots of bibliographic information for “lesser works”. What can we use this treasure trove for? Here are a few of my ideas:

  1. Wikidata may be used as a substitute for a reference manager. I record my own bibliographic information in a big BibTeX file and use the bibtex program together with LaTeX when I generate a scientific document with references. It might very well be that the job of the BibTeX file with bibliographic information may be taken over by Wikidata. So far we have, to my knowledge, no proper program for extracting the data from Wikidata and formatting it for inclusion in a document. I have begun a “wibtex” program for this, which has only reached 44 lines so far, and it remains to be seen whether this is a viable avenue: whether the structure of Wikidata is good and convenient enough to record data for formatting references, or whether Wikidata is too flexible or too restricted for this kind of application.
  2. Wikidata may be used for “lists of publications” of individual researchers, institutions, research groups and sponsors. Nowadays, I keep a list of publications on a webpage, in a LaTeX document and on Google Scholar. My university has a separate list, and sometimes when I write a research application I need to format the data for inclusion in a Microsoft Word document. A flexible program on top of Wikidata could make dynamic lists of publications.
  3. Wikidata may be used to count citations. During the Wikicite 2016 Berlin meeting I suggested the P2860 property and Tobias quickly created it. P2860 allows us to describe citations between items in Wikidata. Though we managed to use the property a bit for scientific articles during the meeting, it has really been James Hare who has been running with the ball. Based on public citation data he has added hundreds of thousands of citations. At the moment this is of course only a very small part of the total number of citations. There are probably tens of millions of scientific papers, each with tens, if not hundreds, of citations, so with the 499,750 citations that James Hare reported on 11 September 2016, we are still far from covering the field: James Hare tweeted that Web of Science claims to have over 1 milliard (billion) citations. The citation counts may be compared to a whole range of context data in Wikidata: author, affiliated institution, journal, year of publication, gender of author and sponsor (funding agency), so we can get, e.g., the most cited Dane (or one affiliated with a Danish institution), the most cited woman with an image, etc.
  4. Wikidata may be used as a hub for information sources. Individual scientific articles may point to further resources, such as raw or result data. I have, for instance, added links to the neuroinformatics databases OpenfMRI, NeuroVault and Neurosynth, where Wikidata records all papers recorded in OpenfMRI, as far as I can determine. Wikidata is then able to list, say, all OpenfMRI papers or all OpenfMRI authors with Magnus Manske’s Listeria tool.
  5. Wikicite information in Wikidata may be used to support claims in Wikidata itself. As Dario Taraborelli points out this would allow queries like “all statements citing journal articles by physicists at Oxford University in the 1970s”.
  6. Wikidata may be used for other scientometric analyses than counting, e.g., generation of coauthor graphs and cocitation graphs giving context to an author or paper. The bubble chart above shows statistics for journals of papers in OpenfMRI generated with the standard Wikidata Query Service bubble chart visualization tool.
  7. Wikidata could be used for citations in Wikipedia. This may very well be problematic, as a large Wikipedia article could have hundreds of references, and each reference would need to be fetched from Wikidata, generating lots of traffic. I tried a single citation on the “OpenfMRI” article (it has since been changed). Some form of inclusion of Wikidata identifiers in Wikipedia references could further Wikipedia bibliometrics, e.g., determine the most cited author across all Wikipedias.
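The “wibtex” idea in point 1 could start as simply as mapping bibliographic values fetched from Wikidata onto BibTeX fields; a toy sketch, where the field names and the example entry are illustrative assumptions:

```python
def to_bibtex(key, entry):
    """Format a dict of bibliographic fields as a BibTeX @article entry."""
    fields = ',\n'.join('  {} = {{{}}}'.format(name, value)
                        for name, value in sorted(entry.items()))
    return '@article{{{},\n{}\n}}'.format(key, fields)

# Illustrative entry; in practice the fields would come from
# Wikidata statements (title, author, publication date, ...).
entry = {
    'author': 'Finn Aarup Nielsen',
    'title': 'Wikipedia research and tools',
    'year': '2016',
}
print(to_bibtex('nielsen2016', entry))
```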

Neuroinformatics coauthor network – so far


neuroinformatics coauthor network 2016-06-28

Screenshot of the neuroinformatics coauthor network so far. Only the big cluster is shown. Network rendered with Jonas Kress’ default setup querying WDQS.

Page rank of scientific papers with citation in Wikidata – so far


A citation property was created just a few hours ago, and as of writing it has still not been deleted. It means we can describe citation networks, e.g., among scientific papers.

So far we have added a few citations, mostly from papers about Zika. And now we can plot the citation network or compute network measures such as page rank.

Below is a Python program using the sparql, pandas and NetworkX packages:

import networkx as nx
import sparql
from pandas import DataFrame

statement = """
select ?source ?sourceLabel ?target ?targetLabel where {
  ?source wdt:P2860 ?target .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
"""

# Query the Wikidata Query Service SPARQL endpoint
service = sparql.Service('https://query.wikidata.org/sparql')
response = service.query(statement)
df = DataFrame(response.fetchall(),
               columns=['source', 'sourceLabel', 'target', 'targetLabel'])

df.sourceLabel = df.sourceLabel.astype(str)
df.targetLabel = df.targetLabel.astype(str)

# Build a directed citation graph and compute its page rank
g = nx.DiGraph()
g.add_edges_from((row.sourceLabel, row.targetLabel)
                 for n, row in df.iterrows())

pr = nx.pagerank(g)
sorted_pageranks = sorted((rank, title)
                          for title, rank in pr.items())[::-1]

for rank, title in sorted_pageranks[:10]:
    print("{:.4} {}".format(rank, title[:40]))

The result:

0.02647 Genetic and serologic properties of Zika
0.02479 READemption-a tool for the computational
0.02479 Intrauterine West Nile virus: ocular and
0.02479 Internet encyclopaedias go head to head
0.02479 A juvenile early hominin skeleton from D
0.01798 Quantitative real-time PCR detection of 
0.01755 Zika virus. I. Isolations and serologica
0.01755 Genetic characterization of Zika virus s
0.0175 Potential sexual transmission of Zika vi
0.01745 Zika virus in Gabon (Central Africa)--20