Latest Event Updates

My h-index as of June 2017: Coverage of researcher profile sites

Posted on

The coverage of different researcher profile sites and their citation statistics varies. Google Scholar seems to be the site with the largest coverage, – it even crawls and indexes my slides. The open Wikidata is far from there, but may be the only one with machine-readable free access and advanced search.

Below is the citation statistics in the form of the h-index from five different services.

h Service
28 Google Scholar
27 ResearchGate
22 Scopus
22(?) Semantic Scholar
18 Web of Science
8 Wikidata

Semantic Scholar does not give an overview of the citation statistics, and the count is somewhat hidden on the individual article pages. I attempted as best as I could to determine the value, but it might be incorrect.

I made a similar statistics on 8 May 2017 and reported it on the slides Wikicite (page 42). During the one and a half month since that count, the statistics for Scopus has change from 20 to 22.

Semantic Scholar is run by the Allen Institute for Artificial Intelligence, a non-profit research institute, so they made be interested in opening up their data for search. An API does, to my knowledge, not (yet?) exist, but they have a gentle robots.txt. It is also possible to download the full Semantic Scholar corpus from http://labs.semanticscholar.org/corpus/. (Thanks to Vladimir Alexiev for bringing my attention to this corpus).

When does an article cite you?

Posted on Updated on

Google Scholar alerted me to a recent citation to my work from Teacher-Student Relationships, Satisfaction, and Achievement among Art and Design College Students in Macau, a paper published in Journal of Education and Practice of to me unknown repute.

In the references, I see a listing of Persistence of Web References in Scientific Research where I was among the coauthors. So in which context is this paper cited? I seems strange that an article about link rot is cited by an article about teacher-student relationships… Indeed I cannot find the reference in body text when I search on the first author’s last name (“lawrence”).

Indeed several other items in references listing I cannot find: Joe Smith’s “One of Volvo’s core values”, Strunk et al.’s “The element of style” and Van der Geer’s “The art of writing a scientific article”. Notable is it that the first four references is out of order in the otherwise alphabetic sorted list of references, so there must be an error. Perhaps it is an error arising from a copy-and-paste typo?

In this case, I would say, that even though being listed, I am not actually cited by the article. The “fact” of whether it is a citation or not is important to discuss if we want to record the citation in Wikidata, where “Persistence of Web References in Scientific Research” is recorded with the item Q21012586, see also the Scholia entry. Possible we could record the erroneous citation and the use the Wikidata deprecated rank facility: “Value is known to be wrong but (used to be) commonly believed”.

Some statistics on scholarly data in Wikidata

Posted on Updated on

The Wikicite initiative have spawned a lot of work on bibliographic/source information in Wikidata. Particularly scholarly bibliographic information has been added to Wikidata. Recently James Hare announced that we have over 3 million citations recorded in Wikidata, – mostly due to automated additions made by Hare himself.

With the tools of Magnus Manske and James Hare that are presently central to the growth of scholarly bibliographic data on Wikidata, we do not get a direct link to the authors items of Wikidata. Such information presently needs to be added manually or in a semi-automated fashion. Sponsor/funding information is neither added automatically, – except for a US organization where James Hare added this information.

So how much data do we have in Wikidata when we ask if the data is linked to other Wikidata items? Below are a few queries to the Wikidata Query Service that attempt to answer some aspects of this question.

Scientific articles

How many items do we have in Wikidata that describe a scientific article and that is linked to an author item?

SELECT (COUNT(DISTINCT ?work) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
}

The query returns 45’253.

How many scientific articles with one or more author items and no author name string (indicating that the author linking may be complete).

SELECT (COUNT(DISTINCT ?work) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  FILTER NOT EXISTS { ?work wdt:P2093 ?authorname }
}

This query gives 3’567.

How many items do we have in Wikidata that is claimed to be a scientific article?

SELECT (COUNT(DISTINCT ?work) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
}

This query gives 677’630.

Scientific authors

How many authors are in Wikidata that have written a scientific article?

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
}

The query returns 10’193.

How many authors are in Wikidata that have written a scientific article and where the gender is indicated?

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?author wdt:P21 ?gender .
}

This query gives 8’853.

How many authors are there in Wikidata that have written a scientific article and where the scientific article is recorded having made one or more citations.

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?work wdt:P2860 ?cited_work .
}

This query returns 6’586.

How many authors are there in Wikidata that have written a scientific article and where the scientific article is recorded having made one or more citations and the cited work is recorded with one or more author items.

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?work wdt:P2860 ?cited_work .
  ?cited_work wdt:P50 ?cited_author .
}

This query returns 5’614.

How many authors are there in Wikidata that have written a scientific article and where the scientific article is recorded having made one or more citations and the cited work is recorded with one or more author items and where the genders of both the citing and the cited author are known.

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?work wdt:P2860 ?cited_work .
  ?cited_work wdt:P50 ?cited_author .
  ?author wdt:P21 ?gender .
  ?cited_author wdt:P21 ?cited_gender .
}

This query gives 4,730.

How many authors are there in Wikidata that have written a scientific article and where the scientific article is recorded having made one or more citations and the cited work is recorded with one or more author items and where the genders of both the citing and the cited author are known and where there is no author name string in neither the work nor the cited work (indicating that the work and the cited work may be completely linked with respect to author name.

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?work wdt:P2860 ?cited_work .
  ?cited_work wdt:P50 ?cited_author .
  ?author wdt:P21 ?gender .
  ?cited_author wdt:P21 ?cited_gender .
  FILTER NOT EXISTS { ?work wdt:P2093 ?authorname }
  FILTER NOT EXISTS { ?cited_work wdt:P2093 ?cited_authorname }
}

This query gives only 551.

Sponsor/funders

Sponsors of scientific articles ordered by number of citations.

SELECT ?number_of_citations ?sponsorLabel
WITH {
  SELECT (COUNT(?citing_work) AS ?number_of_citations) ?sponsor
  WHERE {
    ?work wdt:P859 ?sponsor .
    ?work wdt:P31 wd:Q13442814 .
    ?citing_work wdt:P2860 ?work .
  }
  GROUP BY ?sponsor
} AS %result
WHERE {
  INCLUDE %result
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?number_of_citations)
LIMIT 5

This query gives National Institute for Occupational Safety and Health, Lundbeck Foundation, The Danish Council for Strategic Research, National Institute of Allergy and Infectious Diseases, University of Wisconsin–Madison.

How to quickly generate word analogy datasets with Wikidata

Posted on Updated on

One popular task in computational linguistics/natural language processing is the word analogy task: Copenhagen is to Denmark as Berlin is to …?

With queries to Wikidata Query Service (WDQS) it is reasonably easy to generate word analogy datasets in whatever (Wikidata-supported) language you like. For instance, for capitals and countries, a WDQS SPARQL query that returns results in Danish could go like this:

select
  ?country1Label ?capital1Label
  ?country2Label ?capital2Label
where { 
  ?country1 wdt:P36 ?capital1 .
  ?country1 wdt:P463 wd:Q1065 .
  ?country1 wdt:P1082 ?population1 .
  filter (?population1 > 5000000)
  ?country2 wdt:P36 ?capital2 .
  ?country2 wdt:P463 wd:Q1065 .
  ?country2 wdt:P1082 ?population2 .
  filter (?population2 > 5000000)
  filter (?country1 != ?country2)
  service wikibase:label
    { bd:serviceParam wikibase:language "da". }  
} 
limit 1000

Follow this link to get to the query and press “Run” to get the results. It is possible to download the table as CSV-formatted (see under “Download”). One issue to note that you have multiple entries for countries with multiple capital cities, e.g., Sydafrika (South Africa) is listed with Pretoria, Kapstaden (Cape Town) and Bloemfontein.

How much does it cost to buy all my scientific articles?

Posted on Updated on

How much does it cost to buy all my scientific articles?

Disregarding the slight difference in exchange rate between the current Euro and USD the answer is around 1’200 USD/Euros. That is the amount of money I would have to pay to download all the scientific articles I have been involved in, – if I did not have access to a university library with subscription. I have signed off the copyright to many articles to a long string of publishers, Elsevier, Wiley, IEEE, Springer, etc., and I no longer control the publication.

I have added a good number of my articles to Wikidata including the price for each article. The SPARQL-based Wikidata Query Service is able to generate a table with the price information, see here. The total sum is also available after a slight modication of the SPARQL query.

The Wikidata Query Service can also generate plots, for instance, of the price per page as a function of the publication date (choose “Graph builder” under “Display”). In the plot below the unit (currency) is mixed USD and Euro. (there seem to be  an issue with the shapes in the legend)

article-prices

Something like 3 to 4 USD/Euros per page seems to what an eyesight averaging comes to.

Among the most expensive articles are the ones from the journal Neuroinformatics published by Springer: 43.69 Euros for each article. Wiley articles cost 38 USD and the Elsevier articles around 36 USD. The Association for Computing Machinery sells their articles for only 15 USD. A bargain.

It may be difficult to find the price of the articles. Science claims that “Science research is available free with registration one year after initial publication.” However, I was not able to get to the full text for The Real Power of Artificial Markets on the Science website. On one page you can stubble onto this: “Purchase Access to this Article for 1 day for US$30.00” and that is what I put into Wikidata. The article is fairly short so this price makes it the priciest article per page.

science-price

I ought to write something discerning about the state of scientific publishing. However, I will instead redirect you to a recent blog post by Tal Yarkoni.

“Overzealous business types”?

Posted on Updated on

The University of Copenhagen and its problematic dismissal of notable scientist Hans Thybo have now landed in an editorial of Nature: “Corporate culture spreads to Scandinavia“. Their concluding claim is that “the threat is the colonization of universities by overzealous business types” (against academic freedom).

Interestingly, though the majority of the university board members is required by law to be from outside the university (not necessarily business), the university management has usually an academic background. And this is also the case for the management around Hans Thybo:

  1. The head of department for Hans Thybo is Claus Beier, see “Hans Thybos institutleder om fyringssagen“. Beier is a PhD and a professor with a long series of publications in climate change as can be studied on Google Scholar.
  2. Dean is John Renner Hansen, see “KU spildte ½ million på konsulentundersøgelse af Thybo for misbrug af forskningsmidler“. He is also researcher and claims to have “Approximately 600 publications in international refereed journals”
  3. Head of the university is Ralf Hemmingsen that I know as a notable researcher in psychiatry.

I am not convinced by the arguments in the Nature editorial which sets up “business types” against academics. I think that the case should rather be seen against the background of the case with Milena Penkowa and another story around the possible abuse of research funds on the Copenhagen University Hospital, see “Ny sag om fusk med penge til forskning“.

Guess which occupation is NOT the most frequent among persons from the Panama Papers

Posted on Updated on

POLITICIAN! Occupation as politician is not very frequent among people in the Panama Papers. This may come as a surprise to those who had studied a bubble chart put in a post on my blog. A sizeable portion of blog readers, tweeters and probably also Facebook users seem to have seriously misunderstood it. The crucial problem with the chart is that it is made from data in Wikidata, which only contains a very limited selection of persons from the Panama Papers. Let me tell you some background and detail the problem:

  1. Open Knowledge Foundation Danmark hosted a 2-hours meetup in Cafe Nutid organized by Niels Erik Kaaber Rasmussen the day after the release of the Panama Papers. We were around 10 data nerds sitting with our laptops and with the provided links most if not all started downloading the Panama Papers data files with the names and company information. Some tried installing the Neo4J database which may help querying the data.
  2. I originally spend most of my time at the cafe looking through the data by simple means. I used something like “egrep -i denmark’ on the officers.csv file. This quick command will likely pull out most of the Danish people in the release Panama Papers. The result of the command is a small manageable list of not more than 70 listings. Among the names I recognized NO politician, neither Danish nor international.
  3. The Danish broadcasting company DR has had a priority access to the data. It is likely they have examined the more complete data in detail. It is also likely that if there had been a Danish politician in the Panama Papers DR would have focused on that, breaking the story. NO such story came.. Thus I think that it is unlikely that there is any Danish politicians in the more complete Panama Papers dataset.
  4. Among the Danish listings in the officers.csv file from the released Panama Papers we found a couple of recognizable names. Among them was the name Knud Foldschack. Already Monday, the day of the release, a Danish newssite had run a media story about that name. One Knud Foldschack is a lawyer who has involved himself in cases for leftwing causes. Having such a lawyer mentioned in the Panama Papers was a too-good-to-be-true media story, – and it was. It turned out that Knud Foldschack had no less than both a father and a brother with the same name, and the newssite now may look forward to meet one of the Foldschacks in court as he wants compensation for being wrongly smeared. His brother seems to be some sort of business man. René Bruun Lauritsen is another name within the Danish part of the Panama Papers. A person bearing that name has had unfavourable mentioning in Danish media. One of the stories was his scheme of selling semen to women in need of a pregnancy. His unauthorized handling of semen with hand delivery got him a bit of a sentence. Another scheme involved outrageous stock trading. Whether Panama-Lauritsen is the same as Semen-Lauritsen I do not know, but one would be disappointed if such an unethical businessman was not in the Panama Papers. A third name shares a fairly unique name with a Danish artist. To my knowledge Danish media had not run any story on that name. But the overall conclusion of the small sample investigated, is that politicians are not present, but names may be related to business persons and possibly an artist.
  5. Wikidata is a site in the Wikipedia family of sites. Though not well-known, the Wikidata site is one of the most interesting projects related to Wikipedia and in terms of main namespace pages far larger than the English Wikipedia. Wikidata may be characterized as the structured cousin of WIkipedia. Rather than edit in free-form natural language as you do in Wikipedia, in Wikidata you only edit in predefined fields. Several thousand types of fields exist. To describe a person you may use fields such as date of birth, occupation, authority identifiers, such as VIAF, homepage and sex/gender.
  6. So what is in Wikidata? Items corresponding to almost all Wikipedia articles appear in Wikidata – not just the articles in the English Wikipedia, but also for every language version of Wikipedia. Apart from these items which can be linked to WIkipedia articles, Wikidata also has a considerable number of other items. For instance, one Dutch user has created items for a great number of paintings for the National Gallery of Denmark, – painting which for the most part have no Wikipedia article in any language. Although Wikidata records an impressive number of items, it does not record everything. The number of persons in Wikidata is only 3276363 at the time of writing and rarely includes persons that hasn’t made his/her mark in media. The typical listing in the Panama Papers is a relative unknown man. He will unlikely appear in Wikidata. And no one adds such a person just because s/he is listed in the Panama Papers. Obviously Wikidata has an extraordinary bias against famous persons: politicians, nobility, sports people, artists, performers of any kind, etc.
  7. Items for persons in Wikidata who also appear in the Panama Papers can indicate a link to the Panama Papers. There is no dedicated way to do this but the  ‘key event’ property has been used for that. It is apparently noted Wikimedian Gerard Meijssen who has made most of these edits. How complete it is with respect to persons in Wikidata I do not know, but Meijssen also added two Danish football players who I believe where only mentioned in Danish media. He could have relied on the English Wikipedia which had a overview of Panama Paper-listed people.
  8. When we have data in Wikidata, there are various ways to query the data and present them. One way use wiki whizkid Magnus Manske’s Listeria service with a query on any Wikipedia. Manske’s tool automagically builds a table with information. Wikimedia Danmark chairman Ole Palnatoke Andersen apparently had discovered Meijssen’s work on Wikidata, and Palnatoke used Manske’s tool to make a table with all people in Wikidata marked with the ‘key event’ “Panama Papers”. It only generates a fairly small list as not that many people in Wikidata are actually linked to the Panama Papers. Palnatoke also let Manske’s tool show the occupation for each person.
  9. Back to the Open Knowledge Foundation meeting in Copenhagen Tuesday evening: I was a bit disappointed not being able to data mine any useful information from the Panama Papers dataset. So after becoming aware of Palnatoke’s table I grabbed (stole) his query statement and modified to count the number of occupations. Wikimedia Foundation – the organization that hosts Wikipedia and Wikidata – has setup a so-called SPARQL endpoint and associated graphical interface. It allows any Web user to make powerful queries across all of Wikidata’s many millions of statements, including the limited number of statements about Panama Papers. The service is under continuous development and has in the past been somewhat unstable, but nevertheless is a very interesting service. Frontend developer Jonas Kress has in 2016 implemented several ways to display the query result. Initially it was just a plain table view, but now features results on a map – if any geocoordinates are along in the query result – and a bubble chart if there is any numerical data in the query result. Other later implemented forms of output results are timelines, multiview and networks. Making a bubble chart with counts of occupations with the SPARQL service is nothing more than a couple of lines of commands in the SPARQL language, and a push on the “Run” button. So the Panama Papers occupation bubble chart should rather be seen as a demonstration of capabilities of Wikidata and its associated services for quick queries and visualizations rather than a faithful representation of occupation of people mentioned in the released Panama Papers.
  10. A sizeable portion of people misunderstood the plot and regarded it as evidence of the dark deeds of politicians. Rather than a good understanding of the technical details of Wikidata, people used their preconceived opinions about politicians to interpret the bubble chart. They were helped along the way by, in my opinion, misleading title (“Panama Papers bubble chart shows politicians are most mentioned in document leak database”) and incomplete explanation in an article of The Independent. On the other hand, Le Monde had a good critical article.
  11. I believe my own blog where I published the plot was not to blame. It does include a SPARQL command so any knowledgeable person can see and modify the results himself/herself. Perhaps some people were confused of my blog describing me as a researcher, – and thought that this was a research result on the Panama Papers.
  12. My blog has in its several years of existence had 20,000 views. The single post with the Panama Papers bubble chart yielded a 10 fold increase in the number of views over the course of a few days, – my first experience with a viral post. Most referrals were from Facebook. The referral does not indicate which page on Facebook it comes from, so it is impossible to join the discussion and clarify any misunderstanding. A portion of referrals also came from Twitter and Reddit where I joined the discussion. Also social media users using the WordPress comment feature on my blog I tried to engage. On Reddit I felt a good response while for Facebook I felt it was irresponsible. Facebook boosts misconceptions and does not let me join the discussion and engage to correct any misconceptions.

    panamabubble
    The plot of a viral post: Views on my blog around the time with the Panama Papers bubble chart publication.
  13. Is there anything I could have done? I could have erased my two tweets and modified my blog post introducing a warning with a stronger explanation.

Summing up my experience with the release of the Panama Papers and the subsequent viral post, I find that our politicians show not to be corrupt and do not deal with shady companies – except for a few cases. Rather it seems that loads of people had preconceived opinions about their politicians and they are willing to spread their ill-founded beliefs to the rest of the world. They have little technical understand and does not question data provenance. The problems may be augmented by Facebook.

And here is the now infamous plot:

PanamaPapersOccupations