The problem with Andreas Krause

I first seem to have run into the name “Andreas Krause” in connection with NIPS 2017. Statistics with the Wikidata Query Service shows “Andreas Krause” to be one of the most prolific authors for that particular conference.

But who is “Andreas Krause”?

Google Scholar lists five “Andreas Krause”: an ETH Zürich machine learning researcher, a pharmacometrics researcher, a wood researcher, an economics/networks researcher working from Bath and a Dresden-based battery/nano researcher. All the NIPS Krause works should likely be attributed to the machine learning researcher, and reading the works reveals the affiliation to be ETH Zürich.

An ORCID search reveals six “Andreas Krause”. Three of them have no or almost no further information beyond the name and the ORCID identifier.

There is also a rheumatologist at Immanuel Krankenhaus Berlin who does not seem to be in Google Scholar.

There may even be more than these six “Andreas Krause”. For instance, the article Emotional Exhaustion and Job Satisfaction in Airport Security Officers – Work–Family Conflict as Mediator in the Job Demands–Resources Model has affiliation with “School of Applied Psychology, University of Applied Sciences and Arts Northwestern Switzerland, Olten, Switzerland”, thus topic and affiliation do not quite fit in with any of the previously mentioned “Andreas Krause”.

One interesting ambiguity is Classification of rheumatoid joint inflammation based on laser imaging, which is obviously a rheumatology work but also has some machine learning aspects. The computer scientist/machine learner Volker Tresp is a co-author and the work is published in an IEEE venue. The paper gives no affiliation for “Andreas Krause”. It is likely the work of the rheumatologist, but you could also guess the machine learner.

Yet another ambiguity is Biomarker-guided clinical development of the first-in-class anti-inflammatory FPR2/ALX agonist ACT-389949. The topic overlaps somewhat with both pharmacokinetics and the domain of the Berlin researcher. The affiliation is “Clinical Pharmacology, Actelion”, but interestingly, Google Scholar does not associate this paper with the pharmacokinetics researcher.

In conclusion, author disambiguation may be very difficult.

Scholia can show the six Andreas Krause. But I am not sure that helps us very much.

Can you scrape Google Scholar?

With the WikiCite project, the bibliographic information on Wikidata is increasing rapidly, with Wikidata now describing 9.3 million scientific articles and 36.6 million citations. As far as I can determine, most of the work is currently done by James Hare and Daniel Mietchen. Mietchen’s Research Bot has over 11 million edits on Wikidata while Hare has 15 million edits. For entering data into Wikidata from PubMed you can basically walk your way through the PMIDs, starting with “1”, with the Fatameh tool. Hare’s reference work can take advantage of a web service provided by the National Institutes of Health. For instance, a URL such as https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=5585223 will return a JSON-formatted result with citation information. This specific URL is apparently what Hare used to set up P2860 citation information in Wikidata, see, e.g., https://www.wikidata.org/wiki/Q41620192#P2860. CrossRef may be another resource.
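As a minimal sketch of how the URL above could be used from Python: the function names are made up for illustration, and the layout of the JSON response (`linksets` containing `linksetdbs` containing `links`) is an assumption about the service that should be checked against an actual response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(pmc_id):
    """Build the elink URL listing the PubMed references of a PMC article."""
    params = {
        "dbfrom": "pmc",
        "linkname": "pmc_refs_pubmed",
        "retmode": "json",
        "id": str(pmc_id),
    }
    return ELINK_URL + "?" + urlencode(params)


def extract_cited_pmids(response):
    """Collect cited PMIDs from a decoded response (assumed layout)."""
    pmids = []
    for linkset in response.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            if linksetdb.get("linkname") == "pmc_refs_pubmed":
                pmids.extend(linksetdb.get("links", []))
    return pmids


def fetch_cited_pmids(pmc_id):
    """Query the service over the network and return the cited PMIDs."""
    with urlopen(build_elink_url(pmc_id)) as handle:
        return extract_cited_pmids(json.load(handle))
```

Calling `build_elink_url(5585223)` reproduces the URL quoted above; `fetch_cited_pmids` needs network access and is kept separate so the parsing can be tried on a saved response.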

Beyond these resources, we could potentially use Google Scholar. A former terms of service/EULA of Google Scholar stated that: “You shall not, and shall not allow any third party to: […] (j) modify, adapt, translate, prepare derivative works from, decompile, reverse engineer, disassemble or otherwise attempt to derive source code from any Service or any other Google technology, content, data, routines, algorithms, methods, ideas design, user interface techniques, software, materials, and documentation; […] “crawl”, “spider”, index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof); (m) create or attempt to create a substitute or similar service or product through use of or access to any of the Service or proprietary information related thereto”. Here, “create or attempt to create a substitute or similar service” is a stopping point.

The Google Scholar terms document now seems to have been superseded by the all-embracing Google Terms of Service. This document seems less restrictive: “Don’t misuse our Services” and “You may not use content from our Services unless you obtain permission from its owner or are otherwise permitted by law.” So it may or may not be ok to crawl and/or use/republish the data from Google Scholar. See also a StackExchange question and another StackExchange question.

The Google robots.txt limits automated access with the following relevant lines:

Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholar_share

“/citations?user=” means that you are allowed to bot access the user profiles. Google Scholar user identifiers may be recorded in Wikidata by a dedicated property, so you could automatically access Google Scholar user profiles from the information in Wikidata.
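The reading of the robots.txt lines above can be sketched in Python. This is only an illustration under stated assumptions: the rules are the seven lines quoted above, the precedence is the "longest matching pattern wins, Allow wins a tie" convention, and `is_allowed` and `pattern_to_regex` are hypothetical helper names. The standard library `urllib.robotparser` is not used here because, as far as I know, it applies the first matching rule and does not handle the `*` wildcard.

```python
import re

# Rules copied from the robots.txt excerpt: (is_allowed, path pattern).
RULES = [
    (False, "/scholar"),
    (False, "/citations?"),
    (True, "/citations?user="),
    (False, "/citations?*cstart="),
    (True, "/citations?view_op=new_profile"),
    (True, "/citations?view_op=top_venues"),
    (True, "/scholar_share"),
]


def pattern_to_regex(pattern):
    """Turn a path pattern into a prefix regex; '*' matches any characters."""
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))


def is_allowed(path, rules=RULES):
    """Apply the longest matching rule; Allow wins a tie; no match means allowed."""
    best = (-1, True)
    for allowed, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            # Tuples compare by length first, then True (Allow) beats False.
            best = max(best, (len(pattern), allowed))
    return best[1]


print(is_allowed("/citations?user=gQVuJh8AAAAJ"))
print(is_allowed("/citations?user=gQVuJh8AAAAJ&cstart=20"))
```

Under these assumptions the plain profile path is allowed, while the same path with a `cstart=` parameter falls under the longer Disallow pattern.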

So if there is some information you can get from Google Scholar, is it worth it?

The Scholia code now adds a googlescholar.py module with some preliminary Google Scholar processing attempts. There is command-line based scraping of a researcher profile. For instance,

python -m scholia.googlescholar get-user-data gQVuJh8AAAAJ

It ain’t working too well. As far as I can determine, you need to page with JavaScript to get more than the initial 20 results (it would be interesting to examine the Publish or Perish software to see how a larger set of results is obtained). Not all bibliographic metadata is available for each item on the Google Scholar page. As far as I can see, there is no DOI and no PubMed identifier, and the author list may be abbreviated with an ellipsis (‘…’). Matching a Google Scholar item with data already present in Wikidata does not seem that straightforward.

It is worth remembering that Wikidata has the P4028 property to link to Google Scholar articles. There ain’t many items using it yet though: 31. It was suggested by Vladimir Alexiev back in May 2017, but it seems that I am the only one using the property. Bot access to the link target provided by P4028 is, as far as I can see from the robots.txt, not allowed.

Danish words for snow

According to Laura Martin, Franz Boas may have been the first to point to the relative richness of Eskimo words for snow; see her article “Eskimo Words for Snow”: A Case Study in the Genesis and Decay of an Anthropological Example, American Anthropologist, 88(2):418. Boas listed aput, qana, piqsirpoz, and qimuqsuq. English may have snow, hail, sleet, ice, icicle, slush, and snowflake, as listed on the English Wikipedia article on Eskimo words for snow. There seem to be more than that, e.g., firn. Danish is not as polysynthetic as Eskimo, but it has lots of compounds, which make it possible to create a good number of words for snow. Most of these words derive from sne and is.

Update 2017-09-13: Added skosse.

Word Translation Explanation
bræ large mass of ice
bundis ice at the bottom of the ocean/sea
drivis ice floating on the water, either “havis” or “søis”
firn firn snow older than a year
fastis
flodis ice from a river
fnug snowflake
frostsne snow below freezing, as opposed to tøsne
fygesne drifting snow
gletsjer/gletscher glacier
gletscheris ice in/from a glacier
grå is first stage of “ungis”, according to DMI
gråhvid is second stage of “ungis”, according to DMI
hagl hail precipitate with small pellets of ice
haglkorn hailstone small pellet of ice
iglo/igloo igloo
havis sea ice ice in the ocean/sea
indlandsis Indlandsisen is the big “iskappe” in Greenland
is ice frozen water that is (usually) transparent
isbarriere the edge of an “isshelf”, according to DMI
isbjerg
isblok block of ice
isfjeld
isflade sheet of ice
isflage floe
isfront the edge of an “isshelf”
isfod ice frozen to the coast or (second meaning) the ice below the water
iskalot ice-covered area near the poles
iskant the edge of a floe
iskappe ice cap very large connected mass of snow, e.g., the one in Greenland
isklump
isknold
iskorn see also “kornsne”
iskrystal
iskugle
isbræ large mass of ice, the same as “bræ”
islag layer of ice, not the same as “isslag”
isnåle
ispartikel
isrand the edge of a floe
isshelf floating glacier
isskorpe layer of ice on top of water or snow
isskosse
isskruning ice pack
isslag glaze, black ice, freezing rain raindrops below freezing that becomes ice when hitting the ground or structure
isstykke a piece of ice
istap  icicle
isterning ice cube
isvand ice water water with ice in it, usually for drinking
isvæg
isø
julesne  Christmas snow snow falling or lying during Christmas
kalvis
kornsne
kunstsne  artificial snow snow artificially made
lavine avalanche
nysne snow recently falling, as opposed to firn
pakis “drivis” with a high concentration, according to DMI
polaris sea ice that has survived at least one summer melting
puddersne powder snow light snow
rim hard rime “white ice that forms when the water droplets in fog freeze to the outer surfaces of objects.” according to English Wikipedia
sjap slush
sjapis slush ice
skosse
slud sleet a mixture of rain and falling snow
sne snow used about falling snow and snow on the ground
snebold snowball snow formed as a ball, often used to throw in a snowball fight
snebunke  pile of snow
snebyge snow shower
snedrive snowdrift
snedrys small amount of precipitation of snow
snedække layer/cover of snow
snefald
snefnug snowflake
snefog snowdrift
snefygning snow in strong wind
snehule snow formed as a cave for fun or survival, see also “igloo”
snehytte more or less the same as an “iglo”
snekorn snow grain
snekrystal
snelag layer of snow
snemand snowman snow formed as a sculpture of a human
snemark field of snow
snemasse mass of snow
snesjap slush
sneskred avalanche snow falling down a slope
snestjerne
snestorm snowstorm
snevejr snow weather with falling snow
tyndis thin ice
sjap sleet
søis “lake ice”
tøris (“tøris” is usually “dry ice”)
tøsne melting snow snow that is melting
ungis Sea ice between “tyndis” and “vinteris”, according to DMI
vinteris

Some information about Scholia

[Figure: co-author-normalized citations plot from Scholia]

Scholia is mostly a web service developed from GitHub at https://github.com/fnielsen/scholia in an open source fashion. It was inspired by discussions at the WikiCite 2016 meeting in Berlin. Anyone can contribute as long as their contribution is under GPL.

I started to write the Scholia code back in October 2016 according to the initial commit at https://github.com/fnielsen/scholia/commit/484104fdf60e4d8384b9816500f2826dbfe064ce. Since then, Daniel Mietchen and Egon Willighagen in particular have joined in, and Egon has lately been quite active.

Users can download the code and run the web service from their own computer if they have a Python Flask development environment. Otherwise the canonical web site for Scholia is https://tools.wmflabs.org/scholia/ which anyone with an Internet connection should be able to view.

So what does Scholia do? The initial “application” was a “static” web page with a researcher profile/CV of myself based on data extracted from Wikidata. It is still available from: http://people.compute.dtu.dk/faan/fnielsenwikidata.html. I added a static page for my research section, DTU Cognitive Systems, showing scientific page production and a coauthor graph. This is available here: http://people.compute.dtu.dk/faan/cognitivesystemswikidata.html.

The Scholia web application was an extension of these initial static pages so a profile page for any researcher or any organization could be made on the fly. And now it is no longer just authors and organizations where there is a profile page, but also works, venues (journals or proceedings), series, publishers, sponsors (funders) and awards. We have also “topics” and individual pages showing specialized information about chemicals, proteins, diseases and biological pathways. A rudimentary search interface is implemented.

The content of the Scholia web pages, with plots and tables, is made from queries to the Wikidata Query Service, the extended SPARQL endpoint provided by the Wikimedia Foundation. We also pull in text from the introduction of the articles in the English Wikipedia. We modify the table output of the Wikidata Query Service so individual items displayed in table cells link back to other items in Scholia.

Egon Willighagen, Daniel Mietchen and I have described Scholia and Wikidata for scientometrics in the 16-page workshop paper “Scholia and scientometrics with Wikidata”, https://arxiv.org/pdf/1703.04222.pdf. The screenshots shown in the paper have been uploaded to Wikimedia Commons. These and other Scholia media files are available in the category page https://commons.wikimedia.org/wiki/Category:Scholia.

Working with Scholia has been a great way to explore what is possible with SPARQL and Wikidata. One plot that I like is the “Co-author-normalized citations per year” plot on the organization pages. There is an example on this page: https://tools.wmflabs.org/scholia/organization/Q24283660. Here the citations to works authored by authors affiliated with the organization in question are counted, normalized by the number of coauthors, and organized in a colored bar chart with respect to year of publication. The colored bar charts have been inspired by the “LEGOLAS” plots of Shubhanshu Mishra and Vetle Torvik.
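To make the normalization concrete, here is a small Python sketch with made-up data. It assumes that each work's citation count is divided by its number of authors and that this share is credited to each affiliated author for the work's publication year; the actual Scholia query may differ in detail, and the function name and record fields are hypothetical.

```python
from collections import defaultdict


def coauthor_normalized_citations(works):
    """Sum citations / number of authors per (year, affiliated author)."""
    totals = defaultdict(float)
    for work in works:
        share = work["citations"] / len(work["authors"])
        for author in work["affiliated_authors"]:
            totals[(work["year"], author)] += share
    return dict(totals)


# Invented example records, not data from Wikidata.
works = [
    {"year": 2015, "citations": 12, "authors": ["A", "B", "C"],
     "affiliated_authors": ["A", "B"]},
    {"year": 2015, "citations": 4, "authors": ["A", "D"],
     "affiliated_authors": ["A"]},
    {"year": 2016, "citations": 3, "authors": ["B"],
     "affiliated_authors": ["B"]},
]

for (year, author), value in sorted(coauthor_normalized_citations(works).items()):
    print(year, author, value)
```

Each (year, author) pair then corresponds to one colored segment of a bar for that year.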

Part of the Python Scholia code will also work as a command-line script for reference management in the LaTeX/BIBTeX environment using Wikidata as the backend. I have used this Scholia scheme for a couple of scientific papers I have written in 2017. The particular script is currently not well developed, so users would need to be indulgent.

Scholia relies on users adding bibliographic data to Wikidata. Tools from Magnus Manske are a great help, as are Fatameh of “T Arrow” and “Tobias1984” and the WikidataIntegrator of the GeneWiki people. Daniel Mietchen, James Hare and a user called “GZWDer” have been very active adding much of the science bibliographic information, and we are now past 2.3 million scientific articles on Wikidata. You can count them with this link: https://tinyurl.com/yaux3uac

My h-index as of June 2017: Coverage of researcher profile sites

The coverage of different researcher profile sites and their citation statistics varies. Google Scholar seems to be the site with the largest coverage; it even crawls and indexes my slides. The open Wikidata is far from there, but may be the only one with machine-readable free access and advanced search.

Below are the citation statistics in the form of the h-index from six different services.

h Service
28 Google Scholar
27 ResearchGate
22 Scopus
22(?) Semantic Scholar
18 Web of Science
8 Wikidata

Semantic Scholar does not give an overview of the citation statistics, and the count is somewhat hidden on the individual article pages. I attempted as best as I could to determine the value, but it might be incorrect.

I compiled similar statistics on 8 May 2017 and reported them in the Wikicite slides (page 42). During the month and a half since that count, the h-index for Scopus has changed from 20 to 22.
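For reference, the h-index is the largest number h such that h of the works have at least h citations each. A minimal Python sketch of that definition follows; the function name and the example counts are illustrative and not taken from any of the services above.

```python
def h_index(citation_counts):
    """Return the largest h such that h works have at least h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Five works with these citation counts give an h-index of 4.
print(h_index([10, 8, 5, 4, 3]))
```

Because each service indexes a different set of citing works, the same list of publications yields different citation counts and therefore different h-values.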

Semantic Scholar is run by the Allen Institute for Artificial Intelligence, a non-profit research institute, so they may be interested in opening up their data for search. An API does, to my knowledge, not (yet?) exist, but they have a gentle robots.txt. It is also possible to download the full Semantic Scholar corpus from http://labs.semanticscholar.org/corpus/. (Thanks to Vladimir Alexiev for bringing my attention to this corpus).

When does an article cite you?

Google Scholar alerted me to a recent citation to my work from Teacher-Student Relationships, Satisfaction, and Achievement among Art and Design College Students in Macau, a paper published in Journal of Education and Practice, a journal whose repute is unknown to me.

In the references, I see a listing of Persistence of Web References in Scientific Research, where I was among the coauthors. So in which context is this paper cited? It seems strange that an article about link rot is cited by an article about teacher-student relationships… Indeed, I cannot find the reference in the body text when I search on the first author’s last name (“lawrence”).

Indeed, there are several other items in the reference list that I cannot find in the text: Joe Smith’s “One of Volvo’s core values”, Strunk et al.’s “The element of style” and Van der Geer’s “The art of writing a scientific article”. Notably, the first four references are out of order in the otherwise alphabetically sorted list of references, so there must be an error. Perhaps it is an error arising from copy-and-paste?

In this case, I would say that, even though I am listed, I am not actually cited by the article. The “fact” of whether it is a citation or not is important to discuss if we want to record the citation in Wikidata, where “Persistence of Web References in Scientific Research” is recorded with the item Q21012586, see also the Scholia entry. Possibly we could record the erroneous citation and then use the Wikidata deprecated rank facility: “Value is known to be wrong but (used to be) commonly believed”.

Some statistics on scholarly data in Wikidata

The Wikicite initiative has spawned a lot of work on bibliographic/source information in Wikidata. Scholarly bibliographic information in particular has been added to Wikidata. Recently James Hare announced that we have over 3 million citations recorded in Wikidata, mostly due to automated additions made by Hare himself.

With the tools of Magnus Manske and James Hare that are presently central to the growth of scholarly bibliographic data on Wikidata, we do not get a direct link to the author items in Wikidata. Such information presently needs to be added manually or in a semi-automated fashion. Sponsor/funding information is not added automatically either, except for a US organization where James Hare added this information.

So how much data do we have in Wikidata when we ask if the data is linked to other Wikidata items? Below are a few queries to the Wikidata Query Service that attempt to answer some aspects of this question.

Scientific articles

How many items do we have in Wikidata that describe a scientific article and that are linked to an author item?

SELECT (COUNT(DISTINCT ?work) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
}

The query returns 45’253.
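The queries in this post can be pasted into the Wikidata Query Service web interface, but they can also be sent from a script. Below is a minimal Python sketch, assuming the public endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout; the helper names and the User-Agent string are placeholders for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

COUNT_QUERY = """
SELECT (COUNT(DISTINCT ?work) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
}
"""


def build_request(query, user_agent="example-count-script/0.1"):
    """Prepare a GET request asking for SPARQL JSON results."""
    url = WDQS_ENDPOINT + "?" + urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": user_agent,  # placeholder; use a descriptive value
    }
    return Request(url, headers=headers)


def first_count(results):
    """Read the ?count variable from the first result row."""
    return int(results["results"]["bindings"][0]["count"]["value"])


def run_count(query):
    """Send the query over the network and return the count."""
    with urlopen(build_request(query)) as handle:
        return first_count(json.load(handle))
```

`run_count(COUNT_QUERY)` needs network access, and the number it returns will differ from the one reported here since Wikidata keeps growing.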

How many scientific articles have one or more author items and no author name string (indicating that the author linking may be complete)?

SELECT (COUNT(DISTINCT ?work) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  FILTER NOT EXISTS { ?work wdt:P2093 ?authorname }
}

This query gives 3’567.

How many items do we have in Wikidata that are claimed to be a scientific article?

SELECT (COUNT(DISTINCT ?work) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
}

This query gives 677’630.

Scientific authors

How many authors are in Wikidata that have written a scientific article?

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
}

The query returns 10’193.

How many authors are in Wikidata that have written a scientific article and where the gender is indicated?

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?author wdt:P21 ?gender .
}

This query gives 8’853.

How many authors are there in Wikidata that have written a scientific article and where the scientific article is recorded as having made one or more citations?

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?work wdt:P2860 ?cited_work .
}

This query returns 6’586.

How many authors are there in Wikidata that have written a scientific article and where the scientific article is recorded as having made one or more citations and the cited work is recorded with one or more author items?

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?work wdt:P2860 ?cited_work .
  ?cited_work wdt:P50 ?cited_author .
}

This query returns 5’614.

How many authors are there in Wikidata that have written a scientific article and where the scientific article is recorded as having made one or more citations and the cited work is recorded with one or more author items and where the genders of both the citing and the cited author are known?

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?work wdt:P2860 ?cited_work .
  ?cited_work wdt:P50 ?cited_author .
  ?author wdt:P21 ?gender .
  ?cited_author wdt:P21 ?cited_gender .
}

This query gives 4’730.

How many authors are there in Wikidata that have written a scientific article and where the scientific article is recorded as having made one or more citations and the cited work is recorded with one or more author items and where the genders of both the citing and the cited author are known and where there is no author name string in either the work or the cited work (indicating that the work and the cited work may be completely linked with respect to author name)?

SELECT (COUNT(DISTINCT ?author) AS ?count)
WHERE {
  ?work wdt:P31 wd:Q13442814 .
  ?work wdt:P50 ?author .
  ?work wdt:P2860 ?cited_work .
  ?cited_work wdt:P50 ?cited_author .
  ?author wdt:P21 ?gender .
  ?cited_author wdt:P21 ?cited_gender .
  FILTER NOT EXISTS { ?work wdt:P2093 ?authorname }
  FILTER NOT EXISTS { ?cited_work wdt:P2093 ?cited_authorname }
}

This query gives only 551.
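To put the author counts above in proportion, here is a small Python sketch that expresses each count as a share of all authors with a scientific article. The counts are the ones reported in this post; the short labels and the function name are mine.

```python
# (short label, count reported above)
AUTHOR_COUNTS = [
    ("authors of a scientific article", 10193),
    ("with gender indicated", 8853),
    ("with a citing work", 6586),
    ("citing a work that has an author item", 5614),
    ("citing and cited author genders known", 4730),
    ("also no author name strings left", 551),
]


def shares(counts):
    """Return (label, count, fraction of the first count) for each row."""
    base = counts[0][1]
    return [(label, count, count / base) for label, count in counts]


for label, count, share in shares(AUTHOR_COUNTS):
    print(f"{count:>6}  {share:6.1%}  {label}")
```

The last row shows how much the requirement of complete author linking shrinks the set: only about one in twenty of the authors remain.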

Sponsor/funders

Sponsors of scientific articles ordered by number of citations.

SELECT ?number_of_citations ?sponsorLabel
WITH {
  SELECT (COUNT(?citing_work) AS ?number_of_citations) ?sponsor
  WHERE {
    ?work wdt:P859 ?sponsor .
    ?work wdt:P31 wd:Q13442814 .
    ?citing_work wdt:P2860 ?work .
  }
  GROUP BY ?sponsor
} AS %result
WHERE {
  INCLUDE %result
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?number_of_citations)
LIMIT 5

This query gives National Institute for Occupational Safety and Health, Lundbeck Foundation, The Danish Council for Strategic Research, National Institute of Allergy and Infectious Diseases, University of Wisconsin–Madison.