With the WikiCite project, the bibliographic information on Wikidata is increasing rapidly: Wikidata now describes 9.3 million scientific articles and 36.6 million citations. As far as I can determine, most of the work is currently done by James Hare and Daniel Mietchen. Mietchen’s Research Bot has made over 11 million edits on Wikidata, while Hare has 15 million edits. For entering data into Wikidata from PubMed, you can basically walk your way through the PMIDs, starting with “1”, with the Fatameh tool. Hare’s reference work can take advantage of a web service provided by the National Institutes of Health. For instance, a URL such as https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=5585223 will return a JSON-formatted result with citation information. This specific URL is apparently what Hare used to set up P2860 citation information in Wikidata, see, e.g., https://www.wikidata.org/wiki/Q41620192#P2860. CrossRef may be another resource.
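As a rough illustration of the elink web service mentioned above, here is a minimal Python sketch that builds the same URL and pulls the cited PMIDs out of the decoded JSON. The function names are my own, and the response layout (`linksets`, then `linksetdbs`, then `links`) is an assumption about the returned JSON that should be checked against an actual response. This is not necessarily how Hare's tooling works.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(pmcid):
    """Build the elink URL listing the PubMed IDs cited by a PMC article."""
    params = {
        "dbfrom": "pmc",
        "linkname": "pmc_refs_pubmed",
        "retmode": "json",
        "id": str(pmcid),
    }
    return ELINK_URL + "?" + urlencode(params)


def extract_cited_pmids(result):
    """Collect cited PMIDs from a decoded elink response.

    Assumes the layout linksets -> linksetdbs -> links; missing keys
    simply yield an empty list.
    """
    pmids = []
    for linkset in result.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            if linksetdb.get("linkname") == "pmc_refs_pubmed":
                pmids.extend(str(link) for link in linksetdb.get("links", []))
    return pmids


def fetch_cited_pmids(pmcid):
    """Query the web service (needs network access) and return cited PMIDs."""
    with urlopen(build_elink_url(pmcid)) as response:
        return extract_cited_pmids(json.load(response))
```

Calling `fetch_cited_pmids(5585223)` would then request the URL shown above. Each returned PMID would still have to be matched to a Wikidata item before a P2860 statement could be added.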
Beyond these resources, we could potentially use Google Scholar. A former terms of service/EULA of Google Scholar stated that: “You shall not, and shall not allow any third party to: […] (j) modify, adapt, translate, prepare derivative works from, decompile, reverse engineer, disassemble or otherwise attempt to derive source code from any Service or any other Google technology, content, data, routines, algorithms, methods, ideas design, user interface techniques, software, materials, and documentation; […] “crawl”, “spider”, index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof); (m) create or attempt to create a substitute or similar service or product through use of or access to any of the Service or proprietary information related thereto”. Here, “create or attempt to create a substitute or similar service” is a stopping point.
The Google Scholar terms document now seems to have been superseded by the all-embracing Google Terms of Service. This document seems less restrictive: “Don’t misuse our Services” and “You may not use content from our Services unless you obtain permission from its owner or are otherwise permitted by law.” So it may or may not be OK to crawl and/or use/republish the data from Google Scholar. See also a StackExchange question and another StackExchange question.
The Google robots.txt limits automated access with the following relevant lines:
Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholar_share
“/citations?user=” means that bots are allowed to access the user profiles. Google Scholar user identifiers may be recorded in Wikidata with a dedicated property, so you could automatically access Google Scholar user profiles from the information in Wikidata.
So if there is some information you can get from Google Scholar, is it worth it?
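To make the reading of these rules concrete, here is a small Python sketch that evaluates a path against the lines quoted above. It assumes the longest-match interpretation used by Google-style robots.txt parsers (the most specific matching pattern wins, `Allow` wins a tie, `*` is a wildcard), and it assumes the quoted lines apply to your user agent. The function names are my own, and this is not a full robots.txt parser.

```python
import re

# Rules copied from the robots.txt excerpt above.
RULES = [
    ("Disallow", "/scholar"),
    ("Disallow", "/citations?"),
    ("Allow", "/citations?user="),
    ("Disallow", "/citations?*cstart="),
    ("Allow", "/citations?view_op=new_profile"),
    ("Allow", "/citations?view_op=top_venues"),
    ("Allow", "/scholar_share"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex matched from the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Return True if the path (including query string) may be fetched."""
    best_length, allowed = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            is_allow = directive == "Allow"
            longer = len(pattern) > best_length
            tie_for_allow = len(pattern) == best_length and is_allow
            if longer or tie_for_allow:
                best_length, allowed = len(pattern), is_allow
    return allowed
```

Under these assumptions, `is_allowed("/citations?user=gQVuJh8AAAAJ")` is true, while a paginated profile URL containing `cstart=` and any `/scholar?...` search URL are refused.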
python -m scholia.googlescholar get-user-data gQVuJh8AAAAJ
It is worth remembering that Wikidata has the P4028 property to link to Google Scholar articles. There are not many items using it yet, though: 31. It was suggested by Vladimir Alexiev back in May 2017, but it seems that I am the only one using the property. Bot access to the link target provided by P4028 is, as far as I can see from the robots.txt, not allowed.
The coverage of different researcher profile sites and their citation statistics varies. Google Scholar seems to be the site with the largest coverage; it even crawls and indexes my slides. The open Wikidata is far from there, but may be the only one with machine-readable free access and advanced search.
Below are the citation statistics in the form of the h-index from five different services.
| 18 | Web of Science |
Semantic Scholar does not give an overview of the citation statistics, and the count is somewhat hidden on the individual article pages. I attempted as best I could to determine the value, but it might be incorrect.
I compiled similar statistics on 8 May 2017 and reported them on the slides Wikicite (page 42). During the one and a half months since that count, the statistics for Scopus have changed from 20 to 22.
Semantic Scholar is run by the Allen Institute for Artificial Intelligence, a non-profit research institute, so they may be interested in opening up their data for search. An API does, to my knowledge, not (yet?) exist, but they have a gentle robots.txt. It is also possible to download the full Semantic Scholar corpus from http://labs.semanticscholar.org/corpus/. (Thanks to Vladimir Alexiev for bringing my attention to this corpus).
Recently Lambert Heller wrote an overview piece on websites for scholarly profile pages: “What will the scholarly profile page of the future look like? Provision of metadata is enabling experimentation”. There he tabulated the features of the various online sites having scholarly profile pages. These sites include (with links to my entries): ORCID, ResearchGate, Mendeley, Pure and VIVO (I don’t know these two), Google Scholar and Impactstory. One site missing from the equation is Wikidata. It can produce scholarly profile pages too. The default Wikidata editing interface may not present the data in a nice way (Magnus Manske’s Reasonator does it better), but very much of the functionality is there to make a scholarly profile page.
In terms of the features listed by Heller, I will here list the possible utilization of Wikidata:
- Portrait picture: The P18 property can record a Wikimedia Commons image related to a researcher. For instance, you can see a nice photo of neuroimaging professor Russ Poldrack.
- Researcher’s alternative names: This is possible with the alias functionality in Wikidata. Poldrack is presently recorded with the canonical label “Russell A. Poldrack” and the alternative names “Russell A Poldrack”, “R. A. Poldrack”, “Russ Poldrack” and “R A Poldrack”. It is straightforward to add more variations.
- IDs/profiles in other systems: There are absolutely loads of these links in Wikidata. To name a few deep linking possibilities: Twitter, Google Scholar, VIAF, ISNI, ORCID, ResearchGate, GitHub and Scopus. Wikidata is very strong in interlinking databases.
- Papers and similar: Papers are presented as items in Wikidata and these items can link to the author via P50. The reverse link is possible with a SPARQL query. Furthermore, on the researcher’s item it is possible to list main works with the appropriate property. Full texts can be linked with the P953 property. PDFs of papers with an appropriate compatible license can be uploaded to Wikimedia Commons and/or included in Wikisource.
- Uncommon research product: I am not sure what this is, but the developer of software services is recorded in Wikidata. For instance, for the neuroinformatics database OpenfMRI it is specified that Poldrack is the creator. Backlinks are possible with SPARQL queries.
- Grants, third party funding: Well, there is a sponsor property, but how it should be utilized for researchers is not clear. With the property, you can specify that a paper or research project was funded by an entity. For the paper The Center for Integrated Molecular Brain Imaging (Cimbi) database you can see that it is funded by the Lundbeck Foundation and Rigshospitalet.
- Current institution: Yes. The employer and affiliation properties are there for you. You can see an example of an incomplete list of people affiliated with research sections at my department, DTU Compute, here, automagically generated by Magnus Manske’s Listeria tool.
- Former employers, education etc.: Yes. There is a property for employer and for affiliation and for education. With qualifiers you can specify the dates of employment.
- Self assigned keywords: Well, as a Wikidata contributor you can create new items and you can use these items for specifying your field of work or to label your paper with a main theme.
- Concept from controlled vocabulary: Whether Wikidata is a controlled vocabulary is up for discussion. Wikidata items can be linked to controlled vocabularies, e.g., Dewey’s, so there you can get some degree of control. For instance, the concept “engineer” in Wikidata is linked to BNCF, NDL, GND, ROME, LCNAF, BNF and FAST.
- Social graph of followers/friends: No, that is really not possible on Wikidata.
- Social graph of coauthors: Yes, that is possible. With Jonas Kress’ work on D3-enabled graph rendering, you get on-the-fly graph rendering in the Wikidata Query Service. You can see my coauthor graph here (it is wobbly at the moment; there is some D3 parameter that needs a tweak).
- Citation/attention metadata from platform itself: No, I don’t think so. You can get page view data from somewhere on the Wikimedia sites. You can also count the number of citations on-the-fly, – to an author, to a paper, etc.
- Citation/attention metadata from other sources: No, not really.
- Comprehensive search to match/include own papers: Well, perhaps not. Or perhaps. Magnus Manske’s sourcemd and quickstatement tools allow you to copy-paste a PMID or DOI into a form field and press two buttons to grab bibliographic information from PubMed and a DOI source. One-click full paper upload is not well supported, to my knowledge. Perhaps Daniel Mietchen knows something about this.
- Forums, Q&A, etc.: Well, yes and no. You can use the discussion pages on Wikidata, but these pages are perhaps mostly for discussion of editing, rather than the content of the described item. Perhaps Wikiversity could be used.
- Deposit own papers: You can upload appropriately licensed papers to Wikimedia Commons or perhaps Wikisource. Then you can link them from Wikidata.
- Research administration tools: No.
- Reuse of data from outside the service: You better believe it! Although Wikidata is there to be used, a mass download from the Wikidata Query Service can run into timeout problems. To navigate the structure of individual Wikidata items, you need programming skills, at least for the moment. If you are really desperate, you can download the Wikidata dump and Blazegraph and try to set up your own SPARQL server.
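Several items in the list above rely on SPARQL queries: the reverse link from an author to the papers via P50, and the on-the-fly citation count via P2860. A minimal Python sketch of how such queries might be put together for the Wikidata Query Service follows. The function names are my own, and the author identifier is a parameter you must supply yourself (the Q-number of the researcher’s item); the sketch only builds the query and the request URL and does not contact the service.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def works_by_author_query(author_qid):
    """SPARQL for the reverse P50 link: works that list the given author."""
    return (
        "SELECT ?work ?workLabel WHERE {\n"
        f"  ?work wdt:P50 wd:{author_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def citation_count_query(author_qid):
    """SPARQL counting P2860 statements that point to the author's works."""
    return (
        "SELECT (COUNT(?citing) AS ?citations) WHERE {\n"
        f"  ?work wdt:P50 wd:{author_qid} .\n"
        "  ?citing wdt:P2860 ?work .\n"
        "}"
    )


def query_url(sparql):
    """URL that asks the Wikidata Query Service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})
```

The count covers only citations that have actually been entered in Wikidata, and it includes self-citations, so it will usually be lower than and not directly comparable to the numbers from Google Scholar or Scopus.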
Suppose you want to measure the performance of individual researchers in a university department. Which variables can you get hold of, and how relevant would they be for measuring academic performance?
Here is my take on it:
- Google Scholar citations number. Google Scholar records total number of citations, h-index and i10-index as well as the numbers for a fixed period.
- Scopus citation numbers.
- Twitter. The number of tweets and the number of followers would be relevant.
One issue here is that the number of tweets may not be relevant to the academic performance and it is also susceptible to manipulation. Interestingly there has been a comparison between Twitter numbers and standard citation counts with a coefficient between the two numbers named the Kardashian index.
- Wikidata and Wikipedia presence. Whether Wikidata has an item for the researcher, the number of Wikipedia articles about the researcher, the number of bytes they span, and the number of the researcher’s articles recorded in Wikidata. There is an API to get these numbers, and, interestingly, Wikidata can record a range of other identifiers for Google Scholar, Scopus, Twitter, etc., which would make it a convenient open database for keeping track of researcher identifiers across sites of scientometric relevance.
The number of citations in Wikipedia to the work of a researcher would be interesting to have, but is somewhat more difficult to automatically obtain.
The numbers from Wikipedia and Wikidata can be manipulated to some extent.
- Stackoverflow/Stackexchange points in relevant areas. The question/answering sites under the Stackexchange umbrella include a range of sites that are of academic interest. In my area, e.g., Stackoverflow and Cross Validated.
- GitHub repositories and stars.
- Publication download counts. For instance, my department has a repository with papers and the backend keeps track of statistics. The most downloaded papers tend to be introductory material and overviews.
- ResearchGate numbers: Publications, reads, citations and impact points.
- ResearcherID (Thomson Reuters) numbers: total articles in publication list, articles with citation data, sum of the times cited, average citations per article, h-index.
- Microsoft Academic Search numbers.
- Count in the dblp computer science bibliography (the Trier database).
- Count of listings in ArXiv.
- Counts in Semantic Scholar.
- ACM digital library counts.
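Several of the sources in the list above report an h-index, and Google Scholar also reports an i10-index. For reference, here is a minimal Python sketch of how these two numbers can be computed from a list of per-paper citation counts, using the standard definitions: the h-index is the largest h such that h papers have at least h citations each, and the i10-index is the number of papers with at least 10 citations. The function names are my own, and the citation counts in the usage note are made up for illustration.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only get smaller from here on
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)
```

With the made-up counts `[10, 8, 5, 4, 3]`, `h_index` gives 4 and `i10_index` gives 1. Because the services index different sets of papers and citations, the same researcher will generally get a different h-index from each of them.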
I have just received a citation alert from the Google Scholar system as I was cited in http://firstmonday.org/ojs/index.php/fm/article/view/3203/3019
Interestingly, the alert did not come from the First Monday journal directly but from a paper on firstmonday.insurancetribe.com (see the excerpt below). To me it seems that insurancetribe.com is abusing First Monday material on their site. Their URL redirects to homesecurityfix.com. This must be spam.
[HTML] Font Size Current Issue Atom logo
<http://scholar.google.com/scholar_url?url=http://firstmonday.insurancetribe.com/ojs/index.php/fm/article/view/3203/3019>
D Geifman, DR Raban, R Sheizaf
Abstract Prediction Markets are a family of Internet–based social
which use market price to aggregate and reveal information and opinion
audiences. The considerable complexity of these markets inhibited the
full realization of *…*
When I last checked, Google Scholar redirected to the spam site. However, I cannot find the insurancetribe version among the indexed versions now.
I typed in many of the publications from our Responsible Business in the Blogosphere project into the Brede Wiki together with an identifier for Google Scholar for each publication. With a script (do not run code you obtained from a wiki!) I am able to collect the citation information from Google Scholar using the MediaWiki API, the categories and the templates. Here it is sorted according to number of citations.