scientometrics

My h-index as of June 2017: Coverage of researcher profile sites


The coverage of researcher profile sites and their citation statistics varies. Google Scholar seems to be the site with the largest coverage – it even crawls and indexes my slides. The open Wikidata is far from that level, but it may be the only one with machine-readable free access and advanced search.

Below are the citation statistics in the form of the h-index from six different services; after the table, a small sketch shows how the Wikidata number could be computed.

h Service
28 Google Scholar
27 ResearchGate
22 Scopus
22(?) Semantic Scholar
18 Web of Science
8 Wikidata
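A rough sketch, assuming the researcher's works are linked with the author property (P50) and citations are recorded with the 'cites work' property (P2860), of how the Wikidata number could be computed from the Wikidata Query Service. The item Q42 (Douglas Adams) is only a placeholder for the researcher's Q-identifier:

```python
import requests

# Per-work citation counts for works authored (P50) by a given researcher.
# Q42 (Douglas Adams) is a placeholder; substitute the researcher's Q-identifier.
query = """
SELECT ?work (COUNT(?citer) AS ?citations) WHERE {
  ?work wdt:P50 wd:Q42 .
  OPTIONAL { ?citer wdt:P2860 ?work . }
}
GROUP BY ?work
"""
response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': query, 'format': 'json'})
counts = sorted((int(row['citations']['value'])
                 for row in response.json()['results']['bindings']),
                reverse=True)

# h-index: the largest h such that h works have at least h citations each
h_index = sum(1 for rank, count in enumerate(counts, start=1) if count >= rank)
print(h_index)
```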

Semantic Scholar does not give an overview of the citation statistics, and the count is somewhat hidden on the individual article pages. I determined the value as best I could, but it might be incorrect.

I compiled similar statistics on 8 May 2017 and reported them in my Wikicite slides (page 42). During the one and a half months since that count, the Scopus figure has changed from 20 to 22.

Semantic Scholar is run by the Allen Institute for Artificial Intelligence, a non-profit research institute, so they may be interested in opening up their data for search. To my knowledge, an API does not (yet?) exist, but they have a gentle robots.txt. It is also possible to download the full Semantic Scholar corpus from http://labs.semanticscholar.org/corpus/. (Thanks to Vladimir Alexiev for bringing this corpus to my attention.)
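As a rough illustration of what can be done with the corpus download, here is a minimal Python sketch that streams one of the gzipped JSON-lines corpus files and counts incoming citations for papers matching an author name. The file name and the field names ('title', 'authors', 'inCitations') are assumptions that should be checked against the corpus documentation:

```python
import gzip
import json

def incoming_citations(filename, author_name='Nielsen'):
    """Stream a gzipped JSON-lines corpus file and return a dictionary
    mapping paper titles to the number of incoming citations for papers
    with a matching author name (field names are assumptions)."""
    results = {}
    with gzip.open(filename, 'rt', encoding='utf-8') as fid:
        for line in fid:
            paper = json.loads(line)
            authors = [author.get('name', '') for author in paper.get('authors', [])]
            if any(author_name in name for name in authors):
                results[paper.get('title', '')] = len(paper.get('inCitations', []))
    return results

# Hypothetical file name from the corpus download:
# for title, count in incoming_citations('papers-2017-02-21.json.gz').items():
#     print(count, title)
```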

When does an article cite you?


Google Scholar alerted me to a recent citation of my work in Teacher-Student Relationships, Satisfaction, and Achievement among Art and Design College Students in Macau, a paper published in the Journal of Education and Practice, a journal of repute unknown to me.

In the references, I see a listing of Persistence of Web References in Scientific Research, where I was among the coauthors. So in which context is this paper cited? It seems strange that an article about link rot is cited by an article about teacher-student relationships… Indeed, I cannot find the reference in the body text when I search for the first author's last name ("lawrence").

Indeed, there are several other items in the reference list that I cannot find cited in the text: Joe Smith's "One of Volvo's core values", Strunk et al.'s "The element of style" and Van der Geer's "The art of writing a scientific article". It is notable that the first four references are out of order in the otherwise alphabetically sorted reference list, so there must be an error. Perhaps it arose from a copy-and-paste mistake?

In this case, I would say that, even though it is listed, my work is not actually cited by the article. Whether or not it counts as a citation is important to discuss if we want to record the citation in Wikidata, where "Persistence of Web References in Scientific Research" is recorded as the item Q21012586 (see also the Scholia entry). Possibly we could record the erroneous citation and then use the Wikidata deprecated rank facility: "Value is known to be wrong but (used to be) commonly believed".
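A small sketch of how such a recorded citation and its rank could be inspected with the Wikidata Query Service, assuming the citation is stated with the 'cites work' property (P2860); a deprecated statement would then show up with the rank wikibase:DeprecatedRank:

```python
import requests

# List works citing Q21012586 together with the rank of each
# 'cites work' (P2860) statement.
query = """
SELECT ?citing ?citingLabel ?rank WHERE {
  ?citing p:P2860 ?statement .
  ?statement ps:P2860 wd:Q21012586 ;
             wikibase:rank ?rank .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': query, 'format': 'json'})
for row in response.json()['results']['bindings']:
    print(row['rank']['value'], row['citingLabel']['value'])
```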

Altmetrics for a department


Suppose you want to measure the performance of the individual researchers of a university department. Which variables can you get hold of, and how relevant would they be for measuring academic performance?

Here is my take on it:

  1. Google Scholar citation numbers. Google Scholar records the total number of citations, the h-index and the i10-index, as well as the same numbers for a recent fixed period.
  2. Scopus citation numbers.
  3. Twitter. The number of tweets and the number of followers would be relevant.
    One issue here is that the number of tweets may not be relevant to academic performance, and it is also susceptible to manipulation. Interestingly, there has been a comparison between Twitter numbers and standard citation counts, with a coefficient relating the two numbers named the Kardashian index.
  4. Wikidata and Wikipedia presence. Whether Wikidata has an item for the researcher, the number of Wikipedia articles about the researcher, the number of bytes those articles span, and the number of the researcher's articles recorded in Wikidata. There is an API to get these numbers (a small sketch follows after this list), and – interestingly – Wikidata can record a range of other identifiers for Google Scholar, Scopus, Twitter, etc., which makes it a convenient open database for keeping track of researcher identifiers across sites of scientometric relevance.
    The number of citations in Wikipedia to the work of a researcher would be interesting to have, but it is somewhat more difficult to obtain automatically.
    The Wikipedia and Wikidata numbers are somewhat manipulable.
  5. Stackoverflow/Stackexchange points in relevant areas. The question-and-answer sites under the Stackexchange umbrella include a range of sites of academic interest, in my area, e.g., Stackoverflow and Cross Validated.
  6. GitHub repositories and stars.
  7. Publication download counts. For instance, my department has a repository with papers and the backend keeps track of statistics. The most downloaded papers tend to be introductory material and overviews.
  8. ResearchGate numbers: Publications, reads, citations and impact points.
  9. ResearcherID (Thomson Reuters) numbers: total articles in the publication list, articles with citation data, sum of the times cited, average citations per article, h-index.
  10. Microsoft Academic Search numbers.
  11. Count in the dblp computer science bibliography (the Trier database).
  12. Count of listings in ArXiv.
  13. Counts in Semantic Scholar.
  14. ACM digital library counts.
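For item 4, here is a minimal sketch, assuming the researcher's Wikidata Q-identifier is known: the Wikidata API gives the item's Wikipedia sitelinks, and the ordinary MediaWiki API gives the byte length of each article. Q42 (Douglas Adams) is used only as a stand-in identifier:

```python
import requests

def wikipedia_presence(qid):
    """Return {language: (title, bytes)} for Wikipedia articles linked
    from the Wikidata item `qid`."""
    # Sitelinks recorded on the Wikidata item
    data = requests.get(
        'https://www.wikidata.org/w/api.php',
        params={'action': 'wbgetentities', 'ids': qid,
                'props': 'sitelinks', 'format': 'json'}).json()
    sitelinks = data['entities'][qid].get('sitelinks', {})

    presence = {}
    for site, link in sitelinks.items():
        # Rough filter for ordinary Wikipedias such as 'enwiki' and 'dawiki'
        if not site.endswith('wiki') or site in (
                'commonswiki', 'specieswiki', 'metawiki',
                'mediawikiwiki', 'wikidatawiki'):
            continue
        language = site[:-len('wiki')]
        info = requests.get(
            'https://%s.wikipedia.org/w/api.php' % language,
            params={'action': 'query', 'prop': 'info',
                    'titles': link['title'], 'format': 'json'}).json()
        page = next(iter(info['query']['pages'].values()))
        presence[language] = (link['title'], page.get('length', 0))
    return presence

# Example with the standard Wikidata test item Q42 (Douglas Adams):
# print(wikipedia_presence('Q42'))
```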

 

Did Pinski’s and Narin’s ‘basic research’ have any influence on PageRank?


In a letter to the Danish newspaper Politiken, a group of young researchers – Anders Søgaard, Rebecca Adler-Nissen, Steffen Dalsgaard, Vibe Gedsø Frøkjær, Kristin Veel, Sune Lehmann and Kresten Lindorff-Larsen – argued against letting business gain too much influence over the universities. Among their arguments was an example involving the Pinski-Narin paper:

“A good example of basic research that has made a huge economic difference is Gabriel Pinski’s and Francis Narin’s article about citation analysis from 1976. That article made possible the PageRank algorithm, which is still used in Google Search. According to some statistics, Google Search can account for 2 percent of the world’s GDP, all because of research into how researchers cite each other’s articles” (translated from Danish)

The specific paper is Citation influence for journal aggregates of scientific publications: theory, with application to the literature of physics, published in ‘Information Processing & Management’. In this paper the two researchers set up a citation matrix corresponding to a graph where the nodes are “publishing entities” such as “journals, institutions, individuals, fields of research, geographical subdivisions or levels of research methodology”. They perform an eigenvalue computation to find the ‘influence’ of the publishing entities. The method is demonstrated on the citation network between physics journals.
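This is not Pinski’s and Narin’s exact weighting scheme, but a toy sketch of the kind of eigenvector computation involved: a small made-up citation matrix between three ‘publishing entities’ is column-normalized, and power iteration finds the principal eigenvector as an influence score:

```python
import numpy as np

# Toy citation matrix: entry [i, j] is the number of citations
# from publishing entity j to publishing entity i (made-up numbers).
citations = np.array([[0., 3., 1.],
                      [2., 0., 4.],
                      [5., 1., 0.]])

# Column-normalize so each citing entity distributes one unit of influence
transition = citations / citations.sum(axis=0)

# Power iteration towards the principal eigenvector
influence = np.ones(3) / 3
for _ in range(100):
    influence = transition @ influence
    influence /= influence.sum()

print(influence)  # relative 'influence' of the three entities
```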

Is Pinski-Narin basic research, and did it influence Brin and Page towards PageRank?

Interestingly, the two researchers were not university researchers. According to the information in the article, Pinski and Narin worked at the company “Computer Horizons, Inc” as President and Research Advisor, supported by the National Science Foundation.

The Pinski-Narin paper is cited by Jon Kleinberg in his Hubs, authorities, and communities paper from December 1999. Pinski-Narin is also cited by Kleinberg’s Authoritative Sources in a Hyperlinked Environment that was published as an IBM research report in May 1997, i.e., a company report.

Brin’s and Page’s famous article The anatomy of a large-scale hypertextual Web search engine (written while they were students at Stanford University) makes no mention of Pinski and Narin. So were they not aware of it? Initially I thought so.

However, Brin’s and Page’s paper cites Kleinberg’s ‘Authoritative Sources in a Hyperlinked Environment’, which has information about Pinski-Narin, so if Brin and Page read Kleinberg’s paper they must have known about Pinski-Narin – at least in the latter part of 1997.

The Brin-Page paper is from the Seventh International World Wide Web Conference, which was held in April 1998 with a submission deadline in December 1997. The tracing of PageRank leads further back to Lawrence Page’s patent US 6285999, with a filing date in January 1998 and a priority date in January 1997. This patent has a citation to Pinski-Narin. It is not clear when the citation was added to the patent; I suppose it could have been added at any time between the writing process leading up to the priority date in 1997 and the publication date in 2001. I have not been able to find information about whether Pinski-Narin influenced Page towards PageRank, but in late 1997 they must have been aware of the paper, so it is not at all unlikely that they were inspired by it. However, as an argument for keeping business out of universities, the PageRank/Pinski-Narin issue seems a poor example, because Pinski-Narin came from a company.

The entire field of scientometrics has depended quite heavily on data from the Science Citation Index (SCI), a database from the company ‘Institute for Scientific Information’. Indeed, the Pinski-Narin paper used data from SCI. The scientometrics field is still dominated by commercial interests: Thomson Reuters now owns SCI, Elsevier has Scopus, and Google has Google Scholar. Also note that CiteSeer/ResearchIndex was developed, not by a university, but by the American research branch of the Japanese company NEC. And in turn (according to Wikipedia) SCI was “heavily influenced” by the non-academic Shepard’s Citations.

Interestingly, Massimo Franceschet has written on the history of PageRank in “PageRank: Standing on the shoulders of giants”, tracing it back to Wassily W. Leontief in 1941. Wikipedia’s PageRank article also mentions Yanhong Li’s work Toward a qualitative search engine and US 5920859. At the time Li worked for the company “GARI Software/IDD Information Services” and later cofounded Baidu.

It may be worth noting the lack of references in the Pinski-Narin paper. It has no citation to, e.g., Leo Katz’s 1953 paper or to Leontief. Perhaps they were unaware of the research in these other areas?

Although PageRank can be said to depend on university-based basic research, such as that of the German-speaking mathematicians Oskar Perron, Ferdinand Georg Frobenius and Richard von Mises, the work at the Computer Horizons company is not an example of university-based basic research.

One final note: though some academics may see PageRank as an example of basic numerical research yielding a company of great economic value, I see it as only one component of Google’s success. The application of low-cost Linux computers together with a non-intrusive, quick-responding interface may well explain more of the success. Linux, inspired by the academic MINIX, is mostly a non-academic endeavor.

Google scholar citations for Responsible Business in the Blogosphere project


GS Year First author Title
35 2011 Finn Årup Nielsen A new ANEW: evaluation of a word list for sentiment analysis in microblogs
24 2011 Gerardo Patriotta Maintaining legitimacy: controversies, orders of worth and public justifications
19 2011 Annemette Leonhardt Kjærgaard Mediating identity: a study of media influence on organizational identity construction in a celebrity firm
16 2011 Lars Kai Hansen Good friends, bad news – affect and virality in Twitter
10 2012 Adam Arvidsson Value in informational capitalism and on the Internet
8 2010 Adam Arvidsson The ethical economy: new forms of value in the information society
7 2011 Adam Arvidsson Ethics and value in customer co-production
5 2012 Chitu Okoli The people’s encyclopedia under the gaze of the sages: a systematic review of scholarly research on Wikipedia
5 2013 Finn Årup Nielsen Wikipedia research and tools: review and comments
4 2011 Toke Jansen Hansen Non-parametric co-clustering of large scale sparse bipartite networks on the GPU
4 2011 Friederike Schultz Strategic framing in the BP crisis: a semantic network analysis of associative frames
1 2012 Michael Kai Petersen Cognitive semantic networks: emotional verbs throw a tantrum but don’t bite
1 2010 Michael Etter On relational capital in social media
1 2011 Mette Morsing State-owned enterprises: a corporatization of governments
0 2013 Elanor Colleoni CSR communication for organizational legitimacy in social media
0 2013 Anne Vestergaard Humanitarian appeal and the paradox of power
0 2011 Bjarne Ørum Wahlgreen Large scale topic modeling made practical
0 2012 Michael Kai Petersen On an emotional node: modeling sentiment in graphs of action verbs
0 ? Friederike Schultz The construction of corporate social responsibility in network society: a communication
0 2013 Adam Arvidsson The potential of consumer publics

I entered many of the publications from our Responsible Business in the Blogosphere project into the Brede Wiki, together with a Google Scholar identifier for each publication. With a script (do not run code you obtained from a wiki!) I can collect the citation information from Google Scholar using the MediaWiki API, the categories and the templates; a rough sketch of that part is shown below. The table above is sorted according to the number of citations.
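The script itself is not reproduced here, but the MediaWiki API part could look roughly like the sketch below. The API endpoint URL, the category name and the ‘gscholar’ template parameter assumed to hold the Google Scholar cluster identifier are all guesses for the purpose of illustration:

```python
import re
import requests

API_URL = 'http://neuro.compute.dtu.dk/w/api.php'  # assumed Brede Wiki endpoint
CATEGORY = 'Category:Papers'                       # assumed category name

session = requests.Session()

# Pages in the category
members = session.get(API_URL, params={
    'action': 'query', 'list': 'categorymembers',
    'cmtitle': CATEGORY, 'cmlimit': 'max', 'format': 'json'}
).json()['query']['categorymembers']

for member in members:
    # Raw wikitext of each paper page
    pages = session.get(API_URL, params={
        'action': 'query', 'prop': 'revisions', 'rvprop': 'content',
        'titles': member['title'], 'format': 'json'}
    ).json()['query']['pages']
    wikitext = next(iter(pages.values()))['revisions'][0]['*']

    # Assumed template parameter holding the Google Scholar cluster identifier
    match = re.search(r'gscholar\s*=\s*(\d+)', wikitext)
    if match:
        print(member['title'])
        print('  https://scholar.google.com/scholar?cluster=' + match.group(1))
```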

Are you on Google Scholar?



Google introduced (was it a few weeks ago?) a new version of Google Scholar where you, as a scientist, can claim your name and the scientific papers that you have authored. Previously you could just search, e.g., to get your papers listed; see my previous blog post. However, if you have a common name, e.g., “J. Larsen”, you would run into the problem that your publications would be entangled with the publications of other people called “J. Larsen” or “RJ Larsen” or “JC Larsen”, etc. With the new system it almost seems that Google does co-author mining so that it is better at distinguishing similarly named authors. Furthermore – and most importantly – with a Google Scholar account you can claim your papers, which solves the ambiguity problem, and you can add and merge papers. Editing functionality was already present in CiteSeer long ago (if I remember correctly), and in Microsoft Academic Search you can also edit the publication list.

You can see my Google Scholar account here. By a strange coincidence, I have found that my number of citations is presently exactly the same as that of one of my co-authors, Cyril Goutte: 1668.

The new Google Scholar functionality does not seem that good at discovering new relevant papers, e.g., papers that cite you. There the old-fashioned Google Scholar email alerts seem better. What it does provide is a nice overview for h-index junkies. The number is automatically computed, making Google Scholar a serious competitor to the pay-walled ISI Web of Science.

ReaderMeter: readership analytics for scientists


For those into social web metrics, here is yet another one: ReaderMeter! It is based on bookmark data from the research paper sharing site Mendeley, and ReaderMeter computes a bookmark-based h-index to evaluate one’s readership impact. Thus it is mostly for scientists.

The ReaderMeter service was constructed by Dario Taraborelli, who also co-organized the altmetrics workshop recently held in Koblenz. He is now at the Wikimedia Foundation and furthermore co-organizes the WikiViz Wikipedia visualization challenge.

As with Google Scholar and Microsoft Academic Search, ReaderMeter has problems with name variations and disambiguation. I am found under Finn A Nielsen, Finn Arup Nielsen, Finn Nielsen and Finn Aarup Nielsen. Taraborelli assures us that “Spelling variants will be addressed in the next major upgrade.” My “Finn Nielsen” clashes with another “Finn Nielsen”. As with Google Scholar and the Microsoft service, there are also identification issues for individual papers: there are two listed items (1 and 2) with the same DOI in ReaderMeter.

Below I have tried to aggregate my papers across naming variations:

# Article Readership/Bookmarks
1 Nielsen 2006 23
2 Nielsen 2009 16
3 Frokjaer 2008 14
4 Nielsen 2007 14
5 Hansen 2011 13
6 Balslev 2006 13
7 Kalbitzer 2009 12
8 Balslev 2005 12
9 Pennock 2001 11
10 Kalbitzer 2010 11
11 Nielsen 2009 11
12 Nielsen 2005 9
13 Balslev 2002 8

So the metric seems to be somewhat skewed and independent of the Google Scholar citations… Or is it? Where is my co-author Cyril Goutte, who wrote highly cited papers 10 years ago?

Goutte’s Modeling the hemodynamic response in fMRI using smooth FIR filters amasses 92 Google Scholar citations but only 1 Mendeley bookmark according to ReaderMeter! But there are two entries for the article on Mendeley: one with one reader and another with 21 readers.

I cannot find On clustering fMRI time series in ReaderMeter. It got 44 “Readers on Mendeley” and 217 Google Scholar citations.

ReaderMeter is interesting, but its present version seems to suffer from a few teething problems before it can be a fully fledged, recommendable service (at least for the papers I examined). Microsoft Academic Search has an ‘Edit’ button where you can merge authors, merge publications and do a few other things (you need to sign in with a Live ID to gain this functionality). ReaderMeter may very well improve if it implements similar functions. It is unclear to me how open Microsoft Academic Search is. Taraborelli’s ReaderMeter is licensed CC BY-SA, so users may be more willing to spend time on merging and disambiguating authors and papers on ReaderMeter.

 

(Update 6 July 2011: When I wrote this blog post I was lazy and didn’t look at the good blogs that had already touched upon the same issues that I mention. Taraborelli himself has a nice blog post, and Rod Page also has a good one. Apart from author name normalization, article deduplication and author disambiguation, Taraborelli also mentions a possible readership selection bias.)