Suppose you want to measure the performance of individual researchers in a university department. Which variables can you get hold of, and how relevant would they be for measuring academic performance?
Here is my take on it:
- Google Scholar citation numbers. Google Scholar records the total number of citations, the h-index and the i10-index, as well as the same numbers for a fixed recent period.
- Scopus citation numbers.
- Twitter. The number of tweets and the number of followers would be relevant.
One issue here is that the number of tweets may not be relevant to academic performance, and it is also susceptible to manipulation. Interestingly, there has been a comparison between Twitter numbers and standard citation counts, with a coefficient relating the two named the Kardashian index.
- Wikidata and Wikipedia presence. Whether Wikidata has an item for the researcher, the number of Wikipedia articles about the researcher, the number of bytes these articles span, and the number of the researcher’s articles recorded in Wikidata. There is an API to get these numbers, and – interestingly – Wikidata can record a range of other identifiers for Google Scholar, Scopus, Twitter, etc., which makes it a convenient open database for keeping track of researcher identifiers across sites of scientometric relevance.
The number of citations in Wikipedia to the work of a researcher would be interesting to have, but is somewhat more difficult to obtain automatically.
The Wikipedia and Wikidata numbers are also somewhat manipulable.
- Stackoverflow/Stackexchange points in relevant areas. The question-and-answer sites under the Stack Exchange umbrella include a range of sites that are of academic interest; in my area, e.g., Stack Overflow and Cross Validated.
- GitHub repositories and stars.
- Publication download counts. For instance, my department has a repository with papers, and the backend keeps track of statistics. The most downloaded papers tend to be introductory material and overviews.
- ResearchGate numbers: Publications, reads, citations and impact points.
- ResearcherID (Thomson Reuters) numbers: total articles in publication list, articles with citation data, sum of times cited, average citations per article, h-index.
- Microsoft Academic Search numbers.
- Count in the dblp computer science bibliography (the Trier database).
- Count of listings in arXiv.
- Counts in Semantic Scholar.
- ACM digital library counts.
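Since Wikidata can act as a hub for several of the identifiers above, the lookup can be sketched in a few lines of Python. This is a minimal sketch that assumes the entity-JSON shape returned by the `wbgetentities` API; the property numbers used (P1960 for Google Scholar author ID, P1153 for Scopus author ID, P2002 for Twitter username) should be verified against current Wikidata:

```python
def external_ids(entity, props=("P1960", "P1153", "P2002")):
    """Pull selected external identifiers out of a Wikidata entity dict,
    i.e., the JSON for one entity as returned by the wbgetentities API.
    Assumed property numbers: P1960 = Google Scholar author ID,
    P1153 = Scopus author ID, P2002 = Twitter username."""
    ids = {}
    for prop in props:
        for claim in entity.get("claims", {}).get(prop, []):
            ids.setdefault(prop, []).append(
                claim["mainsnak"]["datavalue"]["value"])
    return ids
```

Fed with the JSON for a researcher's Wikidata item, this returns a mapping from property number to the recorded identifier strings.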
In a letter to the Danish newspaper Politiken, a group of young researchers (Anders Søgaard, Rebecca Adler-Nissen, Steffen Dalsgaard, Vibe Gedsø Frøkjær, Kristin Veel, Sune Lehmann and Kresten Lindorff-Larsen) wrote against letting business gain too much influence over the universities. Among their arguments was an example involving the Pinski-Narin paper:
“A good example of basic research that has made a huge economic difference is Gabriel Pinski’s and Francis Narin’s article about citation analysis from 1976. That article made possible the PageRank algorithm, which is still used in Google Search. According to some statistics, Google Search may account for 2 percent of the world’s GNP, all because of research into how researchers cite each other’s articles” (translated from Danish)
The specific paper is Citation influence for journal aggregates of scientific publications: theory, with application to the literature of physics published in ‘Information Processing & Management’. In this paper the two researchers set up a citation matrix corresponding to a graph where the nodes are “publishing entities” such as “journals, institutions, individuals, fields of research, geographical subdivisions or levels of research methodology”. They perform an eigenvalue computation to find the ‘influence’ of the publishing entities. The method is demonstrated on the citation network between physics journals.
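The eigenvalue computation can be sketched with a small power iteration. Note this is only the flavour of the method, the leading eigenvector of a citation matrix normalised by each entity's outgoing references; the actual Pinski-Narin influence weights use a somewhat different normalisation:

```python
def influence_weights(C, iters=100):
    """Sketch of an eigenvector-based influence measure.
    C[i][j] is the number of citations from publishing entity i
    to publishing entity j.  Each entity's citations are divided by
    its total outgoing references, and the leading eigenvector of the
    resulting matrix is found by power iteration."""
    n = len(C)
    rowsums = [sum(row) or 1 for row in C]   # avoid division by zero
    w = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(C[i][j] / rowsums[i] * w[i] for i in range(n))
               for j in range(n)]
        s = sum(new) or 1
        w = [x / s for x in new]             # renormalise each step
    return w
```

On a small citation matrix among, say, three journals, the weights come out positive and sum to one, with more-cited entities receiving larger influence.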
Is Pinski-Narin basic research, and did it influence Brin and Page for PageRank?
Interestingly, the two researchers were not university researchers. Pinski and Narin worked at the company “Computer Horizons, Inc.” as President and Research Advisor according to the information in the article, supported by the National Science Foundation.
The Pinski-Narin paper is cited by Jon Kleinberg in his Hubs, authorities, and communities paper from December 1999. Pinski-Narin is also cited by Kleinberg’s Authoritative Sources in a Hyperlinked Environment that was published as an IBM research report in May 1997, i.e., a company report.
Brin’s and Page’s famous article The anatomy of a large-scale hypertextual Web search engine (written while they were students at Stanford University) makes no mention of Pinski and Narin. So were they not aware of it? Initially I thought so.
However, Brin’s and Page’s paper cites Kleinberg’s ‘Authoritative Sources in a Hyperlinked Environment’, which has information about Pinski-Narin, so if Brin and Page read Kleinberg’s paper they must have known about Pinski-Narin, at least in the latter part of 1997.
The Brin-Page paper is from the Seventh International World Wide Web Conference, which was held in April 1998 with a submission deadline in December 1997. The tracing of PageRank leads further back to Lawrence Page’s patent US 6285999, with a filing date in January 1998 and a priority date in January 1997. This patent has a citation to Pinski-Narin. It is not clear when the citation was added to the patent; I suppose it could have been anywhere between the writing process leading up to the priority date in 1997 and the publication date in 2001. I have not been able to find information about whether the Pinski-Narin paper influenced Page towards PageRank, but by late 1997 they must have been aware of the paper, so it is not at all unlikely that they were inspired by it. However, as an argument for keeping business out of universities, the PageRank/Pinski-Narin issue seems a poor example, because Pinski and Narin came from a company.
The entire field of scientometrics has depended quite heavily on data from the Science Citation Index (SCI), a database from the company ‘Institute for Scientific Information’. Indeed, the Pinski-Narin paper used data from SCI. The scientometrics field is still dominated by commercial interests: Thomson Reuters now owns SCI, Elsevier has Scopus, and Google has Google Scholar. Also note that CiteSeer/ResearchIndex was developed not by a university, but by the American research branch of the Japanese company NEC. And in turn (according to Wikipedia) SCI was “heavily influenced” by the non-academic Shepard’s Citations.
Interestingly, Massimo Franceschet has written on the history of PageRank in “PageRank: Standing on the shoulders of giants”, tracing it back to Wassily W. Leontief in 1941. Wikipedia’s PageRank article also mentions Yanhong Li‘s work Toward a qualitative search engine and US patent 5920859. At the time Li worked for the company “GARI Software/IDD Information Services”; he later cofounded Baidu.
It may be worth noting the lack of references in the Pinski-Narin paper. It has no citation to, e.g., Leo Katz’ 1953 paper or Leontief. Perhaps they were unaware of the research in these other areas?
Although PageRank can be said to depend on university-based basic research, such as that of the German-speaking mathematicians Oskar Perron, Ferdinand Georg Frobenius and Richard von Mises, the work in the Computer Horizons company is not an example of university-based basic research.
One final note: though some academics may see PageRank as an example of basic numerical research yielding a company of great economic value, I see it as only one component of the Google success. The application of low-cost Linux computers together with a non-intrusive, quick-responding interface may well explain more of the success. Linux, though inspired by the academic MINIX, is mostly a non-academic endeavor.
I typed many of the publications from our Responsible Business in the Blogosphere project into the Brede Wiki, together with a Google Scholar identifier for each publication. With a script (do not run code you obtained from a wiki!) I am able to collect the citation information from Google Scholar using the MediaWiki API, the categories and the templates. Here they are sorted according to the number of citations.
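The sorting step of such a script can be sketched as follows. This is a hypothetical sketch: the template field name `gscholar_citations` is an assumption for illustration, not the actual field used in the Brede Wiki, and fetching the wikitext via the MediaWiki API is left out:

```python
import re

def citation_counts(pages, field="gscholar_citations"):
    """Extract a numeric template field (hypothetical name) from
    (title, wikitext) pairs and sort by the count, highest first.
    The wikitext would come from the MediaWiki API in practice."""
    pattern = re.compile(r"\|\s*%s\s*=\s*(\d+)" % re.escape(field))
    counts = []
    for title, text in pages:
        match = pattern.search(text)
        if match:
            counts.append((title, int(match.group(1))))
    return sorted(counts, key=lambda tc: tc[1], reverse=True)
```

Applied to the publication pages, this yields the list of papers ordered by their recorded Google Scholar citation counts.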
Google introduced (was it a few weeks ago?) a new version of Google Scholar where you as a scientist can claim your name and the scientific papers that you have authored. Previously you could just search, e.g., to get your papers listed, see my previous blog post. However, if you have a common name, e.g., “J. Larsen”, you run into the problem that your publications get entangled with the publications of other people called “J. Larsen” or “RJ Larsen” or “JC Larsen”, etc. With the new system it almost seems that Google does co-author mining, so they are better able to distinguish similarly named authors. Furthermore, and most importantly, with a Google Scholar account you can claim your papers, which solves the ambiguity problem, and you can add and merge papers. Editing functionality was already present in CiteSeer long ago (if I remember correctly), and in Microsoft Academic Search you can also edit the publication list.
The new Google Scholar functionality seems not to be that good at discovering new relevant papers, e.g., the papers that cite you; there the old-fashioned Google Scholar email alert seems better. What it does provide is a nice overview for h-index junkies. The number is automatically computed and makes Google Scholar a serious competitor to the pay-walled ISI Web of Science.
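For the record, the h-index that these profiles display is simple to compute from a list of per-paper citation counts:

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1)
               if cites >= rank)
```

For example, a researcher with papers cited 10, 8, 5, 4 and 3 times has an h-index of 4. The same function applies unchanged to bookmark counts, which is essentially what ReaderMeter (below) does with Mendeley data.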
For those into social web metrics, here is yet another one: ReaderMeter! It is based on bookmark data from the research paper sharing social web site Mendeley, and ReaderMeter computes a bookmark-based h-index to evaluate one’s readership impact. Thus it is mostly for scientists. The ReaderMeter service was built by Dario Taraborelli, who also co-organized the altmetrics workshop recently held in Koblenz. He is now at the Wikimedia Foundation and furthermore co-organizes the WikiViz Wikipedia visualization challenge.
As with Google Scholar and Microsoft Academic Search, ReaderMeter has problems with name variations and disambiguation. I am found under Finn A Nielsen, Finn Arup Nielsen, Finn Nielsen and Finn Aarup Nielsen. Taraborelli assures us that “Spelling variants will be addressed in the next major upgrade.” My “Finn Nielsen” clashes with another “Finn Nielsen”. As with Google Scholar and the Microsoft service, there are also identification issues for individual papers: there are two listed items (1 and 2) with the same DOI in ReaderMeter. Below I have tried to aggregate my papers across naming variations:
So the metrics seem to be somewhat skewed and independent of the Google Scholar citations… Or what? Where is my co-author Cyril Goutte, who wrote highly cited papers 10 years ago? Goutte’s Modeling the hemodynamic response in fMRI using smooth FIR filters amasses 92 Google Scholar citations but only 1 Mendeley bookmark according to ReaderMeter! But there are two entries for the article on Mendeley: one with one reader and another with 21 readers. I cannot find On clustering fMRI time series in ReaderMeter at all. It has 44 “Readers on Mendeley” and 217 Google Scholar citations.

ReaderMeter is interesting, but the present version seems to suffer from a few “child diseases” before it can be a fully fledged recommendable service (at least for the papers I examined). Microsoft Academic Search has an ‘Edit’ button where you can merge authors, merge publications and do a few other things (you need to sign in with a Live ID to gain this functionality). ReaderMeter may very well improve if it implements similar functions. It is unclear to me how open Microsoft Academic Search is. Taraborelli’s ReaderMeter is under CC BY-SA, so users may be more willing to spend time merging and disambiguating authors and papers on ReaderMeter.
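Merging the duplicate Mendeley entries is, in principle, just a matter of summing reader counts per DOI, a sketch of the deduplication that ReaderMeter would need:

```python
def aggregate_readers(entries):
    """Sum reader counts for entries that share a DOI.
    `entries` is a list of (doi, reader_count) pairs,
    possibly with the same DOI listed more than once."""
    totals = {}
    for doi, readers in entries:
        totals[doi] = totals.get(doi, 0) + readers
    return totals
```

With the two Goutte entries mentioned above (1 and 21 readers) this would give 22 readers for the single underlying paper. The harder part, of course, is matching entries whose DOIs differ or are missing.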
(Update 6 July 2011: When I wrote this blog post I was lazy and didn’t look at the good blogs that have already touched upon the same issues that I mention. Taraborelli himself has a nice blog post and Rod Page also has a good one. Apart from author name normalization, article deduplication and author disambiguation, Taraborelli also mentions a possible readership selection bias.)
I have previously blogged about the Milena Penkowa case that has entertained the Danish research community in the first half of 2011. If you want an English update, there is an overview in the April article Penkowa for dummies. One of the latest to jump on the Penkowa-bashing wagon is geologist Peter Riisager. Back in March he looked at the self-citations of Penkowa and reported it on his blog. He found that 54% of Penkowa’s citations were her own. The story was picked up a couple of weeks ago by the university newspaper (in Danish and English) as well as by a Danish science website. Having found that Penkowa has over 50% self-citations, Riisager links to a Nature blogger who claims that “Bad guys have > 50% self-citations” and “good guys have self-citations as < 50% of total cites (I [Brian Derby] am at 25%)”. QED: Penkowa is a bad guy.

But are Riisager (and blogger Brian Derby) right? I cannot find out which method he used, and 50% self-citations sounds like rather a lot. How can we investigate this further? Well, here is my methodology: I use ISI Web of Science, search on an author, and press “Create Citation Report” to get the number of articles the author has written (“Results found”) and the number of citations (“Sum of the Times Cited”). For the number of non-self citations I press “View without self-citations” and read off “Results: ” in the upper left corner of the web page.

Is that an OK procedure? Nah. I think the problem is that “Sum of the Times Cited” refers to the number of citations, while “View without self-citations” refers to the number of papers with citations excluding self-citations. What we should (also) do is get the number of papers with citations (“View Citing Articles”). The problem is that there can be multiple citations in each citing paper. What we would really like to have is the number of citations without self-citations, but I don’t know how to get that number from ISI Web of Science.
Below I have attempted a count for Milena Penkowa, Peter Riisager, myself and big-shot neuroimaging analyzer Karl J. Friston. The “self-citation rate (A)” is computed in what I believe is the wrong way, (Citations − Papers with non-self citations)/Citations, while the “self-citation rate (B)” is computed from the number of citing papers, (Papers with citations − Papers with non-self citations)/(Papers with citations).
| Author | Papers | Citations | Papers with citations | Papers with non-self citations | Self-citation rate (A) | Self-citation rate (B) |
|---|---|---|---|---|---|---|
In his blog post from 8 March 2011, Riisager writes that Penkowa has a total of 2,401 citations of which 1,296 are self-citations. With my “wrong” methodology I get 2481 − 1179 = 1302 self-citations, pretty close to Riisager’s numbers. So is Riisager mixing up the units, papers and citations? Or how did he get his numbers? The “wrong” (A)-way of computing the self-citation rate seems way off: if you take the (A) self-citation rate of Friston you get 46%. This seems to be an outrageous rate. Surely 46% of Friston’s many citations are not generated by himself. That would put him near Brian Derby’s “bad guy”… As long as we do not have the number of citations without self-citations, only the number of papers with citations without self-citations, we can only use the latter. And if we now look at Penkowa’s self-citation rate it is not over 50% but rather 6.5%. That value is actually lower than the self-citation rate I compute for Peter Riisager! So who is laughing now?

I must admit I am not completely sure about my methodology. To investigate the issue fully one may need to download all the papers and count the citations, so we can understand the ISI Web of Science values. My (B)-method gives me a self-citation rate of 2.9%. I think I have a higher number of self-citations on Google Scholar, as Google Scholar indexes all my slides. As I tend to reference myself on the slides, my number of citations gets boosted, and this may partially explain why my Google Scholar h-index is higher than my ISI Web of Science h-index.
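The two rates (A) and (B) can be written out as a small sketch, using the Penkowa numbers quoted above (2,481 citations and 1,179 papers with non-self citations from my Web of Science search); the (B) example uses made-up paper counts purely for illustration:

```python
def self_citation_rate_a(citations, papers_nonself):
    """Rate (A): the suspect formula -- it subtracts a paper count
    from a citation count, mixing units."""
    return (citations - papers_nonself) / citations

def self_citation_rate_b(papers_cited, papers_nonself):
    """Rate (B): stays in units of papers with citations."""
    return (papers_cited - papers_nonself) / papers_cited

# Penkowa numbers from the text above:
self_cites_a = 2481 - 1179               # 1302, close to Riisager's 1296
rate_a = self_citation_rate_a(2481, 1179)  # about 0.52, i.e., over 50%
```

Rate (A) reproduces the "over 50%" figure, which is exactly why mixed units make it suspect; rate (B) on the corresponding paper counts stays in one unit and comes out far lower.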
(2012-03-07: language correction)