NeurIPS in Wikidata

Co-authors in the NeurIPS 2019 conference based on data in Wikidata. Screenshot based on Scholia at https://tools.wmflabs.org/scholia/event/Q61582928.

The machine learning and neuroinformatics conference NeurIPS 2019 (NIPS 2019) takes place in the middle of December 2019. The conference series has always had a high standing and has grown considerably in reputation in recent years.

All papers from the conference are available online at papers.nips.cc. There is, to my knowledge, little structured metadata associated with the papers, though the website is consistently formatted in HTML, so the metadata can be scraped relatively easily. I know of no consistent identifiers for the papers on the site: no DOI, no ORCID iD or anything else. A few papers may be indexed here and there on third-party sites.

In the Scholia Python package, I have made a module that scrapes the papers.nips.cc website and converts the metadata to the QuickStatements format for Magnus Manske’s web application, which submits the data to Wikidata. The entry of the basic metadata about the papers from NeurIPS is more or less complete, though a check is needed to see whether everything has been entered. One issue that the Python code attempts to counter is the case where a scraped paper is already entered in Wikidata. Given that there are no identifiers, the match attempt is somewhat heuristic.
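
The conversion step can be illustrated with a minimal sketch. It is not the actual Scholia module: the function names and the shape of the `paper` dictionary are assumptions made for illustration, and the title normalization is only one possible heuristic for spotting papers that are already in Wikidata. The property and item identifiers (P31, Q13442814, P1476, P2093, P1545, P953) are the ones commonly used for scholarly articles.

```python
import re
import unicodedata


def normalize_title(title):
    """Reduce a title to a crude matching key (hypothetical heuristic)."""
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase and collapse every run of non-alphanumerics to one space.
    return re.sub(r"[^a-z0-9]+", " ", ascii_title.lower()).strip()


def already_known(title, known_titles):
    """Return True if the title matches a known title after normalization."""
    return normalize_title(title) in {normalize_title(t) for t in known_titles}


def paper_to_quickstatements(paper):
    """Render one scraped paper as tab-separated QuickStatements V1 commands.

    `paper` is an assumed dictionary with the keys "title", "authors"
    (a list of name strings) and "url".
    """
    title = paper["title"].replace('"', "'")  # avoid breaking the quoting
    url = paper["url"]
    rows = [
        ["CREATE"],
        ["LAST", "P31", "Q13442814"],        # instance of: scholarly article
        ["LAST", "Len", f'"{title}"'],       # English label
        ["LAST", "P1476", f'en:"{title}"'],  # title (monolingual text)
        ["LAST", "P953", f'"{url}"'],        # full work available at URL
    ]
    for ordinal, name in enumerate(paper["authors"], start=1):
        # Author name string, qualified with the series ordinal.
        rows.append(["LAST", "P2093", f'"{name}"', "P1545", f'"{ordinal}"'])
    return "\n".join("\t".join(row) for row in rows)
```

Authors are written as name strings (P2093) rather than as links to items, which matches the way the scraped data initially ends up in Wikidata.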

Authors have separate webpages on papers.nips.cc with a listing of their papers published at the conference. This is quite well-curated, though I have discovered authors who have several webpages associated with them: The Bayesian Carl Edward Rasmussen is under http://papers.nips.cc/author/carl-edward-rasmussen-1254, https://papers.nips.cc/author/carl-e-rasmussen-2143 and http://papers.nips.cc/author/carl-edward-rasmussen-6617. Joshua B. Tenenbaum is also split.
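
One way to find candidate split pages is to group author URLs by a crude key. The sketch below is hypothetical and not part of Scholia: it assumes the URL pattern seen in the examples above (a name slug followed by a numeric identifier) and uses first initial plus surname as the key. Such a key will also merge distinct people who share an initial and a surname, so the groups are only candidates for manual review.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def author_key(url):
    """Return (first initial, surname) from an author page URL.

    Assumes slugs of the form "given-names-surname-1234".
    """
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    slug = re.sub(r"-\d+$", "", slug)  # drop the trailing numeric identifier
    parts = slug.split("-")
    return (parts[0][:1], parts[-1])


def group_author_pages(urls):
    """Group URLs that share a key and keep only groups with several pages."""
    groups = defaultdict(list)
    for url in urls:
        groups[author_key(url)].append(url)
    return {key: pages for key, pages in groups.items() if len(pages) > 1}


pages = [
    "http://papers.nips.cc/author/carl-edward-rasmussen-1254",
    "https://papers.nips.cc/author/carl-e-rasmussen-2143",
    "http://papers.nips.cc/author/carl-edward-rasmussen-6617",
]
candidates = group_author_pages(pages)
```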

Authors are not resolved by the code from Scholia; they are just represented as strings. The Author Disambiguator tool that Arthur P. Smith has built from a tool by Magnus Manske can semi-automatically resolve authors, i.e., associate the author of a paper with a specific Wikidata item representing a human. The Scholia web site has particular pages (“missing”) that make contextual links to the Author Disambiguator. For the NeurIPS 2019 proceedings the links can be seen at https://tools.wmflabs.org/scholia/venue/Q68600639/missing. There are currently over 1,400 authors that need to be resolved. Some of these are not easy. Multiple authors may share the same name, which happens for some European names such as Andreas Krause, and I have difficulty knowing how unique East Asian names are. So far only 50 authors from the NeurIPS conference have been resolved.

There is no citation information when the data is first entered with the Scholia and QuickStatements tools. There is currently no means to enter that information automatically. NeurIPS proceedings are, as far as I know, not available through CrossRef.

Since there is little editorial control of the format of the references, they come in various shapes and may need “interpretation”. For instance, “Semi-Supervised Learning in Gigantic Image Collections” claims a citation to “[3] Y. Bengio, O. Delalleau, N. L. Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. In NIPS, pages 2197–2219, 2004.” But that is unlikely to be a NIPS paper, and the reference should likely go to Neural Computation.

The ontology of Wikidata for annotating what papers are about is not necessarily good. Some concepts in the cognitive sciences, including psychology, machine learning and neuroscience, become split or merged. Reinforcement learning is an example: the English Wikipedia article focuses on the machine learning aspect, while Wikidata also tags neuroscience-oriented articles with the concept. For many papers I find it difficult to link to the ontology, as the topic of the paper is so specialized that it is difficult to identify an appropriate Wikidata item.

With the data in Wikidata, it is possible to see many aspects of the data with the Wikidata Query Service and Scholia. For instance,

  1. Who has the most papers at NeurIPS 2019? A panel of a Scholia page readily shows this to be Sergey Levine, Francis Bach, Pieter Abbeel and Yoshua Bengio.
  2. The heuristically computed topic scores on the event page for NeurIPS 2019 show that reinforcement learning, generative adversarial network, deep learning, machine learning and meta-learning are central topics this year. (Here one needs to keep in mind that the annotation in Wikidata is incomplete.)
  3. Which Danish researcher has been listed as an author on the most NeurIPS papers through time? This can be asked with a query to the Wikidata Query Service: https://w.wiki/DTp. The answer depends upon what is meant by “Danish”. Here it is based on employment/affiliation and gives Carl Edward Rasmussen, Ole Winther, Ricardo Henao, Yevgeny Seldin, Lars Kai Hansen and Anders Krogh.
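
A question like the first one can be phrased as a SPARQL query against the Wikidata Query Service. The sketch below is not the query behind the Scholia panel or the short link above. It assumes that the NeurIPS 2019 papers are linked to the proceedings item Q68600639 through “published in” (P1433), and it only counts authors that have been resolved to items (P50), so unresolved author name strings are ignored. The function names are hypothetical.

```python
import re
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def top_authors_query(proceedings_qid, limit=10):
    """Build a SPARQL query counting papers per resolved author."""
    if not re.fullmatch(r"Q[1-9]\d*", proceedings_qid):
        raise ValueError(f"not a Wikidata item identifier: {proceedings_qid!r}")
    return f"""
SELECT ?author ?authorLabel (COUNT(DISTINCT ?work) AS ?papers) WHERE {{
  ?work wdt:P1433 wd:{proceedings_qid} ;
        wdt:P50 ?author .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
GROUP BY ?author ?authorLabel
ORDER BY DESC(?papers)
LIMIT {int(limit)}
""".strip()


def query_url(sparql):
    """Return a GET URL that asks the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


# Q68600639 is the NeurIPS 2019 proceedings item mentioned above.
url = query_url(top_authors_query("Q68600639", limit=5))
```

Fetching `url` with any HTTP client should return the counts as JSON. With so few authors resolved so far, the ranking will change as disambiguation proceeds.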
