Do we have a final schema for Wikicite?


No, Virginia, we do not have a final schema for Wikicite IMHO.

Wikicite is a project that focuses on sources in the Wikimedia universe. Currently, the most active part of Wikicite is the setup of bibliographic data from scientific articles in Wikidata with the tools of Magnus Manske, the Fatameh-duo and the GeneWiki people. In particular, James Hare, Daniel Mietchen and Magnus Manske have been active in the automatic and semi-automatic setup of data. Jakob Voß’ statistics say that, as of mid-October 2017, we have metadata from almost 10 million publications in Wikidata and have recorded over 36 million citations between the described works.

Given that so many bibliographic items have been set up in Wikidata, it may be worth asking whether we actually have a schema for this data. While we surely have a convention of sorts that tools and editors follow, it is not complete and is probably up for change.

Here are some Wikicite-related schema issues:

  1. What instance is a scientific article? Most tools use instance of Q13442814, currently “scientific article” in English. But what is this? In English, “scientific” means something different from the usual translation into Danish (“videnskabelig”) or German (“wissenschaftlicher”), and these words are used in the labels of Q13442814. “Scientific” usually only entails natural science, leaving out social science and the humanities (while “videnskabelig”/“wissenschaftlicher” entails social science and humanities too). An attempt to fix this problem is to call these articles “scholarly articles”. It is interesting to think that one of the most used classes in Wikidata, if not the most used class, has a language ambiguity. I see no reason to restrict Q13442814 to only the English sense of science. It is all too difficult to distinguish between scientific disciplines: think of computational humanities.
  2. What about the ontology of scientific work? Currently, Q13442814 is set as a subclass of academic journal articles, but this is not how we use it, as conference articles in proceedings are also set to Q13442814. Is a so-called abstract a “scientific article”? “Abstracts” appear, e.g., in neuroimaging conferences, where they are fully referenceable items published in proceedings or supplementary journal issues.
  3. What do the instances of scientific article in Wikidata describe? A work or an edition? What happens if the article is reprinted (it happens to important work)? Should we then create a new item? Or amend the old item? If we create a new item, then how should we link the two? Should we create a third item as a work item? Should papers in preprint archives have their own item? Should that depend on whether the preprint version and the canonical version are more or less the same?
  4. How do we represent the language of an article? There are two generally used properties: original language of work and language of the work. There is a discussion about deleting one of them.
  5. How do we represent an author? Today an author can be linked to the article via the P50 property. However, the author label may be different from the name written in the article (we may refer to this issue as the “Natalie Portman Problem”, as she published a scientific article as “Natalie Hershlag”). P1932 as a qualifier to P50 may be used to capture the way the name is represented in the article, which is a possible solution. Recently, Manske’s author name resolver has started to copy the short author name to the qualifier under P50. For referencing, there is still the problem that the referencing software would likely need to determine the surname, and this is not trivial for authors with suffixes and Spanish authors with multiple surnames.
  6. How do we record the affiliation of a paper? Publicly funded universities and other research entities would like to make statistics on, for instance, their paper production, but this is not possible to do precisely with today’s Wikidata, as papers are usually not affiliated with organizations directly, only indirectly through the author affiliation. And the author affiliation might change as the author moves between different institutions. We already noted this problem in the first article we wrote about Scholia.
  7. How do we record the type of scientific publication? There are various subtypes, e.g., systematic review, original article, erratum, “letter”, etc. Or the state of the article: submitted, under review, peer-reviewed, not peer-reviewed. The “genre” and the “instance of” properties can be used, but I have seen no ruling convention.
  8. How do we record what software and which datasets have been used in the article, e.g., for digital preservation? Currently, we are using “used” (P2283). But should we have dedicated properties, e.g., “uses software”? Do we have a schema for datasets and software?
  9. How do we record the formatting of the title, e.g., case? Bibliographic reference management software may choose to capitalize some words. In BibTeX you have the possibility to format the title using LaTeX commands. Detailed formatting of titles in Wikidata is currently not done, and I do not believe we have dedicated properties to handle such cases.
  10. How do we manage journals that change titles? For instance, for BMJ we have several items covering the name changes: Q546003, Q15746654, and Q28464921. Is this how we should do it? There is the P156 property to connect subsequent versions.
  11. How should we handle series of conference proceedings? A particular article can be “published in” a proceedings, and such a proceedings may be part of a “series” that is a “conference proceedings series”. However, according to my recollection, one or more Wikidata bots may link articles directly as “published in” the conference proceedings series, since such series can have ISSNs and look like ordinary scientific journals.
  12. When is an article published? You have a number of publishers setting a formal publication date in the future for an article that is actually published prior to that formal date. In Wikidata there is to my knowledge only a single property for publication date. Preprints yield other publication dates.
  13. A minor issue is P820, arXiv classification. According to the documentation, it should be used as a qualifier to P818, the arXiv identifier property. Embarrassingly, I overlooked that, and the Scholia arXiv extraction program and QuickStatements generator output it as a property in its own right.
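
For issue 13, the difference between the two forms is easy to show. The following is a minimal Python sketch, not the actual Scholia code: it builds a single tab-separated QuickStatements line (version 1 syntax) in which P820 follows the P818 value as a qualifier. The function name `arxiv_quickstatement`, the item (Q4115189, the Wikidata sandbox), the arXiv identifier and the classifications are all placeholders chosen for illustration.

```python
def arxiv_quickstatement(item, arxiv_id, classifications):
    """Build one QuickStatements V1 line with P820 as qualifier(s) to P818.

    Hypothetical helper for illustration; string values are wrapped in
    double quotes and fields are separated by tabs.
    """
    # Main statement: item, property, value.
    fields = [item, "P818", f'"{arxiv_id}"']
    # Each classification is appended as a qualifier pair on the same line.
    for classification in classifications:
        fields.extend(["P820", f'"{classification}"'])
    return "\t".join(fields)


# Placeholder values: sandbox item and a made-up arXiv identifier.
line = arxiv_quickstatement("Q4115189", "0000.00000", ["cs.DL", "cs.IR"])
print(line)
```

Putting P820 on a line of its own (item, P820, value) would instead create a separate statement on the item, which is the behaviour described in issue 13.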

Update:

Do we have a schema for datasets and software? Well, yes, Virginia. For software Katherine Thornton & Co. have produced Modeling the Domain of Digital Preservation in Wikidata.
