Latest Event Updates

Find titles of all works published by DTU Cognitive Systems in 2017


Find titles of all works published by DTU Cognitive Systems in 2017! How difficult can that be, to identify all titles of works from a research organization? With Wikidata and the Wikidata Query Service (WDQS) at hand, it shouldn't be that difficult. Nevertheless, I ran into a few hitches:

  1. There is what we can call the "Nathan Churchill Problem": Nathan Churchill was at one point affiliated with our research section Cognitive Systems and wrote papers, e.g., together with our Morten Mørup. One paper clearly identifies him as affiliated with our section. Searching the DTU website yields no homepage for him though. According to a newer paper, he is now at St. Michael's Hospital, Toronto. So is he no longer affiliated with the Cognitive Systems section? That is somewhat difficult to establish with credible and citable sources. If he is not, then any simple SPARQL query on the WDQS for Cognitive Systems papers will yield his new papers, which should not be counted as Cognitive Systems section papers. If we could point to a source indicating that his affiliation with our section has ended, we could add a qualifier to the P1416 property in his Wikidata entry and extend the SPARQL query. What I ended up doing was to explicitly filter out two of Churchill's publications with the ugly line "FILTER(?work != wd:Q42595201 && ?work != wd:Q36384548)". The problem is of course not confined to Churchill. For instance, Scholia currently lists new publications by our Søren Hauberg on the Scholia page for DIKU, – a department where he was previously affiliated. We discussed the affiliation problem a bit in the Scholia paper, see page 253 (page 17).
  2. Datetime datatype conversion with xsd:dateTime. The filter on date is done with this line: "FILTER(?publication_datetime >= "2017-01-01"^^xsd:dateTime)". Something like "FILTER(?publication_datetime >= xsd:dateTime(2017))" does not work.
  3. Missing data. It is difficult to establish how complete the Wikidata listing is for our section with respect to publications. Scraping Google Scholar, PubMed and our local university database of publications could be a possibility, but this is far from streamlined with the tools I have developed.

The full query is listed below and the result is available from this link. Currently, 48 results are returned.

#defaultView:Table
SELECT ?workLabel 
WITH {
  SELECT 
    ?work (MIN(?publication_datetime) AS ?datetime)
  WHERE {
    # Find CogSys work
    ?researcher wdt:P108 | wdt:P463 | wdt:P1416/wdt:P361* wd:Q24283660 .
    ?work wdt:P50 ?researcher .
    ?work wdt:P31 wd:Q13442814 .
    
    # Nathan Churchill seems no longer to be affiliated!?
    FILTER(?work != wd:Q42595201 && ?work != wd:Q36384548)
    
    # Filter to year 2017
    ?work wdt:P577 ?publication_datetime .
    FILTER(?publication_datetime >= "2017-01-01"^^xsd:dateTime)
  }
  GROUP BY ?work 
} AS %results
WHERE {
  INCLUDE %results
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,ja,nl,ru,zh". }
}
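
For completeness, here is a minimal sketch, not part of the original post, of how such a query can be run programmatically with Python and the requests library against the WDQS endpoint. The query is a simplified variant using DISTINCT instead of the GROUP BY/MIN construction, but it is otherwise the same:

import requests

WDQS_URL = "https://query.wikidata.org/sparql"

QUERY = """
SELECT DISTINCT ?workLabel WHERE {
  # Find CogSys works, as in the query above
  ?researcher wdt:P108 | wdt:P463 | wdt:P1416/wdt:P361* wd:Q24283660 .
  ?work wdt:P50 ?researcher ;
        wdt:P31 wd:Q13442814 ;
        wdt:P577 ?publication_datetime .

  # Nathan Churchill seems no longer to be affiliated
  FILTER(?work != wd:Q42595201 && ?work != wd:Q36384548)

  # Filter to year 2017 and onwards
  FILTER(?publication_datetime >= "2017-01-01"^^xsd:dateTime)

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,ja,nl,ru,zh". }
}
"""

response = requests.get(WDQS_URL, params={"query": QUERY, "format": "json"})
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["workLabel"]["value"])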

 


Can you scrape Google Scholar?


With the WikiCite project, the bibliographic information on Wikidata is increasing rapidly, with Wikidata now describing 9.3 million scientific articles and 36.6 million citations. As far as I can determine, most of the work is currently done by James Hare and Daniel Mietchen. Mietchen's Research Bot has made over 11 million edits on Wikidata, while Hare has 15 million edits. For entering data into Wikidata from PubMed, you can basically walk your way through the PMIDs starting from "1" with the Fatameh tool. Hare's reference work can take advantage of a web service provided by the National Institutes of Health. For instance, a URL such as https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=5585223 will return a JSON-formatted result with citation information. This specific URL is apparently what Hare used to set up P2860 citation information in Wikidata, see, e.g., https://www.wikidata.org/wiki/Q41620192#P2860. CrossRef may be another resource.
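
As a sketch, assuming the requests library is installed, the elink call can be made from Python like this; the JSON path below follows the response format as far as I can determine:

import requests

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
params = {
    "dbfrom": "pmc",
    "linkname": "pmc_refs_pubmed",
    "retmode": "json",
    "id": "5585223",
}
response = requests.get(url, params=params)
response.raise_for_status()
data = response.json()

# The cited PubMed identifiers are nested under linksets/linksetdbs/links
for linkset in data.get("linksets", []):
    for linksetdb in linkset.get("linksetdbs", []):
        for pmid in linksetdb.get("links", []):
            print(pmid)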

Beyond these resources, we could potentially use Google Scholar. A former terms of service/EULA of Google Scholar stated: "You shall not, and shall not allow any third party to: […] (j) modify, adapt, translate, prepare derivative works from, decompile, reverse engineer, disassemble or otherwise attempt to derive source code from any Service or any other Google technology, content, data, routines, algorithms, methods, ideas design, user interface techniques, software, materials, and documentation; […] "crawl", "spider", index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof); (m) create or attempt to create a substitute or similar service or product through use of or access to any of the Service or proprietary information related thereto". Here, "create or attempt to create a substitute or similar service" is the stopping point.

The Google Scholar terms document now seems to have been superseded by the all-embracing Google Terms of Service. This document seems less restrictive: "Don't misuse our Services" and "You may not use content from our Services unless you obtain permission from its owner or are otherwise permitted by law." So it may or may not be OK to crawl and/or use/republish the data from Google Scholar. See also one StackExchange question and another StackExchange question.

The Google robots.txt limits automated access with the following relevant lines:

Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholar_share

"/citations?user=" means that you are allowed to access the user profiles with a bot. Google Scholar user identifiers may be recorded in Wikidata with a dedicated property (P1960), so you could automatically access Google Scholar user profiles from the information in Wikidata.
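
Assuming the dedicated property is P1960 ("Google Scholar author ID"), a minimal sketch of going from Wikidata to fetchable Google Scholar profile URLs could look like this:

import requests

QUERY = """
SELECT ?researcherLabel ?gsid WHERE {
  # Researchers with a recorded Google Scholar author identifier
  ?researcher wdt:P1960 ?gsid .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY, "format": "json"})
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    # Build the profile URL that the robots.txt above permits bots to fetch
    print(row["researcherLabel"]["value"],
          "https://scholar.google.com/citations?user=" + row["gsid"]["value"])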

So, if there is some information you can get from Google Scholar, is it worth it?

The Scholia code now includes a googlescholar.py module with some preliminary Google Scholar processing attempts. There is command-line-based scraping of a researcher profile. For instance,

python -m scholia.googlescholar get-user-data gQVuJh8AAAAJ

It ain't working too well. As far as I can determine, you need JavaScript-driven paging to get more than the initial 20 results (it would be interesting to examine the Publish or Perish software to see how a larger set of results is obtained). Not all bibliographic metadata is available for each item on the Google Scholar page – as far as I see: no DOI, no PubMed identifier. The author list may be abbreviated with an ellipsis ('…'). Matching the Google Scholar items with data already present in Wikidata seems not that straightforward.

It is worth remembering that Wikidata has the P4028 property to link to Google Scholar articles. There ain't many items using it yet though: 31. It was suggested by Vladimir Alexiev back in May 2017, but it seems that I am the only one using the property. Bot access to the link target provided by P4028 is – as far as I can see from the robots.txt – not allowed.

Do we have a final schema for Wikicite?


No, Virginia, we do not have a final schema for Wikicite IMHO.

Wikicite is a project that focuses on sources in the Wikimedia universe. Currently, the most active part of Wikicite is the setup of bibliographic data for scientific articles in Wikidata with the tools of Magnus Manske, the Fatameh duo and the GeneWiki people, and in particular James Hare, Daniel Mietchen and Magnus Manske have been active in automatic and semi-automatic setup of data. Jakob Voß's statistics say that we have – as of mid-October 2017 – metadata from almost 10 million publications in Wikidata and over 36 million recorded citations between the described works.

Given that so many bibliographic items have been set up in Wikidata, it may be worth asking whether we actually have a schema for the setup of this data. While we surely have a sort-of convention that tools and editors follow, it is not complete and probably up for change.

Here are some Wikicite-related schema issues:

  1. What instance is a scientific article? Most tools use instance of Q13442814, currently "scientific article" in English. But what is this? In English, "scientific" means something different from the usual translation into Danish ("videnskabelig") or German ("wissenschaftlicher"), – and these words are used in the labels of Q13442814. "Scientific" usually only entails natural science, leaving out social science and the humanities (while "videnskabelig"/"wissenschaftlicher" entails social science and the humanities too). An attempt to fix this problem is to call these articles "scholarly articles". It is interesting to think that what is one of the most used classes in Wikidata – if not the most used class – has a language ambiguity. I see no reason to restrict Q13442814 to only the English sense of science. It is all too difficult to distinguish between scientific disciplines: think of computational humanities.
  2. What about the ontology of scientific work? Currently, Q13442814 is set as a subclass of academic journal article, but this is not how we use it, as conference articles in proceedings are also set to Q13442814. Is a so-called abstract a "scientific article"? "Abstracts" appear, e.g., in neuroimaging conferences, where they are fully referenceable items published in proceedings or supplementary journal issues.
  3. What are the instances of scientific article in Wikidata describing? A work or an edition? What happens if the article is reprinted (it happens to important works)? Should we then create a new item? Or amend the old item? If we create a new item, then how should we link the two? Should we create a third item as a work item? Should items in preprint archives have their own item? Should that depend on whether the preprint version and the canonical version are more or less the same?
  4. How do we represent the language of an article? There are two generally used properties: original language of work and language of the work. There is a discussion about deleting one of them.
  5. How do we represent an author? Today an author can be linked to the article via the P50 property. However, the author label may be different from the name written in the article (we may refer to this issue as the "Natalie Portman Problem", as she published a scientific article as "Natalie Hershlag"). P1932 as a qualifier to P50 may be used to capture the way the name is represented in the article, – a possible solution; see the query sketch after this list. Recently, Manske's author name resolver has started to copy the short author name to the qualifier under P50. For referencing, there is still the problem that the referencing software would likely need to determine the surname, and this is not trivial for authors with suffixes and Spanish authors with multiple surnames.
  6. How do we record the affiliation of a paper? Publicly funded universities and other research entities would like to compute statistics on, for instance, their paper production, but this is not possible to do precisely with today's Wikidata, as papers are usually not affiliated with organizations, – only indirectly through the author affiliation. And the author affiliation might change as the author moves between institutions. We already noted this problem in the first article we wrote about Scholia.
  7. How do we record the type of scientific publication? There are various subtypes, e.g., systematic review, original article, erratum, "letter", etc. Or the state of the article: submitted, under review, peer-reviewed, not peer-reviewed. The "genre" and the "instance of" properties can be used, but I have seen no ruling convention.
  8. How do we record what software and which datasets have been used in the article, e.g., for digital preservation? Currently, we are using "used" (P2283). But should we have dedicated properties, e.g., "uses software"? Do we have a schema for datasets and software?
  9. How do we record the formatting of the title, e.g., casing? Bibliographic reference management software may choose to capitalize some words. In BibTeX you have the possibility of formatting the title with LaTeX commands. Detailed formatting of titles in Wikidata is currently not done, and I do not believe we have dedicated properties to handle such cases.
  10. How do we manage journals that change titles? For instance, for BMJ we have several items covering the name changes: Q546003, Q15746654, and Q28464921. Is this how we should do it? There is the P156 property to connect subsequent versions.
  11. How should we handle series of conference proceedings? A particular article can be "published in" a proceedings, and such a proceedings may be part of a "series" that is a "conference proceedings series". However, according to my recollection, some (or one?) Wikidata bot may link articles directly as "published in" the conference proceedings series: such series can have ISSNs and look like ordinary scientific journals.
  12. When is an article published? A number of publishers set a formal publication date in the future for an article that is actually available before that formal date. In Wikidata there is, to my knowledge, only a single property for publication date. Preprints yield yet other publication dates.
  13. A minor issue is P820, arXiv classification. According to its documentation, it should be used as a qualifier to P818, the arXiv identifier property. Embarrassingly, I overlooked that, and the Scholia arXiv extraction program and QuickStatements generator output it as a proper property.
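
As a sketch of the qualifier pattern mentioned in issue 5, the following query, which uses the p:/ps:/pq: statement prefixes of the Wikidata Query Service, lists a few works together with the author name as stated in the article:

import requests

QUERY = """
SELECT ?workLabel ?authorLabel ?nameInWork WHERE {
  # Author statements carrying a P1932 ("stated as") qualifier
  ?work p:P50 ?author_statement .
  ?author_statement ps:P50 ?author ;
                    pq:P1932 ?nameInWork .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY, "format": "json"})
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["nameInWork"]["value"], "->", row["authorLabel"]["value"])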

Update:

Do we have a schema for datasets and software? Well, yes, Virginia. For software, Katherine Thornton & Co. have produced "Modeling the Domain of Digital Preservation in Wikidata".

HACK4DK 2017


HACK4DK is an annual event in Copenhagen, bringing together cultural nerds and computer nerds to build interesting things with cultural data. I have been participating since the very beginning and took part in this year's HACK4DK, which took place at ENIGMA, a museum-to-be in Østerbro, Copenhagen.

The winning project among the around 19 projects this year was Tin Toy, a neat augmented reality application using images from the toy collection of Holstebro Museum. I believe they used the AR.js JavaScript library. There is a YouTube video that attempts to capture the attractiveness of the project.

The result of my struggles with the A-Frame JavaScript library is available on this page: https://fnielsen.github.io/hack4dk2017/. Under the name "Virtual Gallery of Denmark", it was supposed to be a virtual reality environment presenting Danish art. The end result became a somewhat less dynamic but meditative environment with textured panels flying around in a virtual space, accompanied by sound from old re-recorded phonographs in the Ruben Collection made available by the Royal Library in Aarhus.

[Screenshot: Virtual Gallery of Denmark]

I did not rely on the data provided at the event, but used data from cultural institutions that had already been uploaded to Wikimedia Commons and whose metadata was described on Wikidata. Both the images of the paintings (which were from Skagens Museum) and the sound were available at Wikimedia Commons and well annotated on Wikidata.

The images were fetched with SPARQL queries to the Wikidata Query Service and API calls to the Wikimedia Commons API, and as such it is fairly easy to change the virtual environment to use other files, which I did afterwards: the Giersing-Bach-Ishizaka-Nielsen virtual environment uses images on Wikimedia Commons where Wikidata records the artist as Harald Giersing. Here the sound is from Kimiko Ishizaka's Open Goldberg Variations project.
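
A minimal sketch of the underlying query pattern, with a placeholder QID for the artist, could look like this:

import requests

ARTIST_QID = "Q1234567"  # hypothetical placeholder: substitute the artist's actual Wikidata item

QUERY = """
SELECT ?paintingLabel ?image WHERE {
  # Works with the given creator (P170) and an image (P18) on Commons
  ?painting wdt:P170 wd:%s ;
            wdt:P18 ?image .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da". }
}
""" % ARTIST_QID

response = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY, "format": "json"})
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    # ?image comes back as a Special:FilePath URL directly usable as a texture
    print(row["paintingLabel"]["value"], row["image"]["value"])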

[Screenshot: the Giersing-Bach-Ishizaka-Nielsen virtual environment]

While A-Frame models are supposed to run straight from the web browser on smartphones, my models seem to have hefty hardware requirements, – the images have quite high resolutions. It takes over 10 seconds on my computer to download all the image and sound files associated with the models. Nevertheless, with a strong computer, a big screen and good headphones, it is quite interesting to view and hear the paintings and sounds fly by.

Danish words for snow


According to Laura Martin, Franz Boas may have been the first to point to the relative richness of Eskimo words for snow ("Eskimo Words for Snow": A Case Study in the Genesis and Decay of an Anthropological Example. American Anthropologist, 88(2):418). Boas listed aput, qana, piqsirpoq, and qimuqsuq. English may have snow, hail, sleet, ice, icicle, slush, and snowflake, as listed in the English Wikipedia article on Eskimo words for snow. There seem to be more than that, e.g., firn. Danish is not (as) polysynthetic as the Eskimo languages, but it has lots of compounds, which makes it possible to create a good number of words for snow. Most of these words derive from sne and is.

Update 2017-09-13: Added skosse.

Word Translation Explanation
bræ large mass of ice
bundis ice at the bottom of the ocean/sea
drivis ice floating on the water, either "havis" or "søis"
firn firn snow older than a year
fastis
flodis ice from a river
fnug snowflake
frostsne snow below freezing, as opposed to tøsne
fygesne drifting snow
gletsjer/gletscher glacier
gletscheris ice in/from a glacier
grå is first stage of “ungis”, according to DMI
gråhvid is second stage of "ungis", according to DMI
hagl hail precipitate with small pellets of ice
haglkorn hailstone small pellet of ice
iglo/igloo igloo
havis sea ice ice in the ocean/sea
indlandsis Indlandsisen is the big “iskappe” in Greenland
is ice frozen water that is (usually) transparent
isbarriere the edge of an “isshelf”, according to DMI
isbjerg
isblok block of ice
isfjeld
isflade sheet of ice
isflage floe
isfront the edge of an "isshelf"
isfod ice frozen to the coast or (second meaning) the ice below the water
iskalot ice-covered area near the poles
iskant the edge of a floe
iskappe ice cap very large connected mass of snow, e.g., the one in Greenland
isklump
isknold
iskorn see also “kornsne”
iskrystal
iskugle
isbræ large mass of ice, the same as “bræ”
islag layer of ice, not the same as “isslag”
isnåle
ispartikel
isrand the edge of a floe
isshelf floating glacier
isskorpe layer of ice on top of water or snow
isskosse
isskruning ice pack
isslag glaze, black ice, freezing rain raindrops below freezing that becomes ice when hitting the ground or structure
isstykke a piece of ice
istap  icicle
isterning ice cube
isvand ice water water with ice in it, usually for drinking
isvæg
isø
julesne  Christmas snow snow falling or lying during Christmas
kalvis
kornsne
kunstsne  artificial snow snow artificially made
lavine avalanche
nysne snow recently falling, as opposed to firn
pakis “drivis” with a high concentration, according to DMI
polaris sea ice that has survived at least one summer melting
puddersne powder snow light snow
rim hard rime “white ice that forms when the water droplets in fog freeze to the outer surfaces of objects.” according to English Wikipedia
sjap slush
sjapis slush ice
skosse
slud sleet a mixture of rain and falling snow
sne snow used about falling snow and snow on the ground
snebold snowball snow formed as a ball, or used to throw in a snowball fight
snebunke  pile of snow
snebyge snow shower
snedrive snowdrift
snedrys small amount of precipitation of snow
snedække layer/cover of snow
snefald
snefnug snowflake
snefog snowdrift
snefygning snow in strong wind
snehule snow formed as a cave for fun or survival, see also “igloo”
snehytte more or less the same as an “iglo”
snekorn snow grain
snekrystal
snelag layer of snow
snemand snowman snow formed as a sculpture of a human
snemark field of snow
snemasse mass of snow
snesjap slush
sneskred avalanche snow falling down a slope
snestjerne
snestorm snowstorm
snevejr snow weather with falling snow
tyndis thin ice
søis “lake ice”
tøris (“tøris” is usually “dry ice”)
tøsne melting snow snow that is melting
ungis Sea ice between “tyndis” and “vinteris”, according to DMI
vinteris

Some information about Scholia


[Figure: co-author-normalized citations plot from the Scholia paper]

Scholia is mostly a web service, developed in an open-source fashion on GitHub at https://github.com/fnielsen/scholia. It was inspired by discussions at the WikiCite 2016 meeting in Berlin. Anyone can contribute as long as their contribution is under the GPL.

I started to write the Scholia code back in October 2016, according to the initial commit at https://github.com/fnielsen/scholia/commit/484104fdf60e4d8384b9816500f2826dbfe064ce. Since then, particularly Daniel Mietchen and Egon Willighagen have joined in, and Egon has lately been quite active.

Users can download the code and run the web service from their own computer if they have a Python Flask development environment. Otherwise the canonical web site for Scholia is https://tools.wmflabs.org/scholia/ which anyone with an Internet connection should be able to view.

So what does Scholia do? The initial "application" was a "static" web page with a researcher profile/CV of myself based on data extracted from Wikidata. It is still available at http://people.compute.dtu.dk/faan/fnielsenwikidata.html. I added a static page for my research section, DTU Cognitive Systems, showing scientific paper production and a coauthor graph. It is available here: http://people.compute.dtu.dk/faan/cognitivesystemswikidata.html.

The Scholia web application was an extension of these initial static pages, so that a profile page for any researcher or any organization could be made on the fly. And it is now no longer just authors and organizations that have a profile page, but also works, venues (journals or proceedings), series, publishers, sponsors (funders) and awards. There are also "topics" and individual pages showing specialized information about chemicals, proteins, diseases and biological pathways. A rudimentary search interface is implemented.

The content of the Scholia web pages, with plots and tables, is generated from queries to the Wikidata Query Service, – the extended SPARQL endpoint provided by the Wikimedia Foundation. We also pull in text from the introduction of the corresponding articles in the English Wikipedia. We modify the table output of the Wikidata Query Service so that individual items displayed in table cells link back to other items in Scholia.
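
As a sketch of the Wikipedia part, introduction text can, for instance, be pulled from the Wikipedia REST summary endpoint; whether Scholia uses this exact endpoint is my assumption here, not a claim:

import requests

def wikipedia_intro(title):
    """Return the introductory extract for an English Wikipedia article."""
    url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + title
    response = requests.get(url)
    response.raise_for_status()
    return response.json().get("extract", "")

print(wikipedia_intro("Wikidata"))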

Egon Willighagen, Daniel Mietchen and I have described Scholia and Wikidata for scientometrics in the 16-page workshop paper "Scholia and scientometrics with Wikidata", https://arxiv.org/pdf/1703.04222.pdf. The screenshots shown in the paper have been uploaded to Wikimedia Commons. These and other Scholia media files are available in the category page https://commons.wikimedia.org/wiki/Category:Scholia.

Working with Scholia has been a great way to explore what is possible with SPARQL and Wikidata. One plot that I like is the "co-author-normalized citations per year" plot on the organization pages. There is an example on this page: https://tools.wmflabs.org/scholia/organization/Q24283660. Here the citations to works authored by authors affiliated with the organization in question are counted and organized in a colored bar chart by year of publication, – and normalized by the number of coauthors. The colored bar charts have been inspired by the "LEGOLAS" plots of Shubhanshu Mishra and Vetle Torvik.
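
A small sketch of the counting behind such a plot, assuming that normalization means each citation contributes one over the number of authors of the cited work:

from collections import defaultdict

def normalized_citations_per_year(works):
    """works: list of dicts with 'year', 'n_authors' and 'n_citations'."""
    counts = defaultdict(float)
    for work in works:
        # Each citation counts as 1/(number of authors of the cited work)
        counts[work["year"]] += work["n_citations"] / work["n_authors"]
    return dict(counts)

works = [
    {"year": 2016, "n_authors": 3, "n_citations": 12},
    {"year": 2017, "n_authors": 2, "n_citations": 5},
    {"year": 2017, "n_authors": 5, "n_citations": 10},
]
print(normalized_citations_per_year(works))  # {2016: 4.0, 2017: 4.5}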

Part of the Python Scholia code also works as a command-line script for reference management in the LaTeX/BibTeX environment using Wikidata as the backend. I have used this Scholia scheme for a couple of scientific papers I have written in 2017. The particular script is currently not well developed, so users would need to be indulgent.

Scholia relies on users adding bibliographic data to Wikidata. Tools from Magnus Manske are a great help, as are the Fatameh tool of "T Arrow" and "Tobias1984" and the WikidataIntegrator of the GeneWiki people. Daniel Mietchen, James Hare and a user called "GZWDer" have been very active in adding much of the science bibliographic information, and we are now past 2.3 million scientific articles on Wikidata. You can count them with this link: https://tinyurl.com/yaux3uac

My h-index as of June 2017: Coverage of researcher profile sites


The coverage of the different researcher profile sites and their citation statistics varies. Google Scholar seems to be the site with the largest coverage, – it even crawls and indexes my slides. The open Wikidata is far from that, but may be the only one with machine-readable free access and advanced search.

Below are the citation statistics in the form of the h-index from six different services.

h Service
28 Google Scholar
27 ResearchGate
22 Scopus
22(?) Semantic Scholar
18 Web of Science
8 Wikidata

Semantic Scholar does not give an overview of the citation statistics, and the counts are somewhat hidden on the individual article pages. I attempted as best as I could to determine the value, but it might be incorrect.
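
For reference, the h-index is the largest h such that at least h works have h or more citations each. A small helper for computing it from a list of per-article citation counts, for instance collected by hand from Semantic Scholar's article pages:

def h_index(citations):
    """Compute the h-index from an iterable of citation counts."""
    sorted_counts = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(sorted_counts, start=1):
        # The h-index is the last rank where the count still reaches the rank
        if count >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with at least 4 citations each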

I made similar statistics on 8 May 2017 and reported them on the slides Wikicite (page 42). During the one and a half months since that count, the statistic for Scopus has changed from 20 to 22.

Semantic Scholar is run by the Allen Institute for Artificial Intelligence, a non-profit research institute, so they may be interested in opening up their data for search. An API does, to my knowledge, not (yet?) exist, but they have a gentle robots.txt. It is also possible to download the full Semantic Scholar corpus from http://labs.semanticscholar.org/corpus/. (Thanks to Vladimir Alexiev for bringing my attention to this corpus.)