“Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.”
From Peter Brodersen I hear that the budget of the Danish government for next year allocates funds to Dansk Sprognævn for the release of Retskrivningsordbogen – the official Danish spelling dictionary.
It is mentioned briefly in an announcement from the Ministry of Culture: “Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.” (And from 2018 onwards, 0.5 million DKK is earmarked for Dansk Sprognævn to free Retskrivningsordbogen.) That is, 500,000 DKK allocated for the release of the dataset.
It is not clear under which conditions it will be released. An announcement from Dansk Sprognævn says “til sprogteknologiske formål” (for language technology/natural language processing purposes). I trust it is not just for natural language processing purposes, – but for every purpose!?
If it is to be used in free software/databases then a CC0 or better license is a good idea. We are still waiting for Wikidata for Wiktionary, the as-yet vaporware with a multilingual, collaborative and structured dictionary. This resource is CC0-based. The “old” Wiktionary has surprisingly not been used that much by natural language processing researchers, perhaps because of the anarchistic structure of Wiktionary. Wikidata for Wiktionary could hopefully help us with structuring lexical data and improve the size and the utility of lexical information. With Retskrivningsordbogen as CC0 it could be imported into Wikidata for Wiktionary and extended with multilingual links and semantic markup.
With the WikiCite project, the bibliographic information on Wikidata is increasing rapidly, with Wikidata now describing 9.3 million scientific articles and 36.6 million citations. As far as I can determine, most of the work is currently done by James Hare and Daniel Mietchen. Mietchen’s Research Bot has made over 11 million edits on Wikidata, while Hare has 15 million edits. For entering data into Wikidata from PubMed you can basically walk your way through PMIDs starting with “1” using the Fatameh tool. Hare’s reference work can take advantage of a webservice provided by the National Institutes of Health. For instance, a URL such as https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=5585223 will return a JSON-formatted result with citation information. This specific URL is apparently what Hare used to set up P2860 citation information in Wikidata, see, e.g., https://www.wikidata.org/wiki/Q41620192#P2860. CrossRef may be another resource.
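As a small sketch in Python of how such a lookup could be done programmatically (using the requests library; the JSON field names in the loop reflect my reading of the E-utilities output and should be double-checked):

import requests

# Ask the NIH E-utilities elink service for the PubMed identifiers cited
# by a given PubMed Central article (here PMC5585223, as in the URL above).
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
params = {
    "dbfrom": "pmc",
    "linkname": "pmc_refs_pubmed",
    "retmode": "json",
    "id": "5585223",
}
data = requests.get(url, params=params).json()

# Walk through the nested result and print the cited PubMed identifiers.
for linkset in data.get("linksets", []):
    for linksetdb in linkset.get("linksetdbs", []):
        for pmid in linksetdb.get("links", []):
            print(pmid)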
Beyond these resources, we could potentially use Google Scholar. A former terms of service/EULA of Google Scholar stated that: “You shall not, and shall not allow any third party to: […] (j) modify, adapt, translate, prepare derivative works from, decompile, reverse engineer, disassemble or otherwise attempt to derive source code from any Service or any other Google technology, content, data, routines, algorithms, methods, ideas design, user interface techniques, software, materials, and documentation; […] “crawl”, “spider”, index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof); (m) create or attempt to create a substitute or similar service or product through use of or access to any of the Service or proprietary information related thereto“. Here, “create or attempt to create a substitute or similar service” is a stopping point.
The Google Scholar terms document now seems to have been superseded by the all-embracing Google Terms of Service. This document seems less restrictive: “Don’t misuse our Services” and “You may not use content from our Services unless you obtain permission from its owner or are otherwise permitted by law.” So it may or may not be OK to crawl and/or use/republish the data from Google Scholar. See also a StackExchange question and another StackExchange question.
The Google robots.txt limits automated access with the following relevant lines:
Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholar_share
The “Allow: /citations?user=” line means that bots are allowed to access the user profile pages. Google Scholar user identifiers may be recorded in Wikidata with a dedicated property, so you could automatically access Google Scholar user profiles from the information in Wikidata.
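To check a specific URL against these rules from Python, the standard library’s robotparser can be used. Note that Python’s parser applies the rules with simple first-match semantics, so it may be more conservative than Google’s own longest-match interpretation of Allow/Disallow; the sketch below also assumes that scholar.google.com serves the rules quoted above:

from urllib.robotparser import RobotFileParser

# Read the robots.txt and test a couple of Google Scholar URLs against it.
robots = RobotFileParser()
robots.set_url("https://scholar.google.com/robots.txt")
robots.read()

for url in ["https://scholar.google.com/citations?user=gQVuJh8AAAAJ",
            "https://scholar.google.com/scholar?q=wikidata"]:
    print(url, robots.can_fetch("*", url))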
So if there is some information you can get from Google Scholar, is it worth it?
python -m scholia.googlescholar get-user-data gQVuJh8AAAAJ
It is worth remembering that Wikidata has the P4028 property to link to Google Scholar articles. There ain’t many items using it yet though: 31. It was suggested by Vladimir Alexiev back in May 2017, but it seems that I am the only one using the property. Bot access to the link target provided by P4028 is – as far as I can see from the robots.txt – not allowed.
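For what it is worth, the count of items using P4028 can also be fetched programmatically from the Wikidata Query Service, here as a sketch in Python with the requests library:

import requests

# Count Wikidata items that have a Google Scholar article link (P4028).
query = """
SELECT (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P4028 ?google_scholar_article .
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={"query": query, "format": "json"})
print(response.json()["results"]["bindings"][0]["count"]["value"])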
According to Laura Martin, Franz Boas may have been the first to point to the relative richness of Eskimo words for snow: “Eskimo Words for Snow”: A Case Study in the Genesis and Decay of an Anthropological Example. American Anthropologist, 88(2):418. Boas listed aput, qana, piqsirpoq, and qimuqsuq. English may have snow, hail, sleet, ice, icicle, slush, and snowflake, as listed in the English Wikipedia article on Eskimo words for snow. There seem to be more than that, e.g., firn. Danish is not (as) polysynthetic as Eskimo, but it has lots of compounds, which make it possible to create a good number of words for snow. Most of these words derive from sne and is.
Update 2017-09-13: Added skosse.
|Danish|English|Description|
|bræ||large mass of ice|
|bundis||ice at the bottom of the ocean/sea|
|drivis||ice floating on the water, either “havis” or “søis”|
|firn|firn|snow older than a year|
|flodis||ice from a river|
|frostsne||snow below freezing, as opposed to tøsne|
|gletscheris||ice in/from a glacier|
|grå is||first stage of “ungis”, according to DMI|
|gråhvid is||second stage of “ungis”, according to DMI|
|hagl|hail|precipitation of small pellets of ice|
|haglkorn|hailstone|small pellet of ice|
|havis|sea ice|ice in the ocean/sea|
|indlandsis||Indlandsisen is the big “iskappe” in Greenland|
|is|ice|frozen water that is (usually) transparent|
|isbarriere||the edge of an “isshelf”, according to DMI|
|isblok||block of ice|
|isflade||sheet of ice|
|isfront||the edge of an “isshelf”|
|isfod||ice frozen to the coast or (second meaning) the ice below the water|
|iskalot||ice-covered area near the poles|
|iskant||the edge of a floe|
|iskappe|ice cap|very large connected mass of snow, e.g., the one in Greenland|
|iskorn||see also “kornsne”|
|isbræ||large mass of ice, the same as “bræ”|
|islag||layer of ice, not the same as “isslag”|
|isrand||the edge of a floe|
|isskorpe||layer of ice on top of water or snow|
|isslag|glaze, black ice, freezing rain|raindrops below freezing that become ice when hitting the ground or a structure|
|isstykke||a piece of ice|
|isvand|ice water|water with ice in it, usually for drinking|
|julesne|Christmas snow|snow falling or lying during Christmas|
|kunstsne|artificial snow|snow artificially made|
|nysne||recently fallen snow, as opposed to firn|
|pakis||“drivis” with a high concentration, according to DMI|
|polaris||sea ice that has survived at least one summer melting|
|puddersne|powder snow|light snow|
|rim|hard rime|“white ice that forms when the water droplets in fog freeze to the outer surfaces of objects”, according to the English Wikipedia|
|slud|sleet|a mixture of rain and falling snow|
|sne|snow|used about falling snow and snow on the ground|
|snebold|snowball|snow formed into a ball, e.g., to throw in a snowball fight|
|snebunke||pile of snow|
|snedrys||a small amount of falling snow|
|snedække||layer/cover of snow|
|snefygning||snow drifting in strong wind|
|snehule||snow formed as a cave for fun or survival, see also “igloo”|
|snehytte||more or less the same as an “iglo”|
|snelag||layer of snow|
|snemand|snowman|snow formed as a sculpture of a human|
|snemark||field of snow|
|snemasse||mass of snow|
|sneskred|avalanche|snow falling down a slope|
|snevejr|snow|weather with falling snow|
|tøris||(“tøris” is usually “dry ice”)|
|tøsne|melting snow|snow that is melting|
|ungis||sea ice between “tyndis” and “vinteris”, according to DMI|
Scholia is mostly a web service developed on GitHub at https://github.com/fnielsen/scholia in an open source fashion. It was inspired by discussions at the WikiCite 2016 meeting in Berlin. Anyone can contribute as long as their contribution is under the GPL.
I started to write the Scholia code back in October 2016 according to the initial commit at https://github.com/fnielsen/scholia/commit/484104fdf60e4d8384b9816500f2826dbfe064ce. Since then particularly Daniel Mietchen and Egon Willighagen have joined in, and Egon has lately been quite active.
Users can download the code and run the web service from their own computer if they have a Python Flask development environment. Otherwise the canonical web site for Scholia is https://tools.wmflabs.org/scholia/ which anyone with an Internet connection should be able to view.
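For those who want to try, a local setup could look something like the command lines below. The exact entry point is my assumption (the repository README is the authoritative source), so treat this as a sketch rather than a recipe:

git clone https://github.com/fnielsen/scholia.git
cd scholia
pip install -r requirements.txt
python runserver.py  # hypothetical entry point - check the README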
So what does Scholia do? The initial “application” was a “static” web page with a researcher profile/CV of myself based on data extracted from Wikidata. It is still available from http://people.compute.dtu.dk/faan/fnielsenwikidata.html. I added a static page for my research section, DTU Cognitive Systems, showing scientific paper production and a coauthor graph. This is available here: http://people.compute.dtu.dk/faan/cognitivesystemswikidata.html.
The Scholia web application was an extension of these initial static pages, so that a profile page for any researcher or any organization could be made on the fly. And it is now no longer just authors and organizations that have a profile page, but also works, venues (journals or proceedings), series, publishers, sponsors (funders) and awards. There are also “topics” and individual pages showing specialized information about chemicals, proteins, diseases and biological pathways. A rudimentary search interface is implemented.
The content of the Scholia web pages, with plots and tables, is made from queries to the Wikidata Query Service, – the extended SPARQL endpoint provided by the Wikimedia Foundation. We also pull in text from the introduction of the corresponding articles in the English Wikipedia. We modify the table output of the Wikidata Query Service so that individual items displayed in table cells link back to other items in Scholia.
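As an illustration of the kind of query behind these pages, here is a small Python sketch that asks the Wikidata Query Service for works by a given author; the Q-identifier is only an example, and Scholia’s actual SPARQL queries are more elaborate:

import requests

# List a few works authored (P50) by a given Wikidata item.
query = """
SELECT ?work ?workLabel WHERE {
  ?work wdt:P50 wd:Q20980928 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={"query": query, "format": "json"})
for row in response.json()["results"]["bindings"]:
    print(row["work"]["value"], row["workLabel"]["value"])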
Egon Willighagen, Daniel Mietchen and I have described Scholia and Wikidata for scientometrics in the 16-page workshop paper “Scholia and scientometrics with Wikidata”, https://arxiv.org/pdf/1703.04222.pdf. The screenshots shown in the paper have been uploaded to Wikimedia Commons. These and other Scholia media files are available in the category page https://commons.wikimedia.org/wiki/Category:Scholia.
Working with Scholia has been a great way to explore what is possible with SPARQL and Wikidata. One plot that I like is the “Co-author-normalized citations per year” plot on the organization pages. There is an example on this page: https://tools.wmflabs.org/scholia/organization/Q24283660. Here the citations to works authored by authors affiliated with the organization in question are counted and organized in a colored bar chart with respect to year of publication, – and normalized for the number of coauthors. The colored bar charts have been inspired by the “LEGOLAS” plots of Shubhanshu Mishra and Vetle Torvik.
Part of the Python Scholia code will also work as a command-line script for reference management in the LaTeX/BibTeX environment, using Wikidata as the backend. I have used this Scholia scheme for a couple of scientific papers I have written in 2017. The particular script is currently not well developed, so users would need to be indulgent.
Scholia relies on users adding bibliographic data to Wikidata. Tools from Magnus Manske are a great help, as are Fatameh by “T Arrow” and “Tobias1984” and the WikidataIntegrator of the GeneWiki people. Daniel Mietchen, James Hare and a user called “GZWDer” have been very active adding much of the science bibliographic information, and we are now past 2.3 million scientific articles on Wikidata. You can count them with this link: https://tinyurl.com/yaux3uac
The coverage of different researcher profile sites and their citation statistics varies. Google Scholar seems to be the site with the largest coverage, – it even crawls and indexes my slides. The open Wikidata is far from there, but may be the only one with machine-readable free access and advanced search.
Below are the citation statistics in the form of the h-index from five different services.
|h-index|Service|
|18|Web of Science|
Semantic Scholar does not give an overview of the citation statistics, and the counts are somewhat hidden on the individual article pages. I attempted as best I could to determine the value, but it might be incorrect.
I made a similar count on 8 May 2017 and reported it on the Wikicite slides (page 42). In the month and a half since that count, the Scopus figure has changed from 20 to 22.
Semantic Scholar is run by the Allen Institute for Artificial Intelligence, a non-profit research institute, so they may be interested in opening up their data for search. An API does, to my knowledge, not (yet?) exist, but they have a gentle robots.txt. It is also possible to download the full Semantic Scholar corpus from http://labs.semanticscholar.org/corpus/. (Thanks to Vladimir Alexiev for bringing this corpus to my attention.)
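If the corpus is, as I assume, distributed as gzipped JSON-lines files (one record per line), a first look could be taken with a few lines of Python; the file name and the “title” field are assumptions to be checked against the corpus documentation:

import gzip
import json

# Print the titles of the first few records in one (hypothetically named) corpus file.
with gzip.open("s2-corpus-00.gz", "rt", encoding="utf-8") as corpus_file:
    for line_number, line in enumerate(corpus_file):
        record = json.loads(line)
        print(record.get("title"))
        if line_number >= 4:
            break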