Coming Scholia, WikiCite, Wikidata and Wikipedia sessions

Posted on

In the coming months I will have three different talks on Scholia, WikiCite, Wikidata and Wikipedia at al.:

  • 3. October 2018 in DGI-byen, Copenhagen, Denmark as part of Visuals and Analytics that Matter conference, – the concluding conference for the DEFF-sponsored project Research Output & Impact Analyzed and Visualized (ROIAV).
  • 7. November 2018 in Mannheim as part of the Linked Open Citation Database (LOC-DB) 2018 workshop.
  • 13. december 2018 at the library of the Technical University of Denmark as part of Wikipedia – a media for sharing knowledge and research, an event for researchers and students (and still in the planning phase).

In september I presented Scholia as part of the Workshop on Open Citations. The slides with title Scholia as of September 2018 is available here.


A viewpoint on a viewpoint on Wikipedia’s neutral point of view

Posted on Updated on

I recently looked into what we have of Wikipedia research from Denmark and discovered several papers that I did not know about. I have now added some to Wikidata, so that Scholia can show a list of them.

Among the papers was one from Jens-Erik Mai titled Wikipedian’s knowledge and moral duties. Starting from the English Wikipedia’s Neutral Point of View (NPOV) policy, he stresses a dichotomy between the subjective and the object and argues for a rewrite of the policy. Mai claims the policy has an absolutistic center and a relativistic edge, corresponding to an absolutistic majority view and relativistic minority views.

As a long time Wikipedia editor, I find Mai’s exposition is too theoretical. I lack good exemplifications: cases where the NPOV fails, and I cannot see in what concrete way the NPOV policy should be changed to accommodate Mai’s critique. I am not sure that Wikipedians distinguish so much between the objective and the subjective; the key dichotomy is verifiability vs. not veriability, – that the statements in Wikipedia are supported by reliable sources. In terms of center-edge, I came to think of events associated with conspiracy theories. Here the “center” view could be the conventional view while the conspiracy views the edge. It is difficult for me to accommodate a standpoint that conspiracy theories should be accepted as equal as the conventional view. It is neither clear to me that the center is uncontested and uncontroversial. Wikipedia – like a newspaper – has the ability to represent opposing viewpoints. This is done by attributing the viewpoint to the reliable sources that express them. For instance, central in the description of evaluation of films are quotations from reviews of major newspapers and notable reviewers.

I don’t see the support for the claim that the NPOV policy assumes a “politically dangerous ethical position”. On the contrary, Wikipedia is now – after the increase of fake news – been called the “last bastion”. The example given in The Atlantic post is the recent social media fuzz with respect to Sarah Jeong where Wikipedians reach a work with “shared facts about reality.”

Addressing “addressing age-related bias in sentiment analysis”

Posted on Updated on

Algorithmic bias is one of the hot topics of research at the moment. There are observations of trained machine learning models that display sexism. For instance, the paper “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” (Scholia entry) neatly shows one example in its title with bias in word embeddings, –  shallow machine learning models trained on a large corpus of text.

A recent report investigated ageism bias in a range of sentiment analysis method, including my AFINN word list: “Addressing age-related bias in sentiment analysis” (Scholia entry). The researchers scraped sentences from blog posts and extracted those sentences with the word “old” and excluded the sentences where the word did not refer to the age of the person. They then replaced “old” with the word “young” (apparently also “older” and “oldest” was considered somehow). The example sentences they ended up with were, e.g., “It also upsets me when I realize that society expects this from old people” and “It also upsets me when I realize that society expects this from young people”. These sentences (242 in total) were submitted to 15 sentiment analysis tools and statistics was made “using multinomial log-linear regressions (via the R package nnet […])”.

I was happy to see that my AFINN was the only one in Table 4 surviving the test for all regression coefficients being non-significant. However, Table 5 with implicit age analysis showed some bias in my word list.

But after a bit of thought I wondered why there could be any kind of bias in my word list. The paper list an exponentiated intercept coefficient to be 0.733 with a 95%-confidence interval from 0.468 to 1.149 for AFINN. But if I examine what my afinn Python package reports about the words “old”, “older”, “oldest”, “young”, “younger” and “youngest”, I get all zeros, i.e., these words are not scored to be either positive or negative:


>>> from afinn import Afinn
>>> afinn = Afinn()
>>> afinn.score('old')
>>> afinn.score('older')
>>> afinn.score('oldest')
>>> afinn.score('young')
>>> afinn.score('younger')
>>> afinn.score('youngest')

It is thus strange why there can be any form a bias – even non-significant. For instance, for the two example sentences “It also upsets me when I realize that society expects this from old people” and “It also upsets me when I realize that society expects this from young people” my afinn Python package scores them both with the sentiment -2. This value comes solely from the word “upsets”. There can be no difference between any of the sentences when you exchange the word “old” with “young”.

In their implicit analysis of bias where they use a word embedding, there could possibly creep some bias in somewhere with my word list, although it is not clear for me how this happens.

The question is then what happens in the analysis. Does the multinomial log-linear regression give a questionable result? Could it be that I misunderstand a fundamental aspect of the paper? While som data seem to be available here, I cannot identify the specific sentences they used in the analysis.

“Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.”

Posted on Updated on

From Peter Brodersen I hear that the budget of the Danish government for next year allocates funds to Dansk Sprognævn for the release of the Retskrivningsordbogen – the Danish official dictionary for word spelling.

It is mentioned briefly in an announcement from the Ministry of Culture: “Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.”: 500.000 DKK allocated for the release of the dataset.

It is not clear under which conditions it is released. An announcement from Dansk Sprognævn writes “til sprogteknologiske formål” (to natural language processing purposes). I trust it is not just for natural language processing purposes, – but for every purpose!?

If it is to be used in free software/databases then a CC0 or better license is a good idea. We are still waiting for Wikidata for Wiktionary, the yet waporware with a multilingual, collaborative and structured dictionary. This ressource is CC0-based. The “old” Wiktionary has surprisingly not been used that much by natural language processing researcher. Perhaps because of the anarchistic structure of Wiktionary. Wikidata for Wiktionary could hopefully help with us with structuring lexical data and improve the size and the utility of lexical information. With Retskrivningsordbogen as CC0 it could be imported into Wikidata for Wiktionary and extended with multilingual links and semantic markup.

The problem with Andreas Krause

Posted on Updated on

I first seem to have run into the name “Andreas Krause” in connection with NIPS 2017. Statistics with the Wikidata Query Service shows “Andreas Krause” to be one of the most prolific authors for that particular conference.

But who is “Andreas Krause”?

Google Scholar lists five “Andreas Krause”. An ETH Zürich machine learning researcher, a pharmacometrics researcher, a wood researcher, a economish networkish researcher  working from Bath and a Dresden-based battery/nano researcher. All the NIPS Krause works should likely be attributed to the machine learning researcher, and a read of the works reveal the affiliation to be to ETH Zürich.

An ORCID search reveals six “Andrea Krause”. Three of the Andreas Krause have no or almost no further information about them beyond the name and the ORCID identifier.

There is a Immanuel Krankenhaus Berlin rheumatologist which does not seem to be in Google Scholar.

There may even be more than these six “Andreas Krause”. For instance, the article Emotional Exhaustion and Job Satisfaction in Airport Security Officers – Work–Family Conflict as Mediator in the Job Demands–Resources Model has affiliation with “School of Applied Psychology, University of Applied Sciences and Arts Northwestern Switzerland, Olten, Switzerland”, thus topic and affiliation do not quite fit in with any of the previously mentioned “Andreas Krause”.

One interesting ambiguity is for Classification of rheumatoid joint inflammation based on laser imaging – which obviously is a rheumatology work but also has some machine learning aspects. There is computer scientist/machine learner Volker Tresp as co-author and the work is published in an IEEE venue. There is no affiliation on the “Andreas Krause” on the paper. It is likely the work of the rheumatologist, but you could also guess on the machine learner.

Yet another ambiguity is Biomarker-guided clinical development of the first-in-class anti-inflammatory FPR2/ALX agonist ACT-389949. The topic somewhat overlap between the pharmacokinetics and the domain of the Berlin researcher. The affiliation is to “Clinical Pharmacology, Actelion”, but interestingly, Google Scholar does not associate this paper with the pharmacokinetics researcher.

In conclusion, author disambiguation may be very difficult.

Scholia will can show the six Andreas Krause. But I am not sure that helps us very much.

Can you scrape Google Scholar?

Posted on

With the WikiCite project, the bibliographic information on Wikidata is increasing rapidly with Wikidata describing 9.3 million scientific articles and 36.6 million citations. As far as I can determine most of the work is currently done by James Hare and Daniel Mietchen. Mietchen’s Research Bot is over 11 million edits on Wikidata while Hare has 15 million edits. For entering data into Wikidata from PubMed you can basically walk your way through PMID starting with “1” with the Fatameh tool. Hare’s reference work can take advantage of a webservice provided by National Institute of Health. For instance, a URL such will return a JSON formatted result with citation information. This specific URL is apparently what Hare used to setup P2860 citation information in Wikidata, see, e.g., CrossRef may be another resource.

Beyond these resources, we could potentially use Google Scholar. A former terms of service/EULA of Google Scholar stated that: “You shall not, and shall not allow any third party to: […] (j) modify, adapt, translate, prepare derivative works from, decompile, reverse engineer, disassemble or otherwise attempt to derive source code from any Service or any other Google technology, content, data, routines, algorithms, methods, ideas design, user interface techniques, software, materials, and documentation; […] “crawl”, “spider”, index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof); (m) create or attempt to create a substitute or similar service or product through use of or access to any of the Service or proprietary information related thereto“. Here is “create or attempt to create a substitute or similar service” a stopping point.

The Google Scholar terms document seems now to have been superseded by the all embracing Google Terms of Service. This document seems less restrictive: “Don’t misuse our Services” and “You may not use content from our Services unless you obtain permission from its owner or are otherwise permitted by law.” So it may be or may not be ok to crawl and/or use/republish the data from Google Scholar. See also a StackExchange question. and another StackExchange question.

The Google robots.txt limits automated access with the following relevant lines:

Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholar_share

“/citations?user=” means that you are allowed to bot access the user profiles. Google Scholar user identifiers may be recorded in Wikidata by a dedicated property, so you could automatically access Google Scholar user profiles from the information in Wikidata.

So if there is some information you can get from Google Scholar is it worth it?

The Scholia code now adds a module with some preliminary Google Scholar processing attempts. There is command-line based scraping of a researcher profile. For instance,

python -m scholia.googlescholar get-user-data gQVuJh8AAAAJ

It ain’t not working too well. As far as I can determine you need to page with JavaScript to get more than the initial 20 results (it would be interesting to examine the Publish or Perish software to see how a larger set of results is obtained). Not all bibliographic metadata is available for each item on the Google Scholar page – as far as I see: No DOI. No PubMed identifier. The author list may be abbreviated with an ellipsis (‘…’). Matching of the Google Scholar item with data already present in Wikidata seems not that straightforward.

It is worth remembering that Wikidata has the P4028 property to link to Google Scholar articles. There ain’t no many items using it yet though: 31. It was suggested by Vladimir Alexiev back in May 2017, but it seems that I am the only one using the property. Bot access to the link target provided by P4028 is – as far as I can see from the robots.txt – not allowed.

Danish words for snow

Posted on Updated on

According to Laura Martin, Franz Boas may have been the first to point to the relative richness of Eskimo words for snow: “Eskimo Words for Snow”: A Case Study in the Genesis and Decay of an Anthropological Example. American Anthropologist, 88(2):418. Boas listed aput, qana, piqsirpoz, and qimuqsuq. English may have snow, hail, sleet, ice, icicle, slush, and snowflake as listed on the English Wikipedia on Eskimo words for snow. There seems to be more than that, e.g., firn. Danish is not (as) polysynthetic as Eskimo, but it has lots of compounds, which make it possible to create a good number of words for snow. Most of these words derive from sne and is.

Update 2017-09-13: Added skosse.

Word Translation Explanation
bræ large mass of ice
bundis ice at the bottom of the ocean/sea
drivis ice floating on the water, either “havis” og “søis”
firn firn snow older than a year
flodis ice from a river
fnug snowflake
frostsne snow below freezing, as oppose to tøsne
fygesne drifting snow
gletsjer/gletscher glacier
gletscheris ice in/from a glacier
grå is first stage of “ungis”, according to DMI
gråhvid is second state of “ungis”, according to DMI
hagl hail precipitate with small pellets of ice
haglkorn hailstone small pellet of ice
iglo/igloo  iglo
havis sea ice ice in the ocean/sea
indlandsis Indlandsisen is the big “iskappe” in Greenland
is ice frozen water that is (usually) transparent
isbarriere the edge of an “isshelf”, according to DMI
isblok block of ice
isflade sheet of ice
isflage floe
isfront the edge og a “isshelf”
isfod ice frozen to the coast or (second meaning) the ice below the water
iskalot ice-covered area near the poles
iskant the edge of a floe
iskappe ice cap very large connected mass of snow, e.g., the one in Greenland
iskorn see also “kornsne”
isbræ large mass of ice, the same as “bræ”
islag layer of ice, not the same as “isslag”
isrand the edge og a floe
isshelf floating gletcher
isskorpe layer of ice on top of water or snow
isskruning ice pack
isslag glaze, black ice, freezing rain raindrops below freezing that becomes ice when hitting the ground or structure
isstykke a piece of ice
istap  icicle
isterning ice cube
isvand ice water water with ice in it, usually for drinking
julesne  Christmas snow snow falling or lying during Christmas
kunstsne  artificial snow snow artificially made
lavine avalanche
nysne snow recently falling, as opposed to firn
pakis “drivis” with a high concentration, according to DMI
polaris sea ice that have survived at least one summer meting
puddersne powder snow light snow
rim hard rime “white ice that forms when the water droplets in fog freeze to the outer surfaces of objects.” according to English Wikipedia
sjap slush
sjapis slush ice
slud sleet a mixture of rain and falling snow
sne snow used about falling snow and snow on the ground
snebold snowball snow formed as a ball, of used to through in a snowball fight
snebunke  pile of snow
snebyge snow shower
snedrive snowdrift
snedrys small amount of precipitation of snow
snedække layer/cover of snow
snefnug snowflake
snefog snowdrift
snefygning snow in strong wind
snehule snow formed as a cave for fun or survival, see also “igloo”
snehytte more or less the same as an “iglo”
snekorn snow grain
snelag layer of snow
snemand snowman snow formed as a sculpture of a human
snemark field of snow
snemasse mass of snow
snesjap slush
sneskred avalanche snow falling down a slope
snestorm snowstorm
snevejr snow weather with falling snow
tyndis thine ice
sjap sleet
søis “lake ice”
tøris (“tøris” is usually “dry ice”)
tøsne melting snow snow that is melting
ungis Sea ice between “tyndis” and “vinteris”, according to DMI