
The problem with Andreas Krause


I seem to have first run into the name “Andreas Krause” in connection with NIPS 2017. Statistics from the Wikidata Query Service show “Andreas Krause” to be one of the most prolific authors for that particular conference.

But who is “Andreas Krause”?

Google Scholar lists five “Andreas Krause”: an ETH Zürich machine learning researcher, a pharmacometrics researcher, a wood researcher, an economics-and-networks researcher based in Bath, and a Dresden-based battery/nano researcher. All the NIPS Krause works should likely be attributed to the machine learning researcher, and a read of the works reveals the affiliation to be ETH Zürich.

An ORCID search reveals six “Andreas Krause”. Three of them have no or almost no further information beyond the name and the ORCID identifier.

There is an Immanuel Krankenhaus Berlin rheumatologist who does not seem to be in Google Scholar.

There may even be more than these six “Andreas Krause”. For instance, the article Emotional Exhaustion and Job Satisfaction in Airport Security Officers – Work–Family Conflict as Mediator in the Job Demands–Resources Model lists the affiliation “School of Applied Psychology, University of Applied Sciences and Arts Northwestern Switzerland, Olten, Switzerland”, so neither topic nor affiliation quite fits any of the previously mentioned “Andreas Krause”.

One interesting ambiguity is Classification of rheumatoid joint inflammation based on laser imaging, which is obviously a rheumatology work but also has some machine learning aspects: the computer scientist/machine learner Volker Tresp is a co-author and the work is published in an IEEE venue. The paper gives no affiliation for its “Andreas Krause”. It is likely the work of the rheumatologist, but one could also guess the machine learner.

Yet another ambiguity is Biomarker-guided clinical development of the first-in-class anti-inflammatory FPR2/ALX agonist ACT-389949. The topic overlaps somewhat between pharmacometrics and the domain of the Berlin researcher. The affiliation is “Clinical Pharmacology, Actelion”, but interestingly, Google Scholar does not associate this paper with the pharmacometrics researcher.

In conclusion, author disambiguation may be very difficult.

Scholia can show the six Andreas Krause. But I am not sure that helps us very much.
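For the curious, here is a minimal sketch of a Wikidata Query Service lookup for people with that name. It matches on the exact English label, so items labeled differently will be missed:

import requests

# Find humans (Q5) whose English label is exactly "Andreas Krause"
query = """
SELECT ?person ?description WHERE {
  ?person wdt:P31 wd:Q5 ;
          rdfs:label "Andreas Krause"@en .
  OPTIONAL {
    ?person schema:description ?description .
    FILTER(LANG(?description) = "en")
  }
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
for row in response.json()['results']['bindings']:
    print(row['person']['value'],
          row.get('description', {}).get('value', ''))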


Text mining in SQL?


Are you able to do text mining in SQL?

One problem with SQL is the incompatible functional interfaces. If you read Jonathan Gennick’s book SQL Pocket Guide you will note that a lot of effort goes into explaining the differences between SQL engines. This is particularly the case for string functions. A length function may be called LEN or LENGTH, except when it is called LENGTHB, LENGTH2 or LENGTH4. String concatenation may be done with “||”, or perhaps CONCAT or “+”. Substring extraction is with SUBSTRING, except when it is SUBSTR. And so on. In conclusion, writing general text mining SQL that works across SQL engines seems difficult, unless you invent a meta-SQL language and an associated compiler.
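One could imagine the beginnings of such a meta-SQL layer as a simple lookup from engine to function spelling. A toy sketch in Python – the engine names and the helper function are hypothetical, just to illustrate the idea:

# Hypothetical mapping from SQL engine to its length-function spelling
LENGTH_FUNCTION = {'sqlite': 'LENGTH', 'postgresql': 'LENGTH',
                   'mysql': 'LENGTH', 'mssql': 'LEN'}

def length_sql(column, engine):
    """Return an engine-specific SQL expression for string length."""
    return "{}({})".format(LENGTH_FUNCTION[engine], column)

print(length_sql('text', 'sqlite'))   # LENGTH(text)
print(length_sql('text', 'mssql'))    # LEN(text)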

But apart from the problem of SQL incompatibility, what kind of text mining can be done in SQL? I haven’t run into text mining in SQL before, but here and there you will find some attempts, e.g., Ralph Winters shows tf–idf scaling in his slides Practical Text Mining with SQL using Relational Databases.

Below I will try a very simple word list-based sentiment analysis in SQL. I will use two Danish datasets: “lcc-sentiment” that is my derivation of a Danish text in the Leipzig Corpora Collection and my sentiment word list “AFINN” available in the afinn Python package.

Let’s first get some text data into a SQLite database. We use a comma-separated values file from my lcc-sentiment GitHub repository.

import pandas as pd
import sqlite3

url = "https://raw.githubusercontent.com/fnielsen/lcc-sentiment/master/dan_mixed_2014_10K-sentences.csv"

csv_data = pd.read_csv(url, encoding='utf-8')
# Fix column name error: strip stray whitespace from the first column name
csv_data = csv_data.rename(columns={
    csv_data.columns[0]: csv_data.columns[0].strip()})

with sqlite3.connect('text.db') as connection:
    csv_data.to_sql('text', connection)

We will also put the AFINN word list into a table:

url = "https://raw.githubusercontent.com/fnielsen/afinn/master/afinn/data/AFINN-da-32.txt"
afinn_data = pd.read_csv(url, encoding='utf-8', sep='\t',
                         names=['word', 'sentiment'])

with sqlite3.connect('text.db') as connection:
    afinn_data.to_sql('afinn', connection)

Here we have used the Pandas Python package which very neatly downloads and reads comma- or tab-separated values files and adds the data to a SQL table in very few lines of code.

With the sqlite3 command-line program we can view the first 3 rows in the “text” table constructed from the lcc-sentiment data:

sqlite> SELECT * FROM text LIMIT 3;
0|1|0.0|09:05 DR2 Morgen - med Camilla Thorning og Morten Schnell-Lauritzen Nyheder med overblik, baggrund og udsyn.
1|2|2.0|09-10 sæson Spa Francorchamps S2000 Vinter Cup 10. januar 2011, af Ian Andersen Kvalifikation: Det gik nogenlunde, men bilen føltes god.
2|3|0.0|½ time og pensl dem derefter med et sammenpisket æg eller kaffe.

For word tokenization we would like to avoid special characters around word tokens. We can clean the text in this way:

SELECT REPLACE(REPLACE(REPLACE(LOWER(text), '.', ' '), ',', ' '), ':', ' ') FROM text;

Now the text is lowercased and mostly separated by spaces. Some form of regular expression substitution would have been nice here – we definitely lack cleaning of some other special characters. As far as I understand, a regular expression replace function is not readily available in SQLite, but on Stack Overflow there is a pointer from Vishal Tyagi to an extension implementing the functionality.
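If you drive SQLite from Python, another workaround is to register a user-defined function with sqlite3 and do the regular expression substitution in Python. A minimal sketch, where the function name regexp_replace and its argument order are my own choices, not SQLite built-ins:

import re
import sqlite3

with sqlite3.connect('text.db') as connection:
    # Register a three-argument SQL function backed by Python's re.sub
    connection.create_function(
        'regexp_replace', 3,
        lambda string, pattern, replacement:
            re.sub(pattern, replacement, string))
    for (cleaned,) in connection.execute(
            "SELECT regexp_replace(LOWER(text), '[^a-zæøå0-9 ]', ' ') "
            "FROM text LIMIT 3"):
        print(cleaned)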

Splitting the text is somewhat more complicated and beyond my SQL capabilities. Other developers have run into the problem and suggested solutions on Stack Overflow. They tend to use what Wikipedia calls recursive common table expressions. There is a further complication because we not only need to split a single string, but must “iterate” over all rows with texts.

One Stack Overflow user who had the problem was sigalor, and s/he managed to construct a solution. The code below copies and edits sigalor’s code: it changes column and table names, splits on spaces rather than on commas, and includes the text cleaning with REPLACE (remember, Stack Overflow code is CC BY-SA 3.0). sigalor initially constructs another table holding the maximum text ID.

From the returned data I construct a “docterm” table.

-- create temporary table which buffers the maximum article ID, because SELECT MAX can take a very long time on huge databases
DROP TABLE IF EXISTS max_text_id;
CREATE TEMP TABLE max_text_id(num INTEGER);
INSERT INTO max_text_id VALUES((SELECT MAX(number) FROM text));

DROP TABLE IF EXISTS docterm;
CREATE TABLE docterm AS 
WITH RECURSIVE split(text_id, word, str, offsep) AS
(
    VALUES ( 0, '', '', 0 )   -- seed row before the first text
    UNION ALL
    SELECT
        -- text_id: move on to the next text when the current one is exhausted
        CASE WHEN offsep==0 OR str IS NULL
            THEN text_id+1
            ELSE text_id
        END,
        -- word: the next space-delimited token of the current text
        CASE WHEN offsep==0 OR str IS NULL
            THEN ''
            ELSE substr(str, 0,
              CASE WHEN instr(str, ' ')
                  THEN instr(str, ' ')
                  ELSE length(str)+1
              END)
        END,
        -- str: when starting a new text, fetch and clean it;
        -- otherwise drop the token just emitted
        CASE WHEN offsep==0 OR str IS NULL
            THEN (SELECT
                    REPLACE(
                    REPLACE(
                    REPLACE(
                    REPLACE(lower(text), '.', ' '), ',', ' '), ':', ' '), '!', ' ')
                  FROM text WHERE number=text_id+1)
            ELSE ltrim(substr(str, instr(str, ' ')), ' ')
        END,
        -- offsep: separator position; 0 signals that the next text should be loaded
        CASE WHEN offsep==0 OR str IS NULL
            THEN 1
            ELSE instr(str, ' ')
        END
        FROM split
        WHERE text_id<=(SELECT * FROM max_text_id)
) SELECT text_id, word FROM split WHERE word != ''; 

I can’t say it’s pretty, but we now have a docterm table.

I haven’t checked, but I seriously suspect that there are compatibility issues, e.g., with the INSTR function. In Microsoft SQL Server I suppose you would need CHARINDEX instead, with a reordering of the input arguments to that function. SUBSTR/SUBSTRING is another problem.

We can take a look at the generated docterm table with

sqlite> SELECT * FROM docterm LIMIT 10 OFFSET 10; 

yielding

1|schnell-lauritzen
1|nyheder
1|med
1|overblik
1|baggrund
1|og
1|udsyn
2|09-10
2|sæson
2|spa

We can now make a sentiment computation:

SELECT
  text_id, SUM(afinn.sentiment) AS sentiment, text.text
FROM docterm, afinn, text
WHERE docterm.word = afinn.word AND docterm.text_id = text.number
GROUP BY text_id
ORDER BY sentiment
LIMIT 10;

Note that this is vanilla SQL.

I suspect there could be performance issues as there is no index on the word columns.
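A minimal sketch of adding such indexes from Python – the index names here are my own choices:

import sqlite3

with sqlite3.connect('text.db') as connection:
    # Index the join columns used in the sentiment query
    connection.execute(
        'CREATE INDEX IF NOT EXISTS docterm_word_index ON docterm(word)')
    connection.execute(
        'CREATE INDEX IF NOT EXISTS afinn_word_index ON afinn(word)')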

The SQL results in:

2892|-11|Du maa have lært, at du ikke formaar at bære livets trængsler og sorger, men at uden Kristus kan du kun synke sammen under byrden i bekymringer og klager og mismod.
2665|-10|Det vil sige med familier, hvor der fx var vold i familien, misbrug, seksuelle overgreb, arbejdsløshed eller mangel på en ordentlig bolig.
360|-9|Angst: To tredjedele helbredes for panikangst inden for otte år, en tredjedel helbredes for social angst og kun en lav andel for generaliseret angst.
5309|-9|I sværere tilfælde mærkes pludselig jagende smerter i musklen (”delvis muskelbristning”, ”fibersprængning”) og i værste fald mærkes et voldsom smæld, hvorefter det er umuligt at bruge musklen (”total muskelbristning”).
7760|-9|Organisationen er kommet i besiddelse af tre videoer, der viser egyptiske fanger, som tilsyneladende er blevet tortureret og derefter dræbt.
8031|-9|Problemer med stress og dårligt psykisk arbejdsmiljø er som oftest et problem for flere faggrupper.
8987|-9|Tabuer blev brudt med opfordringen til afslutningen af Mubaraks styre, og med det eksplicitte krav om at sætte politiets generaler på anklagebænken for tortur og ulovlige arrestationer.
9299|-9|Udbedring af skader som følge af forkert rygning er dyr og besværlig, da rygningen er svært tilgængelig – og ofte kræver stillads før reparationer kan sættes i gang.
9477|-9|Ved at udføre deres angreb på israelske jøder som en del af det større mål at dræbe jøderne, som angivet i Hamas Pagten krænker mange af de palæstinensiske terrorister også Konventionen om at Undgå og Straffe Folkedrab.
1205|-8|Den 12. juli indledte den 21. panserdivision et modangreb mod Trig 33 og Point 24, som blev slået tilbage efter 2½ times kamp, med mere end 600 tyske døde og sårede efterladt strøet ud over området foran de australske stillinger.

This list shows the texts scored as having the most negative sentiment. In the above ten examples, there seem to be no opinions; most are rather descriptive. They mostly describe various “bad” situations: diseases, families with violence, etc.

If we change to ORDER BY sentiment DESC, then we get the texts estimated to be most positive.

3957|17|God webshop - gode priser og super hurtig forsendelse :) Super god kunde betjening, vi vil klart bestille her fra igen næste gang og anbefaler stedet til andre.
8200|14|ROSE RBS 19 glasses set - KAUFTIPP MountainBIKE 5/2014 ProCycling 04/2014: ProCycling har testet ROSE XEON X-LITE 7000 Konklusion: ROSE cyklen har det hele - den er stabil, super let og giver en hurtig styring.
8610|13|Silkeborg mad ud af huset til en god anledning Til enhver festlig komsammen i Silkeborg området hører festlige gæster, god stemning og god mad sig til.
5356|12|Ja, man ved aldrig hvordan de der glimmermist sprayer, men jeg kan også godt lide tilfældigheden i det. laver nærmest aldrig et LO uden glimmermist :) Og så ELSKER jeg bare Echo Park - det er det FEDESTE mærke Et super dejligt lo.
411|11|Århus-borgmester Nikolaj Vammen er begejstret for VIA University College, TEKOs indtog i Århus: ”Det er en fantastisk god nyhed, at TEKO nu lancerer sine kreative uddannelser inden for mode- og livsstilsbranchen i Århus.
3066|11|Elsker Zoo og er så heldig at jeg bor lige ved siden af, så ungerne nyder rigtig godt at især børneZoo :o) Jeg elsker alle jeres søde kommentarer, og forsøger så vidt muligt at svare på hver enkelt men det tager måske lidt tid.
5483|11|Jeg har bestilt ny cykel, som jeg først får i uge 19. Derfor Håber at I får en fantastisk, sjov og lærerig dag Regn er bare naturen, der sveder tilbage på os Administratoren har deaktiveret offentlig skrive adgang.
8180|11|Rigtig fint alternativ til en helt almindelig pailletjakke med de meget fine mønstre :-) Er sikker på, at du nok skal style den godt – og jeg ser frem til at få et kig med ;-) 21. august 2013 at 14:27 Tak, det er skønt at høre!
9918|11|Vi vil hermed sige tak for en rigtig god aften, med en god musik og god DJ.
84|10|34 kommentarer Everything Se indlæg isabellathordsen I sidste uge købte jeg disse fantastiske solbriller og denne smukke smukke russiske ring.

The first one is a very positive (apparent) review of a webshop (“Good webshop – good prices and super quick delivery …”), the second praise of a bike. Most – if not all – of the 10 texts seem to be positive.

Text mining may not be particularly convenient in SQL, but it might be a “possible” option if there are problems with interfacing to languages that are more suitable for text mining.

Female GitHubbers


In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There aren’t many: currently just 27.

The Python code below gets the SPARQL results into a Pandas DataFrame, queries the GitHub API for the followers count, and adds the information to a dataframe column. Then we can rank the female GitHub users according to follower count and format the results in an HTML table.

Code

import re
from time import sleep

import requests
import pandas as pd

query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("https://github.com/", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.io.json.json_normalize(response.json()['results']['bindings'])

URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    # Skip user names with unexpected characters (note: hyphens in
    # GitHub user names are not handled by this check)
    if not re.match(r'^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    try:
        response = requests.get(url,
                                headers={'Accept':'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        # Missing 'followers' field, nonexistent user, or rate limiting
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))
    sleep(5)   # Pause between requests to be gentle on the GitHub API

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(by='followers',
                              ascending=False)[columns].to_html(index=False))

Results

The top one is Jennifer Bryan, a Vancouver statistician whom I do not know much about, but she seems to be involved in RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat apparently left the Poldrack lab in 2016, according to her CV). Further down the list we have people from the wiki world: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

I was surprised to see that Isis Agora Lovecruft is not there, but there is no Wikidata item representing her. She would have been number three.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners” on their GitHub contribution calendars; Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers github.value researcherLabel.value researcher.value
1675 jennybc Jennifer Bryan http://www.wikidata.org/entity/Q40579104
1299 jesstess Jessica McKellar http://www.wikidata.org/entity/Q19667922
475 triketora Tracy Chou http://www.wikidata.org/entity/Q24238925
347 olgabot Olga B. Botvinnik http://www.wikidata.org/entity/Q44163048
124 vsoch Vanessa V. Sochat http://www.wikidata.org/entity/Q30133235
84 brainwane Sumana Harihareswara http://www.wikidata.org/entity/Q18912181
75 lydiapintscher Lydia Pintscher http://www.wikidata.org/entity/Q18016466
56 agbeltran Alejandra González-Beltrán http://www.wikidata.org/entity/Q27824575
22 frimelle Lucie-Aimée Kaffee http://www.wikidata.org/entity/Q37860261
21 isabelleaugenstein Isabelle Augenstein http://www.wikidata.org/entity/Q30338957
20 cnap Courtney Napoles http://www.wikidata.org/entity/Q42797251
15 tudorache Tania Tudorache http://www.wikidata.org/entity/Q29053249
13 vedina Nina Jeliazkova http://www.wikidata.org/entity/Q27061849
11 mkutmon Martina Summer-Kutmon http://www.wikidata.org/entity/Q27987764
7 caoyler Catalina Wilmers http://www.wikidata.org/entity/Q38915853
7 esterpantaleo Ester Pantaleo http://www.wikidata.org/entity/Q28949490
6 NuriaQueralt Núria Queralt Rosinach http://www.wikidata.org/entity/Q29644228
2 rongwangnu Rong Wang http://www.wikidata.org/entity/Q35178434
2 lschiff Lisa Schiff http://www.wikidata.org/entity/Q38916007
1 SigridK Sigrid Klerke http://www.wikidata.org/entity/Q28152723
1 amrapalijz Amrapali Zaveri http://www.wikidata.org/entity/Q34315853
1 mesbahs Sepideh Mesbah http://www.wikidata.org/entity/Q30098458
1 ChristineChichester Christine Chichester http://www.wikidata.org/entity/Q19845665
1 BinaryStars Shima Dastgheib http://www.wikidata.org/entity/Q42091042
1 mollymking Molly M. King http://www.wikidata.org/entity/Q40705344
0 jannahastings Janna Hastings http://www.wikidata.org/entity/Q27902110
0 nmjakobsen Nina Munkholt Jakobsen http://www.wikidata.org/entity/Q38674430

Danish stopword lists


Python’s NLTK package has some support for Danish, including a small list of 94 stopwords. They are available with:

>>> import nltk
>>> nltk.corpus.stopwords.words('danish')

MIT-licensed spaCy is another NLP Python package. Its support for Danish is still limited, but it does have a stopword list. With version 2+ of spaCy, the words are available from:

>>> from spacy.lang.da.stop_words import STOP_WORDS

spaCy 2.0.3 has 219 words in that list.

MIT-licensed “stopwords-iso” has a Danish list of 170 words (October 2016 version). It is available from the GitHub repositories at https://github.com/stopwords-iso.
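A small sketch of reading the Danish list from Python; the exact repository layout and file name here are my assumptions about the stopwords-iso organization:

import requests

# Assumed location of the Danish list in the stopwords-iso GitHub organization
url = ("https://raw.githubusercontent.com/stopwords-iso/"
       "stopwords-da/master/stopwords-da.json")
stopwords_da = requests.get(url).json()
print(len(stopwords_da))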

The Snowball stemmer has 94 words at http://snowball.tartarus.org/algorithms/danish/stop.txt.

In R, the GPL-3-licensed tm package uses the Snowball stemmer stopword list. The 94 words are available with:

> install.packages("tm")
> library(tm)
> stopwords(kind="da")

The NLTK stopwords are also the same as the Snowball stopwords. This can be checked with:

import re
import nltk
import requests

url = "http://snowball.tartarus.org/algorithms/danish/stop.txt"
snowball_stopwords = re.findall(r'^(\w+)', requests.get(url).text,
                                flags=re.MULTILINE | re.UNICODE)
nltk_stopwords = nltk.corpus.stopwords.words('danish')
print(snowball_stopwords == nltk_stopwords)

A search with an Internet search engine on “Danish stopwords” reveals several other pointers to lists.

Find titles of all works published by DTU Cognitive Systems in 2017


Find titles of all works published by DTU Cognitive Systems in 2017! How difficult can that be? To identify all titles of works from a research organization? With Wikidata and the Wikidata Query Service (WDQS) at hand, it shouldn’t be that difficult. Nevertheless, I ran into a few hitches:

  1. There is what we can call the “Nathan Churchill Problem”: Nathan Churchill was at one point affiliated with our research section, Cognitive Systems, and wrote papers, e.g., together with our Morten Mørup. One paper clearly identifies him as affiliated with our section. Searching the DTU website yields no homepage for him though. He is now at St. Michael’s Hospital, Toronto, according to a newer paper. So is he no longer affiliated with the Cognitive Systems section? That is somewhat difficult to establish with credible and citable sources. If he is not, then any simple SPARQL query on the WDQS for Cognitive Systems papers will yield his new papers, which shouldn’t be counted as Cognitive Systems section papers. If we could point to a source indicating that his affiliation with our section has ended, we could add a qualifier to the P1416 property in his Wikidata entry and extend the SPARQL query. What I ended up doing was to explicitly filter out two of Churchill’s publications with the ugly line “FILTER(?work != wd:Q42595201 && ?work != wd:Q36384548)“. The problem is of course not confined to Churchill. For instance, Scholia currently lists new publications by our Søren Hauberg on the Scholia page for DIKU – a department where he was previously affiliated. We discussed the affiliation problem a bit in the Scholia paper, see page 253 (page 17).
  2. Datetime datatype conversion with xsd:dateTime. The filter on date uses this line: “FILTER(?publication_datetime >= "2017-01-01"^^xsd:dateTime)“. Something like “FILTER(?publication_datetime >= xsd:dateTime(2017))” does not work.
  3. Missing data. It is difficult to establish how complete the Wikidata listing is for our section with respect to publications. Scraping Google Scholar, PubMed and our local university database of publications could be a possibility, but this is far from streamlined with the tools I have developed.

The full query is listed below and the result is available from this link. Currently, 48 results are returned.

#defaultView:Table
SELECT ?workLabel 
WITH {
  SELECT 
    ?work (MIN(?publication_datetime) AS ?datetime)
  WHERE {
    # Find CogSys work
    ?researcher wdt:P108 | wdt:P463 | wdt:P1416/wdt:P361* wd:Q24283660 .
    ?work wdt:P50 ?researcher .
    ?work wdt:P31 wd:Q13442814 .
    
    # Nathan Churchill seems no longer to be affiliated!?
    FILTER(?work != wd:Q42595201 && ?work != wd:Q36384548)
    
    # Filter to year 2017
    ?work wdt:P577 ?publication_datetime .
    FILTER(?publication_datetime >= "2017-01-01"^^xsd:dateTime)
  }
  GROUP BY ?work 
} AS %results
WHERE {
  INCLUDE %results
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,ja,nl,ru,zh". }
}

Can you scrape Google Scholar?


With the WikiCite project, the bibliographic information on Wikidata is increasing rapidly, with Wikidata now describing 9.3 million scientific articles and 36.6 million citations. As far as I can determine, most of the work is currently done by James Hare and Daniel Mietchen. Mietchen’s Research Bot has made over 11 million edits on Wikidata, while Hare has 15 million edits. For entering data into Wikidata from PubMed you can basically walk your way through PMIDs starting with “1” with the Fatameh tool. Hare’s reference work can take advantage of a web service provided by the National Institutes of Health. For instance, a URL such as https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=5585223 will return a JSON-formatted result with citation information. This specific URL is apparently what Hare used to set up P2860 citation information in Wikidata, see, e.g., https://www.wikidata.org/wiki/Q41620192#P2860. CrossRef may be another resource.
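As a sketch of what this web service returns, the cited PubMed identifiers can be pulled out of the JSON along these lines; the key layout follows the usual eutils elink response and should be treated as an assumption:

import requests

url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
       "?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=5585223")
data = requests.get(url).json()

# Drill into the elink response for the list of cited PubMed identifiers
pmids = data['linksets'][0]['linksetdbs'][0]['links']
print(len(pmids), pmids[:5])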

Beyond these resources, we could potentially use Google Scholar. A former terms of service/EULA of Google Scholar stated: “You shall not, and shall not allow any third party to: […] (j) modify, adapt, translate, prepare derivative works from, decompile, reverse engineer, disassemble or otherwise attempt to derive source code from any Service or any other Google technology, content, data, routines, algorithms, methods, ideas design, user interface techniques, software, materials, and documentation; […] “crawl”, “spider”, index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof); (m) create or attempt to create a substitute or similar service or product through use of or access to any of the Service or proprietary information related thereto“. Here, “create or attempt to create a substitute or similar service” is a stopping point.

The Google Scholar terms document now seems to have been superseded by the all-embracing Google Terms of Service. This document seems less restrictive: “Don’t misuse our Services” and “You may not use content from our Services unless you obtain permission from its owner or are otherwise permitted by law.” So it may or may not be OK to crawl and/or use/republish data from Google Scholar. See also one Stack Exchange question and another Stack Exchange question.

The Google robots.txt limits automated access with the following relevant lines:

Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?*cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholar_share

The “Allow: /citations?user=” line means that bot access to the user profiles is allowed. Google Scholar user identifiers may be recorded in Wikidata with a dedicated property (P1960, Google Scholar author ID), so one could automatically access Google Scholar user profiles from the information in Wikidata.
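A minimal sketch of pulling such identifiers from Wikidata and forming profile URLs, assuming P1960 is the property in question:

import requests

# Fetch a handful of Google Scholar author IDs (P1960) from Wikidata
query = """
SELECT ?researcher ?scholar WHERE {
  ?researcher wdt:P1960 ?scholar .
}
LIMIT 10
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
for row in response.json()['results']['bindings']:
    print("https://scholar.google.com/citations?user=" +
          row['scholar']['value'])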

So if there is some information you can get from Google Scholar, is it worth it?

The Scholia code now includes a googlescholar.py module with some preliminary Google Scholar processing attempts. There is command-line-based scraping of a researcher profile. For instance:

python -m scholia.googlescholar get-user-data gQVuJh8AAAAJ

It isn’t working too well. As far as I can determine, you need to page with JavaScript to get more than the initial 20 results (it would be interesting to examine the Publish or Perish software to see how a larger set of results is obtained). Not all bibliographic metadata is available for each item on the Google Scholar page – as far as I see: no DOI, no PubMed identifier. The author list may be abbreviated with an ellipsis (‘…’). Matching the Google Scholar items with data already present in Wikidata seems not that straightforward.

It is worth remembering that Wikidata has the P4028 property to link to Google Scholar articles. There aren’t many items using it yet though: 31. It was suggested by Vladimir Alexiev back in May 2017, but it seems that I am the only one using the property. Bot access to the link target provided by P4028 is – as far as I can see from the robots.txt – not allowed.

Do we have a final schema for Wikicite?


No, Virginia, we do not have a final schema for Wikicite IMHO.

Wikicite is a project that focuses on sources in the Wikimedia universe. Currently, the most active part of Wikicite is the setup of bibliographic data for scientific articles in Wikidata with the tools of Magnus Manske, the Fatameh duo and the GeneWiki people; in particular James Hare, Daniel Mietchen and Magnus Manske have been active in automatic and semi-automatic setup of data. Jakob Voß’ statistics say that we have – as of mid-October 2017 – metadata for almost 10 million publications in Wikidata and over 36 million recorded citations between the described works.

Given that so many bibliographic items have been set up in Wikidata, it may be worth asking whether we actually have a schema for this data. While we surely have a sort-of convention that tools and editors follow, it is not complete and probably up for change.

Here are some Wikicite-related schema issues:

  1. What instance is a scientific article? Most tools use instance of Q13442814, currently “scientific article” in English. But what is this? In English, “scientific” means something different than the usual translation into Danish (“videnskabelig”) or German (“wissenschaftlicher”) – and these words are used in the labels of Q13442814. “Scientific” usually entails only natural science, leaving out social science and the humanities (while “videnskabelig”/“wissenschaftlicher” entail social science and the humanities too). An attempt to fix this problem is to call these articles “scholarly articles”. It is interesting that one of the most used classes in Wikidata – if not the most used class – has a language ambiguity. I see no reason to restrict Q13442814 to only the English sense of science. It is all too difficult to distinguish between scientific disciplines: think of computational humanities.
  2. What about the ontology of scientific works? Currently, Q13442814 is set as a subclass of academic journal article, but this is not how we use it, as conference articles in proceedings are also set to Q13442814. Is a so-called abstract a “scientific article”? “Abstracts” appear, e.g., in neuroimaging conferences, where they are fully referenceable items published in proceedings or supplementary journal issues.
  3. What are the instances of scientific article in Wikidata describing: a work or an edition? What happens if the article is reprinted (it happens to important works)? Should we then create a new item, or amend the old item? If we create a new item, how should we link the two? Should we create a third item as a work item? Should items in preprint archives have their own item? Should that depend on whether the preprint version and the canonical version are more or less the same?
  4. How do we represent the language of an article? There are two generally used properties: original language of work and language of the work. There is a discussion about deleting one of them.
  5. How do we represent an author? Today an author can be linked to the article via the P50 property. However, the author label may differ from the name written in the article (we may refer to this issue as the “Natalie Portman Problem”, as she published a scientific article as “Natalie Hershlag”). P1932 as a qualifier to P50 may be used to capture the way the name is represented in the article – a possible solution (a sketch of querying these qualifiers is shown after this list). Recently, Manske’s author name resolver has started to copy the short author name to the qualifier under P50. For referencing, there is still the problem that the referencing software would likely need to determine the surname, and this is not trivial for authors with suffixes or for Spanish authors with multiple surnames.
  6. How do we record the affiliation of a paper? Publicly funded universities and other research entities would like to make statistics on, for instance, paper production, but this is not possible to do precisely with today’s Wikidata, as papers are usually not affiliated with organizations – only indirectly through the author affiliations. And an author’s affiliation might change as the author moves between institutions. We already noted this problem in the first article we wrote about Scholia.
  7. How do we record the type of scientific publication? There are various subtypes, e.g., systematic review, original article, erratum, “letter”, etc. Or the state of the article: submitted, under review, peer-reviewed, not peer-reviewed. The “genre” and the “instance of” properties can be used, but I have seen no ruling convention.
  8. How do we record which software and which datasets have been used in an article, e.g., for digital preservation? Currently, we are using “uses” (P2283). But should we have dedicated properties, e.g., “uses software“? Do we have a schema for datasets and software?
  9. How do we record the formatting of the title, e.g., casing? Bibliographic reference management software may choose to capitalize some words. In BibTeX you have the possibility to format the title using LaTeX commands. Detailed formatting of titles in Wikidata is currently not done, and I do not believe we have dedicated properties to handle such cases.
  10. How do we manage journals that change titles? For instance, for BMJ we have several items covering the name changes: Q546003, Q15746654, and Q28464921. Is this how we should do it? There is the P156 (followed by) property to connect subsequent versions.
  11. How should we handle series of conference proceedings? A particular article can be “published in” a proceedings, and such a proceedings may be part of a “series” that is a “conference proceedings series“. However, according to my recollection, some (or one?) Wikidata bot may link articles directly as “published in” the conference proceedings series: these series can have ISSNs and look like ordinary scientific journals.
  12. When is an article published? A number of publishers set a formal publication date in the future for an article that is actually published before that formal date. In Wikidata there is, to my knowledge, only a single property for publication date. Preprints yield yet other publication dates.
  13. A minor issue is P820, the arXiv classification. According to the documentation, it should be used as a qualifier to P818, the arXiv identifier property. Embarrassingly, I overlooked that, and the Scholia arXiv extraction program and QuickStatements generator output it as a proper property.
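Regarding point 5, here is a minimal sketch of how the P1932 (“stated as”) qualifier on P50 can be queried with the Wikidata Query Service, using the usual p:/ps:/pq: statement prefixes:

import requests

# Works where the author statement carries a "stated as" (P1932) qualifier
query = """
SELECT ?work ?author ?name_as_stated WHERE {
  ?work p:P50 ?author_statement .
  ?author_statement ps:P50 ?author ;
                    pq:P1932 ?name_as_stated .
}
LIMIT 10
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
for row in response.json()['results']['bindings']:
    print(row['name_as_stated']['value'])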

Update:

Do we have a schema for datasets and software? Well, yes, Virginia. For software, Katherine Thornton & Co. have produced Modeling the Domain of Digital Preservation in Wikidata.