What does “toxicity” mean?

There is now a range of people using "toxic" and "toxicity" in the context of messages on social media. I have had a problem with these words because I lacked a clear definition of the concept behind them. What is a toxic social media post? Negative sentiment, rudeness, harassment, cyberbullying, trolling, rumour spreading, false news, heated arguments and possibly more may be mixed together.

The English Wiktionary currently lists “Severely negative or harmful” as a gloss for a figurative sense of the word “toxic”.

For social media, a 2015 post about League of Legends, "Doing Something About the 'Impossible Problem' of Abuse in Online Games", mentions "toxicity" along with "online harassment". They "classified online citizens from negative to positive", apparently based on language ranging from "trash talk" to "non-extreme but still generally offensive language". What precisely "trash talk" is in the context of social media is not clear to me. The English Wikipedia describes "Trash-talk" in the context of sports. A related term, "Smack talk", is defined for Internet behaviour.

There are now a few scholarly papers using the wording.

For instance, "Detecting Toxicity Triggers in Online Discussions" from September 2019 writes "Detecting these toxicity triggers is vital because online toxicity has a contagious nature" and cites our paper "Good Friends, Bad News – Affect and Virality in Twitter". I think that this citation has some issues. First of all, we do not use the word "toxicity" in our paper. Earlier in their paper the authors seem to equate toxicity with rudeness and harassment, but our paper did not specifically look at that. Our paper focused particularly on "newsness" and sentiment score. A simplified conclusion would be that negative news is more viral. News articles are rarely rude or harassing.

Julian Risch and Ralf Krestel in "Toxic Comment Detection in Online Discussions" write: "A toxic comment is defined as a rude, disrespectful, or unreasonable comment that is likely to make other users leave a discussion". This phrase seems to originate from the Kaggle competition "Toxic Comment Classification Challenge" from 2018: "…negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion)". The aspects to be classified in the competition were "threats, obscenity, insults, and identity-based hate".
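
In other words, the challenge is a multi-label text classification task. As a rough sketch of the kind of baseline one could build for it (not the competition's own code), here is a TF-IDF representation with one logistic regression per label; the file name and column names are assumptions about a Kaggle-style training set.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical Kaggle-style training file with a text column and one 0/1 column per label.
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
train = pd.read_csv("train.csv")

# Word and bigram TF-IDF features for the comments.
vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X = vectorizer.fit_transform(train["comment_text"])

# One binary classifier per label.
models = {label: LogisticRegression(max_iter=1000).fit(X, train[label])
          for label in labels}

# Score a new comment against each label.
comment = vectorizer.transform(["You are a complete idiot!"])
for label, model in models.items():
    print(label, model.predict_proba(comment)[0, 1])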

Risch and Krestel are the first I have run into with a good discussion of the aspects of what they call toxicity. They seem to be inspired by the work on Wikipedia, citing Ellery Wulczyn et al.'s "Ex Machina: Personal Attacks Seen at Scale". Wulczyn's work goes back to 2016 with the Detox research project. This research project may have been spawned by an entry in the 2015 wishlist of the Wikimedia community. "Algorithms and insults: Scaling up our understanding of harassment on Wikipedia" is a blog post about the research project.

The Wulczyn paper describes the construction of a corpus of comments from the article and user talk pages of the English Wikipedia. The labelling described in the paper focuses on "personal attack or harassment". The authors define a "toxicity level" quantity as the number of personal attacks by a user (in the particular year examined). Why "personal attack level" is not used instead of the word "toxicity" is not clear.

It is interesting that the Kaggle competition defines "toxicity" via the likelihood that a comment would "make other users leave a discussion". I would usually think that heated discussions tend to attract people to the discussion, – at least in "discussion social media" such as Reddit, Twitter and Facebook, though I suppose this is an open question. I do not recall seeing any study modelling the relationship between user retention and personal attacks or obscene language.

The paper "Convolutional Neural Networks for Toxic Comment Classification" from 2018 cites a Maeve Duggan Pew report, "Online Harassment", in the context "Text arising from online interactive communication hides many hazards such as fake news, online harassment and toxicity". If you look up the Pew report, the words "fake news" and "toxic" hardly appear (the latter only in a quoted user comment about "toxic masculinity").

Google’s Perspective API can analyze a text and give back a “toxicity” score.
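
To illustrate, here is a minimal sketch of how one might ask the Perspective API for such a score from Python. It assumes an API key, and the Comment Analyzer endpoint and field names are written down as I recall them from Google's documentation, so treat it as a sketch rather than a definitive recipe.

import requests

API_KEY = "YOUR-API-KEY"  # placeholder; a real key is needed
url = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=" + API_KEY)
data = {
    "comment": {"text": "You are a complete idiot!"},
    "requestedAttributes": {"TOXICITY": {}},
}
response = requests.post(url, json=data)
# The summary score is a probability-like value between 0 and 1.
score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(score)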

The current English Wikipedia article on "toxicity" only describes the chemical sense of the word. The "toxic" disambiguation page has three relevant links: "toxic leader", "toxic masculinity" and "toxic workplace".

It still seems to me that "toxicity" and "toxic" are words too fuzzy to be used in serious contexts without a proper definition. It is also not clear to me whether, e.g., the expression of strong negative sentiment, which could potentially be classified as "toxic", necessarily affects the productivity and health of a community negatively. The 2015 harassment survey from the Wikimedia Foundation examined "Effects of experiencing harassment on participation levels" (Figure 47), and at least here the effect on participation levels in the Wikimedia projects seems to be seriously negative. The word toxic was apparently not used in the survey, though among the example ideas for improvements from the respondents is listed: "Scoring the toxicity of users and watching toxic users' actions in a community tool like the anti-vandal software."

Election of the affiliate-selected members of the Wikimedia Foundation Board

The so-called affiliates, that is, Wikimedia chapters, user groups and thematic groups, can elect two seats on the Wikimedia Foundation (WMF) Board of Trustees. Previously only the chapters have been able to elect members, but from January 2019 the considerable number of user groups also gets a say. As I understand it, this is to get a broader footing, perhaps especially from what are termed "emerging communities".

The two current affiliate-selected members are former chair Christophe Henner from France and Ukrainian Nataliia Tymkiv. The communities elect three board members. These members are James Heilman, Canada, Dariusz Jemielniak, Poland, and Spanish María Sefidari, who is currently chair. Compared to the affiliate-selected seats there seems to be a sense that the community-selected members come from large communities: the English Wikipedia and the Spanish Wikipedia. That does not quite hold for the Polish-elected Jemielniak, who has, however, made a name for himself with an English-language book.

The affiliate election will take place quickly during the spring of 2019, with first a nomination period and then the actual election. A handful of Wikimedians act as facilitators for the election. These facilitators cannot be nominated at the same time, but if they step down from the facilitator role they may run. My impression is that the two current members are running again.

Wikimedia Danmark will take part in the vote, and the question is then whom we should vote for and which criteria we should use. Henner and Tymkiv seem fine and do have experience. To what degree they have the ability to bang the table and come up with original, viable visions is less clear to me. Among others who may be nominated is Shani Evenstein. She also seems fine.

Beyond the formal requirement of being fit to serve on a board, a candidate should have solid board experience, an understanding of the Wikimedia movement and be a reasonably accessible face in the international Wikimedia community. They should furthermore be prepared to put in a good number of unpaid working hours at odd times of the day, and be aware that they work for WMF, – not for the affiliates, the community or Wikipedia. If one looks at the composition of the WMF board, Europe and North America are well represented, though no one is from Northern Europe. There is a physician (James Heilman), academics, founder Jimmy Wales, one person with finance experience (Tanya Capuano) and various other backgrounds. Henner seems to be the only one with technical experience (something I would value), and one can add that Latin America (though Sefidari does speak Spanish), Africa and East Asia lack representation (Esra'a Al Shafei has roots in Bahrain).

The vote is coordinated on Meta at Affiliate-selected Board seats. There is guidance for voters in the Primer for user groups. The Dutch chair Frans Grijzenhout has uploaded a handy score matrix for the candidates. The nomination has its own page as well. Nominations are open until 30 April 2019. After the nominations are in, there is a short period in April and a bit of May to question the nominees.

Loose questions for Wikimedia Strategy 2030

Wikimedia is trying to think long-term and lay out a strategy aiming at the year 2030. A draft is available from https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2017/Direction

Here are some loose questions that might get people to think things over:

  1. Why should we have a strategy at all? Should Wikimedia not just develop organically? Can we even predict much about 2030? If we do not already know our strategy, are we not already stuck?
  2. Are we stuck in the wiki interface?
  3. Should we continue with the PHP MediaWiki interface as the primary software?
  4. Why has Wikiversity not grown larger, and why is it non-existent in Danish? Is it because people do not want to work on Wikiversity? Is it because we do not know what Wikiversity is or should be? Is it because the wiki technology does not work in a teaching context? What should we change to make it work?
  5. Why do people not make more video? Is it because it is technically too cumbersome? Is the production side too cumbersome? How could Wikimedia help?
  6. Why is Stack Overflow the primary place for technical questions and answers? Should that not have been Wikimedia?
  7. Should the Wikimedia Foundation accept money from companies such as Google? Could that create a relationship of dependence? In Peter Gøtzsche's opinion, patient organizations are influenced in an unfortunate direction because of their dependence on pharmaceutical companies. Could the Wikimedia movement run into the same problem? Do monetary donations create problems, for example in connection with lobbying on the EU copyright directive?
  8. Why can OpenStreetMap run on a smaller budget? Is it because of a far smaller server load? Should Wikimedia scale down and choose a kind of OpenStreetMap model where the server work is better distributed to others?
  9. "Knowledge equity" is one of the two central concepts in the Wikimedia Foundation's strategy and rather hard to translate. Financial equity is what in Danish is called "egenkapital". A Latin-derived word that comes close can be found in Den Store Danske; otherwise my nearest thought is the obsolete term "billighed", – "ret og billighed" as it is called in a Danish song. We can hardly use such a word. What should we understand by "knowledge equity" in Danish?
  10. Could Wikimedia end up in a situation like the one seen in the Cochrane Collaboration, where the professionalized part of the organization outmanoeuvres the grassroots? What do we do to keep that from happening?
  11. Should we be proud that the Danish Wikipedia has largely been built for free? The last time I asked about the Wikimedia strategy on the Danish Wikipedia's village pump (Landsbybrønden), this was mentioned.
  12. "Knowledge as a service" follows an as-a-service pattern seen in computer science, where it may be called platform-as-a-service or software-as-a-service. What should we actually read into it? I have myself created Scholia, a website that shows scientific data from Wikidata via SPARQL queries to the Wikidata Query Service, and Ordia, which does the same for lexicographic data. As such, the idea of knowledge as a service suits me well, – and I have, in vain, tried to recall whether I was among those who suggested the concept at an international Wikimedia meeting in 2017.
  13. Should Wikimedia engage in activism, as was seen around the vote on the EU's new copyright directive? Do we have any success stories showing that it helps?
  14. Wikimedia Danmark has received money from the Wikimedia Foundation for, among other things, a roll-up banner. It has been used in a few settings and has, I believe, appeared on TV. Is this how the Wikimedia Foundation should spend its money?
  15. The visual editor seems to help many new users, but is editing Wikipedia on a smartphone not very cumbersome? Can anything be done about that at all?
  16. Should the Wikimedia Foundation support researchers who build tools or study phenomena on Wikimedia's wikis?
  17. Normally Wikipedia is fast, but on a slow network you find that it can be frustrating to work with, for example, Wikidata. Is it not frustrating to work with the wikis from countries that do not have fast Internet? What can be done about it?
  18. Linux is developed with a distributed model, and so are many other software systems. Why are Wikipedia and the other Wikimedia wikis not distributed, with easy forks and pull requests?
  19. How much of the Wikimedia Foundation's raised funds should be spent on events such as Wikimania?

Open questions for the EU copyright directive

I am wondering if there are any good sources on the scope and effect of the directive. I was interviewed by a Danish radio channel and I must admit that it was difficult for me to say much in that respect.

The proposal for the directive mentions "not-for-profit online encyclopedia" and makes an exception for it. To me it is clear that the lawmakers have had Wikipedia in mind, – and thanks for that. But there are several issues:

  1. Would Wikipedia be characterized as not-for-profit when the typical license is Creative Commons without a non-commercial clause?
  2. Would Wikimedia Commons fall under the "not-for-profit online encyclopedia" exception? Some of my photos are used on commercial online news sites, which makes at least Wikimedia Commons commercial in some sense. I wouldn't characterize Wikimedia Commons as an encyclopedia, but rather as a media archive.
  3. What is the scope with respect to the other Wikimedia sites: Wikiquote, Wikibooks, Wikiversity, Wikisource, Wikivoyage and possibly others? It seems to me that yet again there is an issue, – as I would not characterize them as encyclopedias.
  4. What other sites are likely to be hit by either Article 15 or Article 17? For instance, Wikia, Referata, Soundcloud, Reddit, Bandcamp, WordPress, 500px.com? Referata, I imagine, is under 10 million Euro in turnover but more than 3 years old, and thus hit? Reddit would be hit by both articles? Soundcloud by Article 17? (Back in June 2018, WordPress noted their concern: https://transparency.automattic.com/2018/06/12/were-against-bots-filtering-and-the-eus-new-copyright-directive/ and Reddit published an article this Wednesday: https://redditblog.com/2019/03/20/error-copyright-not-detected-what-eu-redditors-can-expect-to-see-today-and-why-it-matters/)

Are we able to say something about the possible outcomes we would see if the directive proposal is approved, for instance:

  1. Large 10+ million Euro companies, particularly Google through their ownership of YouTube, regularly paying rights organizations to address Article 17?
  2. Large parts of YouTube not being available to Europeans?
  3. Twitter and Facebook stop showing snippets from linked sites?
  4. European newspapers paying Twitter and Facebook to display snippets in Twitter and Facebook?
  5. Twitter and Facebook paying European newspapers to allow the display of snippets?
  6. Websites such as Soundcloud needing to implement advanced copyright detection systems for audio?
  7. Some American Web 2.0 companies blacklisting access from Europe?
  8. Widespread implementation of identity verifications in Web 2.0 systems?
  9. Widespread implementation of plagiarism-like detection on Web 2.0 platforms where users may not be able to upload content, even if it is legal?
  10. Google using Article 17 against Facebook wrt. freebooting? See https://www.youtube.com/watch?v=t7tA3NNKF0Q (via YourWeirdEx@reddit)
  11. Small Internet forum owners needing to subscribe to the services of upload filter service providers?
  12. Google News shutting down in Europe? See https://www.theguardian.com/technology/2018/nov/18/google-news-may-shut-over-eu-plans-to-charge-tax-for-links

Scholia is more than scholarly profiles

Scholia, a website originally started as a service to show scholarly profiles from data in Wikidata, is actually not just for scholarly data.

Scholia can also show bibliographic information for “literary” authors and journalists.

An example that I have begun on Wikidata is the Danish writer Johannes V. Jensen, whose works pose a very interesting test case for Wikidata, because the interrelation between the works and editions can be quite complicated, e.g., newspaper articles being merged into a poem that is then published in an edition that is then expanded and reprinted… The scholarly and journalistic work about Johannes V. Jensen can also be recorded in Wikidata. Scholia currently lists 30 entries about Johannes V. Jensen, – and that does not necessarily include works about works written by Johannes V. Jensen.

An example of a bibliography of a journalist is that of Kim Wall. Her works almost always address quite unique topics, – fairly relevant as sources in Wikipedia articles. Examples include an article on a special modern Chinese wedding tradition, Fairy Tale Romances, Real and Staged, and an article on furries, It's not about sex, it's about identity: why furries are unique among fan cultures.

An interesting feature of most of Wall's articles is that she lets the interviewee have the final word by adding a quotation as the very last paragraph. That is also the case for the two examples linked above. I suppose that says something about Wall's generous journalistic approach.

Code for love: algorithmic dating

One of the innovative Danish TV channels, DR3, has a history of dating programs, with Gift ved første blik as, I believe, the initial program: a program with – literally – an arranged marriage between two participants matched by what were supposed to be relationship experts. Exported internationally as Married at First Sight, the stability of the marriages has been low, as very few of the couples have stayed together, – if one is to trust the information on the English Wikipedia.

Now my colleagues at DTU Compute have been involved in a new program called Koden til kærlighed (the code for love). Contrary to Gift ved første blik, the participants are not going to get married during the program, but will live together for a month, – and, as perhaps the most interesting part, the matches are determined by a learning algorithm: if you view the streamed version of the first episode you will have the delight of seeing glimpses of data-mining Python code with NumPy (note the intermixed camel case and underscores :).

The program seems to have been filmed with smartphone cameras for the most part. The participants are four couples of white heterosexual millennials. So far we have seen their expectations and initial first encounters, – so we are far from knowing whether my colleagues have done a good job with the algorithmic matching.

According to the program, the producers and the Technical University of Denmark have collected information from 1,400 persons in "well-functioning" relationships. There must have been pairs among the 1,400, so that the data scientists could train the algorithm using pairs as positive examples and persons that are not pairs as negative examples. The 350 new singles signed up for the program can then be matched with the trained algorithm. And four couples of – I suppose – the top-ranking matches were selected for the program.
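
The program does not reveal the actual model, but here is a rough sketch of one way such pairwise matching could be set up: featurize a pair of questionnaire answers (here by their absolute differences, so the model can learn whether similar or dissimilar answers predict a match), train a classifier with real couples as positive examples and random re-pairings as negative examples, and then score all possible pairs among the singles. The random data and the details below are stand-ins, not the production setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n_couples, n_questions, n_singles = 667, 104, 350   # numbers mentioned around the program

# Stand-in questionnaire answers for each partner in the known couples.
answers_a = rng.integers(0, 5, (n_couples, n_questions))
answers_b = rng.integers(0, 5, (n_couples, n_questions))

def pair_features(a, b):
    # Absolute answer differences: lets the model learn homophilic or heterophilic patterns.
    return np.abs(a - b)

# Positive examples: real couples. Negative examples: randomly re-paired persons.
X = np.vstack([pair_features(answers_a, answers_b),
               pair_features(answers_a, answers_b[rng.permutation(n_couples)])])
y = np.concatenate([np.ones(n_couples), np.zeros(n_couples)])
model = LogisticRegression(max_iter=1000).fit(X, y)

# Score every possible pair among the singles and keep the best one.
singles = rng.integers(0, 5, (n_singles, n_questions))
best_score, best_pair = -1.0, None
for i in range(n_singles):
    for j in range(i + 1, n_singles):
        score = model.predict_proba(pair_features(singles[i:i + 1], singles[j:j + 1]))[0, 1]
        if score > best_score:
            best_score, best_pair = score, (i, j)
print(best_pair, best_score)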

Our professor Jan Larsen was involved in the program and explained a bit more about the setup on the radio. The collected input was based on responses to 104 questions from 667 couples (apparently not quite 1,400 persons). Important questions may have been related to sleep and education.

It will be interesting to follow the development of the couples. There are 8 episodes in this season. It would have been nice with more technical background: What are the questions? How exactly is the match determined? How is the importance of the questions determined? Have the producers done any "editing" of the relationships? (For instance, why are all participants in the age range 20-25 years?) When people match, how do their answers match: are the answers homophilic or heterophilic? During the program there are glimpses of questions that might have been used. Some examples are "Do you have a tv-set?", "Which supermarket do you use?" and "How many relationships have you ended?". It is a question whether a question such as "Do you have a tv-set?" is of any use. 667 couples compared to 104 questions is not that much to train a model on, and one should think that less relevant questions could confuse the algorithm more than they would help.

“Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.”

From Peter Brodersen I hear that the Danish government's budget for next year allocates funds to Dansk Sprognævn for the release of Retskrivningsordbogen – the official Danish spelling dictionary.

It is mentioned briefly in an announcement from the Ministry of Culture: "Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen." ("And from 2018 onwards, 0.5 million DKK is earmarked for Dansk Sprognævn to free Retskrivningsordbogen."), i.e., 500,000 DKK allocated for the release of the dataset.

It is not clear under which conditions it will be released. An announcement from Dansk Sprognævn writes "til sprogteknologiske formål" (for language technology purposes). I trust it is not just for natural language processing purposes, – but for every purpose!?

If it is to be used in free software/databases then a CC0 or better license is a good idea. We are still waiting for Wikidata for Wiktionary, the as-yet vaporware multilingual, collaborative and structured dictionary. That resource will be CC0-based. The "old" Wiktionary has surprisingly not been used that much by natural language processing researchers, perhaps because of the anarchistic structure of Wiktionary. Wikidata for Wiktionary could hopefully help us with structuring lexical data and improve the size and the utility of lexical information. With Retskrivningsordbogen as CC0 it could be imported into Wikidata for Wiktionary and extended with multilingual links and semantic markup.

Female GitHubbers

In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There aren't many. Currently just 27.

The Python code below gets the SPARQL results into a Pandas DataFrame, queries the GitHub API for the follower count and adds that information as a dataframe column. Then we can rank the female GitHub users according to follower count and format the results as an HTML table.

Code

import re
from time import sleep

import requests
import pandas as pd

# Query Wikidata for women (P21 = Q6581072) with a GitHub username (P2037).
query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("https://github.com/", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
researchers = pd.json_normalize(response.json()['results']['bindings'])

# Look up the follower count for each GitHub username via the GitHub API.
URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    # Skip usernames with unexpected characters rather than sending them to the API.
    if not re.match('^[a-zA-Z0-9-]+$', github):
        followers.append(0)
        continue
    url = URL + github
    try:
        response = requests.get(url,
                                headers={'Accept': 'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except Exception:
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))
    sleep(5)    # be gentle with the unauthenticated GitHub API rate limit

researchers['followers'] = followers

# Rank by follower count and format as an HTML table.
columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(by='followers', ascending=False)[columns]
      .to_html(index=False))

Results

The top one is Jennifer Bryan, a Vancouver statistician whom I do not know much about, but she seems to be involved with RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wiki world: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

I was surprised to see that Isis Agora Lovecruft is not there, but there is no Wikidata item representing her. She would have been number three.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners”. Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers github.value researcherLabel.value researcher.value
1675 jennybc Jennifer Bryan http://www.wikidata.org/entity/Q40579104
1299 jesstess Jessica McKellar http://www.wikidata.org/entity/Q19667922
475 triketora Tracy Chou http://www.wikidata.org/entity/Q24238925
347 olgabot Olga B. Botvinnik http://www.wikidata.org/entity/Q44163048
124 vsoch Vanessa V. Sochat http://www.wikidata.org/entity/Q30133235
84 brainwane Sumana Harihareswara http://www.wikidata.org/entity/Q18912181
75 lydiapintscher Lydia Pintscher http://www.wikidata.org/entity/Q18016466
56 agbeltran Alejandra González-Beltrán http://www.wikidata.org/entity/Q27824575
22 frimelle Lucie-Aimée Kaffee http://www.wikidata.org/entity/Q37860261
21 isabelleaugenstein Isabelle Augenstein http://www.wikidata.org/entity/Q30338957
20 cnap Courtney Napoles http://www.wikidata.org/entity/Q42797251
15 tudorache Tania Tudorache http://www.wikidata.org/entity/Q29053249
13 vedina Nina Jeliazkova http://www.wikidata.org/entity/Q27061849
11 mkutmon Martina Summer-Kutmon http://www.wikidata.org/entity/Q27987764
7 caoyler Catalina Wilmers http://www.wikidata.org/entity/Q38915853
7 esterpantaleo Ester Pantaleo http://www.wikidata.org/entity/Q28949490
6 NuriaQueralt Núria Queralt Rosinach http://www.wikidata.org/entity/Q29644228
2 rongwangnu Rong Wang http://www.wikidata.org/entity/Q35178434
2 lschiff Lisa Schiff http://www.wikidata.org/entity/Q38916007
1 SigridK Sigrid Klerke http://www.wikidata.org/entity/Q28152723
1 amrapalijz Amrapali Zaveri http://www.wikidata.org/entity/Q34315853
1 mesbahs Sepideh Mesbah http://www.wikidata.org/entity/Q30098458
1 ChristineChichester Christine Chichester http://www.wikidata.org/entity/Q19845665
1 BinaryStars Shima Dastgheib http://www.wikidata.org/entity/Q42091042
1 mollymking Molly M. King http://www.wikidata.org/entity/Q40705344
0 jannahastings Janna Hastings http://www.wikidata.org/entity/Q27902110
0 nmjakobsen Nina Munkholt Jakobsen http://www.wikidata.org/entity/Q38674430

“Overzealous business types”?

The University of Copenhagen and its problematic dismissal of the notable scientist Hans Thybo have now landed in a Nature editorial: "Corporate culture spreads to Scandinavia". Its concluding claim is that "the threat is the colonization of universities by overzealous business types" (against academic freedom).

Interestingly, though the majority of the university board members are required by law to be from outside the university (not necessarily from business), the university management usually has an academic background. And this is also the case for the management around Hans Thybo:

  1. The head of department for Hans Thybo is Claus Beier, see "Hans Thybos institutleder om fyringssagen". Beier holds a PhD and is a professor with a long series of publications on climate change, as can be seen on Google Scholar.
  2. The dean is John Renner Hansen, see "KU spildte ½ million på konsulentundersøgelse af Thybo for misbrug af forskningsmidler". He is also a researcher and claims to have "Approximately 600 publications in international refereed journals".
  3. The head of the university is Ralf Hemmingsen, whom I know as a notable researcher in psychiatry.

I am not convinced by the argument in the Nature editorial, which sets up "business types" against academics. I think the case should rather be seen against the background of the Milena Penkowa case and another story about the possible abuse of research funds at the Copenhagen University Hospital, see "Ny sag om fusk med penge til forskning".

Guess which occupation is NOT the most frequent among persons from the Panama Papers

POLITICIAN! Occupation as politician is not very frequent among people in the Panama Papers. This may come as a surprise to those who have studied the bubble chart I put in a post on my blog. A sizeable portion of blog readers, tweeters and probably also Facebook users seem to have seriously misunderstood it. The crucial problem with the chart is that it is made from data in Wikidata, which only contains a very limited selection of the persons from the Panama Papers. Let me tell you some background and detail the problem:

  1. Open Knowledge Foundation Danmark hosted a two-hour meetup in Cafe Nutid, organized by Niels Erik Kaaber Rasmussen, the day after the release of the Panama Papers. We were around 10 data nerds sitting with our laptops, and with the provided links most, if not all, of us started downloading the Panama Papers data files with the names and company information. Some tried installing the Neo4j database, which may help with querying the data.
  2. I originally spent most of my time at the cafe looking through the data by simple means. I used something like "egrep -i denmark" on the officers.csv file. This quick command will likely pull out most of the Danish people in the released Panama Papers. The result of the command is a small, manageable list of no more than 70 listings. Among the names I recognized NO politician, neither Danish nor international.
  3. The Danish broadcasting company DR has had priority access to the data. It is likely that they have examined the more complete data in detail. It is also likely that if there had been a Danish politician in the Panama Papers, DR would have focused on that, breaking the story. NO such story came. Thus I think it is unlikely that there are any Danish politicians in the more complete Panama Papers dataset.
  4. Among the Danish listings in the officers.csv file from the released Panama Papers we found a couple of recognizable names. Among them was the name Knud Foldschack. Already on Monday, the day of the release, a Danish news site had run a story about that name. One Knud Foldschack is a lawyer who has involved himself in cases for left-wing causes. Having such a lawyer mentioned in the Panama Papers was a too-good-to-be-true media story, – and it was. It turned out that Knud Foldschack had no less than both a father and a brother with the same name, and the news site may now look forward to meeting one of the Foldschacks in court, as he wants compensation for being wrongly smeared. His brother seems to be some sort of businessman. René Bruun Lauritsen is another name within the Danish part of the Panama Papers. A person bearing that name has had unfavourable mentions in the Danish media. One of the stories concerned his scheme of selling semen to women in need of a pregnancy. His unauthorized handling of semen with hand delivery got him a bit of a sentence. Another scheme involved outrageous stock trading. Whether Panama-Lauritsen is the same as Semen-Lauritsen I do not know, but one would be disappointed if such an unethical businessman were not in the Panama Papers. A third listing shares a fairly unique name with a Danish artist. To my knowledge the Danish media have not run any story on that name. But the overall conclusion from the small sample investigated is that politicians are not present, while the names may be related to business persons and possibly an artist.
  5. Wikidata is a site in the Wikipedia family of sites. Though not well known, Wikidata is one of the most interesting projects related to Wikipedia and, in terms of main namespace pages, far larger than the English Wikipedia. Wikidata may be characterized as the structured cousin of Wikipedia. Rather than editing in free-form natural language as you do in Wikipedia, in Wikidata you only edit in predefined fields. Several thousand types of fields exist. To describe a person you may use fields such as date of birth, occupation, authority identifiers such as VIAF, homepage and sex/gender.
  6. So what is in Wikidata? Items corresponding to almost all Wikipedia articles appear in Wikidata – not just the articles in the English Wikipedia, but those of every language version of Wikipedia. Apart from these items, which can be linked to Wikipedia articles, Wikidata also has a considerable number of other items. For instance, one Dutch user has created items for a great number of paintings in the National Gallery of Denmark, – paintings which for the most part have no Wikipedia article in any language. Although Wikidata records an impressive number of items, it does not record everything. The number of persons in Wikidata is only 3,276,363 at the time of writing, and it rarely includes persons who haven't made their mark in the media. The typical listing in the Panama Papers is a relatively unknown man. He is unlikely to appear in Wikidata. And no one adds such a person just because s/he is listed in the Panama Papers. Obviously Wikidata has an extraordinary bias towards famous persons: politicians, nobility, sports people, artists, performers of any kind, etc.
  7. Items for persons in Wikidata who also appear in the Panama Papers can indicate a link to the Panama Papers. There is no dedicated way to do this, but the 'key event' property has been used for that. It is apparently the noted Wikimedian Gerard Meijssen who has made most of these edits. How complete it is with respect to the persons in Wikidata I do not know, but Meijssen also added two Danish football players who I believe were only mentioned in the Danish media. He could have relied on the English Wikipedia, which had an overview of people listed in the Panama Papers.
  8. When we have data in Wikidata, there are various ways to query and present them. One way is to use wiki whizkid Magnus Manske's Listeria service with a query on any Wikipedia. Manske's tool automagically builds a table with the information. Wikimedia Danmark chairman Ole Palnatoke Andersen had apparently discovered Meijssen's work on Wikidata, and Palnatoke used Manske's tool to make a table of all people in Wikidata marked with the 'key event' "Panama Papers". It only generates a fairly small list, as not that many people in Wikidata are actually linked to the Panama Papers. Palnatoke also let Manske's tool show the occupation of each person.
  9. Back to the Open Knowledge Foundation meeting in Copenhagen Tuesday evening: I was a bit disappointed at not being able to data mine any useful information from the Panama Papers dataset. So after becoming aware of Palnatoke's table I grabbed (stole) his query statement and modified it to count the number of occupations. The Wikimedia Foundation – the organization that hosts Wikipedia and Wikidata – has set up a so-called SPARQL endpoint and an associated graphical interface. It allows any web user to make powerful queries across all of Wikidata's many millions of statements, including the limited number of statements about the Panama Papers. The service is under continuous development and has in the past been somewhat unstable, but it is nevertheless a very interesting service. Frontend developer Jonas Kress has in 2016 implemented several ways to display the query result. Initially there was just a plain table view, but it now features results on a map – if any geocoordinates are included in the query result – and a bubble chart if there is any numerical data in the query result. Other forms of output implemented later are timelines, multiview and networks. Making a bubble chart with counts of occupations with the SPARQL service is nothing more than a couple of lines of commands in the SPARQL language and a push on the "Run" button (a sketch of such a query is included after this list). So the Panama Papers occupation bubble chart should rather be seen as a demonstration of the capabilities of Wikidata and its associated services for quick queries and visualizations than as a faithful representation of the occupations of people mentioned in the released Panama Papers.
  10. A sizeable portion of people misunderstood the plot and regarded it as evidence of the dark deeds of politicians. Rather than a good understanding of the technical details of Wikidata, people used their preconceived opinions about politicians to interpret the bubble chart. They were helped along the way by the, in my opinion, misleading title ("Panama Papers bubble chart shows politicians are most mentioned in document leak database") and the incomplete explanation in an article in The Independent. On the other hand, Le Monde had a good critical article.
  11. I believe my own blog, where I published the plot, was not to blame. It does include the SPARQL command, so any knowledgeable person can see and modify the results himself/herself. Perhaps some people were confused by my blog describing me as a researcher, – and thought that this was a research result on the Panama Papers.
  12. My blog has in its several years of existence had 20,000 views. The single post with the Panama Papers bubble chart yielded a 10-fold increase in the number of views over the course of a few days, – my first experience with a viral post. Most referrals were from Facebook. The referrals do not indicate which page on Facebook they come from, so it is impossible to join the discussion and clarify any misunderstanding. A portion of referrals also came from Twitter and Reddit, where I joined the discussion. I also tried to engage the social media users who used the WordPress comment feature on my blog. On Reddit I felt there was a good response, while Facebook I felt was irresponsible: Facebook boosts misconceptions and does not let me join the discussion and engage to correct them.

    [Figure: The plot of a viral post: views on my blog around the time of the Panama Papers bubble chart publication.]
  13. Is there anything I could have done? I could have erased my two tweets and modified my blog post, introducing a warning with a stronger explanation.
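
For reference, here is a sketch of the kind of query and counting behind the bubble chart, in the same style as the SPARQL-from-Python code elsewhere on this blog. The identifiers are given from memory (Q23702848 for the Panama Papers item, P793 for the 'key event'/significant event property, P106 for occupation) and should be double-checked against Wikidata.

import pandas as pd
import requests

# Count occupations for persons whose 'key event' is the Panama Papers.
query = """
SELECT ?occupationLabel (COUNT(?person) AS ?count) WHERE {
  ?person wdt:P793 wd:Q23702848 .
  ?person wdt:P106 ?occupation .
  ?occupation rdfs:label ?occupationLabel .
  FILTER(LANG(?occupationLabel) = "en")
}
GROUP BY ?occupationLabel
ORDER BY DESC(?count)
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
counts = pd.json_normalize(response.json()['results']['bindings'])
print(counts[['occupationLabel.value', 'count.value']])

Pasting the same SELECT statement into https://query.wikidata.org and choosing the bubble chart view is all it takes to reproduce the kind of plot discussed above.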

Summing up my experience with the release of the Panama Papers and the subsequent viral post, I find that our politicians turn out not to be corrupt and not to deal with shady companies – except in a few cases. Rather, it seems that loads of people have preconceived opinions about their politicians and are willing to spread their ill-founded beliefs to the rest of the world. They have little technical understanding and do not question data provenance. The problems may be amplified by Facebook.

And here is the now infamous plot:

[Figure: The Panama Papers occupation bubble chart.]