What does “toxicity” mean?

A range of people now use “toxic” and “toxicity” in the context of messages on social media. I have had a problem with these words because I lacked a clear definition of the concept behind them. What is a toxic social media post? Negative sentiment, rudeness, harassment, cyberbullying, trolling, rumour spreading, false news, heated arguments and possibly more may be mixed together.

The English Wiktionary currently lists “Severely negative or harmful” as a gloss for a figurative sense of the word “toxic”.

For social media, a 2015 post in relation to League of Legends, “Doing Something About the ‘Impossible Problem’ of Abuse in Online Games”, mentions “toxicity” along with “online harassment”. The authors “classified online citizens from negative to positive”, apparently based on language ranging from “trash talk” to “non-extreme but still generally offensive language”. What precisely “trash talk” is in the context of social media is not clear to me. The English Wikipedia describes “Trash-talk” in the context of sports. A related term, “Smack talk”, is defined for Internet behaviour.

There are now a few scholarly papers using the wording.

For instance, “Detecting Toxicity Triggers in Online Discussions” from September 2019 writes “Detecting these toxicity triggers is vital because online toxicity has a contagious nature” and cites our paper “Good Friends, Bad News – Affect and Virality in Twitter” for this. I think that this citation has some issues. First of all, we do not use the word “toxicity” in our paper. Earlier in their paper the authors seem to equate toxicity with rudeness and harassment, but our paper did not specifically look at that. Our paper focused particularly on “newsness” and sentiment score. A simplified conclusion would be that negative news is more viral. News articles are rarely rude or harassing.

Julian Risch and Ralf Krestel in “Toxic Comment Detection in Online Discussions” write: “A toxic comment is defined as a rude, disrespectful, or unreasonable comment that is likely to make other users leave a discussion”. This phrase seems to originate from the Kaggle competition “Toxic Comment Classification Challenge” from 2018: “…negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion)”. The categories to be classified in the competition were “threats, obscenity, insults, and identity-based hate”.

Risch and Krestel are the first I have come across with a good discussion of the aspects of what they call toxicity. They seem to be inspired by the work on Wikipedia, citing Ellery Wulczyn et al.’s “Ex Machina: Personal Attacks Seen at Scale”. Wulczyn’s work goes back to 2016 with the Detox research project. This research project may have been spawned by an entry in the 2015 wishlist in the Wikimedia community. “Algorithms and insults: Scaling up our understanding of harassment on Wikipedia” is a blog post on the research project.

The Wulczyn paper describes the construction of a corpus of comments from the article and user talk pages of the English Wikipedia. The labelling described in the paper focuses on “personal attack or harassment”. The authors define a “toxicity level” quantity as the number of personal attacks by a user (in the particular year examined). Why “personal attack level” is not used instead of the word “toxicity” is not clear.
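To make the “toxicity level” definition concrete, here is a minimal Python sketch of how such a count could be computed. The comment records, the field layout and the function name `toxicity_levels` are my own assumptions for illustration; this is not the code or the data format of the Wulczyn paper.

```python
from collections import Counter

# Hypothetical labelled comments: (user, year, is_personal_attack).
# The labels stand in for the "personal attack or harassment" annotation.
comments = [
    ("alice", 2015, True),
    ("alice", 2015, False),
    ("alice", 2015, True),
    ("bob", 2015, False),
    ("carol", 2015, True),
    ("carol", 2014, True),
]


def toxicity_levels(labelled_comments, year):
    """Count the personal attacks made by each user within one year."""
    levels = Counter()
    for user, comment_year, is_attack in labelled_comments:
        if comment_year == year and is_attack:
            levels[user] += 1
    return levels


levels_2015 = toxicity_levels(comments, 2015)
print(dict(levels_2015))
```

Seen this way the quantity is simply a per-user count of personal attacks, which is why “personal attack level” would seem the more descriptive name.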

It is interesting that the Kaggle competition defines “toxicity” via how likely a comment is to “make someone leave a discussion”. I would usually think that heated discussions tend to attract people to the discussion, at least in “discussion social media” such as Reddit, Twitter and Facebook, though I suppose this is an open question. I do not recall seeing any study modelling the relationship between user retention and personal attacks or obscene language.

The paper “Convolutional Neural Networks for Toxic Comment Classification” from 2018 cites a Pew report by Maeve Duggan, “Online Harassment”, in the context “Text arising from online interactive communication hides many hazards such as fake news, online harassment and toxicity”. If you look up the Pew report, the words “fake news” and “toxic” hardly appear (only in a quoted user comment mentioning “toxic masculinity”).

Google’s Perspective API can analyze a text and give back a “toxicity” score.

The current English Wikipedia article on “toxicity” only describes the chemical sense of the word. The “toxic” disambiguation page has 3 relevant links: “toxic leader”, “toxic masculinity” and “toxic workplace”.

It still seems to me that “toxicity” and “toxic” are too fuzzy to be used in serious contexts without a proper definition. It is also not clear to me whether, e.g., the expression of strong negative sentiment, which could potentially be classified as “toxic”, necessarily affects the productivity and health of the community negatively. The 2015 harassment survey from the Wikimedia Foundation examined “Effects of experiencing harassment on participation levels” (Figure 47), and at least here the effect on participation in Wikimedia projects seems to be seriously negative. The word toxic was apparently not used in the survey itself, though the example ideas for improvements from the respondents include: “Scoring the toxicity of users and watching toxic users’ actions in a community tool like the anti-vandal software.”

Election of affiliate-selected members to the Wikimedia Foundation board

The so-called affiliates, that is Wikimedia chapters, User groups and Thematic groups, have the opportunity to select two seats on the Wikimedia Foundation’s (WMF) Board of Trustees. Previously only chapters could select members, but from January 2019 the considerable number of User groups also gain influence. As I understand it, this is to obtain a broader base, perhaps especially among what are termed “emerging communities”.

The two current affiliate-selected members are former chair Christophe Henner from France and the Ukrainian Nataliia Tymkiv. The communities elect three board members. These members are James Heilman, Canada, Dariusz Jemielniak, Poland, and the Spanish María Sefidari, who is currently chair. Compared with the affiliate-selected members, there seems to be a sense that the community-elected members come from large communities: English Wikipedia, Spanish Wikipedia. That does not quite hold for the Polish Jemielniak, who has however made a name for himself with an English-language book.

The affiliates election will take place quickly during the spring of 2019, first with a period of nominations and then the actual vote. A handful of Wikimedians act as facilitators for the election. These facilitators cannot be nominees at the same time, but if they step down from the facilitator role they may stand. My impression is that the two current members are standing again.

Wikimedia Danmark is to take part in the vote, and the question is then whom we should vote for and which criteria we should use. Henner and Tymkiv seem fine and do have experience. To what degree they are able to bang the table and come up with original, viable visions is less clear to me. Among others who may be nominated is Shani Evenstein. She also seems fine.

A person who stands should, beyond the formal requirement of eligibility for board service, have substantial board experience, an understanding of the Wikimedia movement, and be a reasonably accessible face in the international Wikimedia environment. In addition, the person should be prepared to put in a good portion of unpaid working hours at odd times of the day, and be aware that one works for WMF, not for affiliates, the community or Wikipedia. Looking at the composition of the WMF board, Europe and North America are well represented, though with no one from Northern Europe. There is a physician (James Heilman), academics, the founder Jimmy Wales, one with financial experience (Tanya Capuano) and various other backgrounds. Henner seems to be the only one with technical experience (an element I would value), and beyond that one can say that representation is missing from Latin America (although Sefidari does speak Spanish), Africa and East Asia (Esra’a Al Shafei has roots in Bahrain).

The vote is coordinated on Meta at Affiliates-selected Board seats. There is guidance for voters at Primer for user groups. The Dutch chair Frans Grijzenhout has uploaded a handy score matrix for the candidates. The nomination also has its own page. Nominations are open until 30 April 2019. After the nominations have come in, there is a short time in April and a bit of May to question the nominees.

Airy questions for Wikimedia Strategy 2030

Wikimedia is trying to think long-term and lay out a strategy that aims at the year 2030. A draft is available from https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2017/Direction

Here are some airy questions that might get people to think things over:

  1. Why should we have a strategy? Should Wikimedia not just develop organically? Can we predict much at all out to 2030? If we do not already know our strategy, are we not already stuck?
  2. Are we stuck in the wiki interface?
  3. Should we continue with the PHP MediaWiki interface as the primary software?
  4. Why has Wikiversity not grown larger, and why does it not exist at all in Danish? Is it because people cannot be bothered to make Wikiversity? Is it because we do not know what Wikiversity is or should be? Is it because the wiki technology does not work in a teaching context? What should we change to make it work?
  5. Why do people not make more videos? Is it because it is technically too cumbersome? Is it too cumbersome production-wise? How could Wikimedia help?
  6. Why is Stack Overflow the primary place for technical questions and answers? Should that not have been Wikimedia?
  7. Should the Wikimedia Foundation receive money from companies such as Google? Could that create a relationship of dependence? In Peter Gøtzsche’s opinion, patient associations are influenced in an unfortunate direction because of their dependence on pharmaceutical companies. Can the Wikimedia movement run into the same problem? Do monetary donations create problems, for example in connection with lobbying on the EU copyright directive?
  8. Why can OpenStreetMap run on a smaller budget? Is it due to a far lower server load? Should Wikimedia scale down and choose a kind of OpenStreetMap model where the server work is better distributed to others?
  9. “Knowledge equity” is one of two central concepts in the Wikimedia Foundation’s strategy and somewhat hard to translate into Danish. Financial equity is what is termed “egenkapital” in Danish. A Latin word that comes close is found in Den Store Danske; otherwise my nearest thought is the obsolete expression “billighed”, as in “ret og billighed” in a Danish song. Such a word we can hardly use. What can we understand in Danish by “knowledge equity”?
  10. Can Wikimedia end up in a situation like the one seen at the Cochrane Collaboration, where the professionalized part of the organization comes to outmanoeuvre the grassroots? What do we do so that it does not happen?
  11. Should we be proud that the Danish Wikipedia has largely been built for free? The last time I asked about Wikimedia strategy at the Danish Wikipedia’s Landsbybrønd (village pump), that was mentioned.
  12. Knowledge as a service follows an as-a-service pattern seen in computer science. There it may be called platform-as-a-service or software-as-a-service. What should we actually read into it? I myself have created Scholia, a website that shows scientific data from Wikidata via SPARQL queries to the Wikidata Query Service, and Ordia, which does the same for lexicographic data. As such, thoughts about knowledge as a service fit in well, and I have also tried in vain to recall whether I was among those who proposed the concept at an international Wikimedia meeting in 2017.
  13. Should Wikimedia engage in activism, as was seen around the vote on the EU’s new copyright directive? Do we have any success stories showing that it helps?
  14. Wikimedia Danmark has received money from the Wikimedia Foundation for, among other things, a roll-up banner. It has been used in a few contexts and has probably been on TV. Is this how the Wikimedia Foundation should spend its money?
  15. The visual editor seems able to help many new users, but is editing Wikipedia on a smartphone not very cumbersome? Can anything be done about it at all?
  16. Should the Wikimedia Foundation support researchers who build tools or study phenomena on Wikimedia’s wikis?
  17. Normally Wikipedia works fast, but on a slow network it can be frustrating to work with, for example, Wikidata. Is it not frustrating to work with the wikis from countries that do not have fast Internet? What can be done about it?
  18. Linux is developed with a distributed model, and so are many other software systems. Why are Wikipedia and other Wikimedia wikis not distributed, so that forks and pull requests are easy?
  19. How much of the Wikimedia Foundation’s collected funds should be spent on events such as Wikimania?

Open questions for the EU copyright directive

I am wondering if there are any good sources on the scope and effect of the directive. I was interviewed by a Danish radio channel and I must admit that it was difficult for me to say much in that respect.

The proposal for the directive mentions “not-for-profit online encyclopedia” and makes an exception for it. To me it is clear that the lawmakers have had Wikipedia in mind, and thanks for that. But there are several issues:

  1. Would Wikipedia be characterized as not-for-profit when the typical license is a Creative Commons license with no non-commercial clause?
  2. Would Wikimedia Commons fall under the “not-for-profit online encyclopedia”? Some of my photos are used on commercial online news sites, which makes at least Wikimedia Commons commercial in some sense. I wouldn’t characterize Wikimedia Commons as an encyclopedia, but rather as a media archive.
  3. What is the scope with respect to other Wikimedia sites: Wikiquote, Wikibooks, Wikiversity, Wikisource, Wikivoyage and possibly others? It seems to me that yet again there is an issue, as I would not characterize them as encyclopedias.
  4. What other sites are likely to be hit by either Article 15 or Article 17? For instance, Wikia, Referata, Soundcloud, Reddit, Bandcamp, WordPress, 500px.com? Referata, I imagine, is under 10 million euros but more than 3 years old and so hit? Reddit would be hit by both articles? Soundcloud by Article 17? (Back in June 2018, WordPress noted their concern: https://transparency.automattic.com/2018/06/12/were-against-bots-filtering-and-the-eus-new-copyright-directive/ Reddit has this Wednesday published an article: https://redditblog.com/2019/03/20/error-copyright-not-detected-what-eu-redditors-can-expect-to-see-today-and-why-it-matters/)

Are we able to say something about the possible outcomes we would see if the directive proposal is approved, for instance:

  1. Large companies above 10 million euros, particularly Google through its ownership of YouTube, regularly paying rights organizations to address Article 17?
  2. Large parts of YouTube not being available to Europeans?
  3. Twitter and Facebook stop showing snippets from linked sites?
  4. European newspapers paying Twitter and Facebook to display snippets in Twitter and Facebook?
  5. Twitter and Facebook paying European newspapers to allow the display of snippets?
  6. Websites such as Soundcloud needing to implement advanced copyright detection systems for audio?
  7. Some American Web 2.0 companies blacklisting access from Europe?
  8. Widespread implementation of identity verifications in Web 2.0 systems?
  9. Widespread implementation of plagiarism-like detection on Web 2.0 platforms where users may not be able to upload content, even if it is legal?
  10. Google using Article 17 against Facebook wrt. freebooting? See https://www.youtube.com/watch?v=t7tA3NNKF0Q (via YourWeirdEx@reddit)
  11. Small Internet forum owners needing to subscribe to the services of upload filter service providers?
  12. Google News shutting down in Europe? See https://www.theguardian.com/technology/2018/nov/18/google-news-may-shut-over-eu-plans-to-charge-tax-for-links

Scholia is more than scholarly profiles

Scholia, a website originally started as a service to show scholarly profiles from data in Wikidata, is actually not just for scholarly data.

Scholia can also show bibliographic information for “literary” authors and journalists.

An example that I have begun on Wikidata is the Danish writer Johannes V. Jensen, whose works pose a very interesting test case for Wikidata, because the interrelation between the works and editions can be quite complicated, e.g., newspaper articles being merged into a poem that is then published in an edition that is then expanded and reprinted… The scholarly and journalistic work about Johannes V. Jensen can also be recorded in Wikidata. Scholia currently records 30 entries about Johannes V. Jensen, and that does not necessarily include works about works written by Johannes V. Jensen.

An example of a bibliography of a journalist is that of Kim Wall. Her works almost always address very unique topics, which makes them fairly relevant as sources in Wikipedia articles. Examples include an article on a special modern Chinese wedding tradition, “Fairy Tale Romances, Real and Staged”, and an article on furries, “It’s not about sex, it’s about identity: why furries are unique among fan cultures”.

An interesting feature of most of Wall’s articles is that she lets the interviewee have the final word by adding a quotation as the very final paragraph. That is also the case with the two examples mentioned above. I suppose that says something about Wall’s generous journalistic approach.

Code for love: algorithmic dating

One of the innovative Danish TV channels, DR3, has a history of dating programs with Gift ved første blik as, I believe, the initial program: a program with, literally, an arranged marriage between two participants matched by what were supposed to be relationship experts. It has been exported internationally as Married at First Sight, and the stability of the marriages has been low, as very few of the couples have stayed together, if one is to trust the information on the English Wikipedia.

Now my colleagues at DTU Compute have been involved in a new program called Koden til kærlighed (the code for love). Contrary to Gift ved første blik, the participants are not going to get married during the program, but will live together for a month. Perhaps the most interesting part is that the matches are determined by a learning algorithm: if you view the streamed first episode you will have the delight of seeing glimpses of data mining Python code with Numpy (note the intermixed camelcase and underscore :).

The program seems to have been filmed with smartphone cameras for the most part. The participants are four couples of white heterosexual millennials. So far we have seen their expectations and first encounters, so we are far from knowing whether my colleagues have done a good job with the algorithmic matching.

According to the program, the producers and the Technical University of Denmark have collected information from 1,400 persons in “well-functioning” relationships. There must have been pairs among the 1,400, so the data scientists can train the algorithm using pairs as the positive examples and persons that are not pairs as negative examples. The 350 new singles signed up for the program can then be matched with the trained algorithm. Four couples, I suppose the top-ranking matches, were then selected for the program.
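The setup I am guessing at above could be sketched as follows. This is purely my own assumption about how positive and negative examples might be constructed, not the code from the program: the function name `build_training_pairs`, the use of absolute answer differences as features and the toy answers are all hypothetical.

```python
import itertools
import random


def build_training_pairs(couples, n_negative, seed=0):
    """Return (features, label) examples from a list of known couples.

    couples: list of (answers_a, answers_b), each a list of numeric answers.
    Features are per-question absolute differences (an assumed encoding).
    """
    rng = random.Random(seed)

    def features(a, b):
        return [abs(x - y) for x, y in zip(a, b)]

    # Positive examples: two persons who actually are a couple.
    examples = [(features(a, b), 1) for a, b in couples]

    # Negative examples: a person from one couple paired with a person
    # from a different couple.
    cross = [
        (couples[i][0], couples[j][1])
        for i, j in itertools.permutations(range(len(couples)), 2)
    ]
    for a, b in rng.sample(cross, min(n_negative, len(cross))):
        examples.append((features(a, b), 0))
    return examples


# Toy data: three couples answering three questions.
couples = [
    ([1, 5, 3], [1, 4, 3]),
    ([2, 2, 5], [3, 2, 5]),
    ([4, 1, 1], [4, 1, 2]),
]
examples = build_training_pairs(couples, n_negative=3)
```

A classifier trained on such examples could then score every possible pairing among the new singles, with the highest-scoring pairs proposed as matches. Whether the real system works like this I do not know.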

Our professor Jan Larsen was involved in the program and explained a bit more about the setup on the radio. The collected input was based on responses to 104 questions for 667 couples (apparently not quite 1,400 persons). Important questions may have been related to sleep and education.

It will be interesting to follow the development of the couples. There are 8 episodes in this season. It would have been nice with more technical background: What are the questions? How exactly is the match determined? How is the importance of the questions determined? Have the producers done any “editing” of the relationships? (For instance, why are all participants in the age range 20-25 years?) When people are matched, how do their answers to a question relate: are the answers homophilic or heterophilic? During the program there are glimpses of questions that might have been used. Some examples are “Do you have a tv-set?”, “Which supermarket do you use?” and “How many relationships have you ended?” It is doubtful whether a question such as “Do you have a tv-set?” is of any use. 667 couples are not many for training a model on 104 questions, and one should think that less relevant questions could confuse the algorithm more than they would help.
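One simple way to look at the homophily question, assuming numeric answers, would be to correlate the two partners’ answers for each question across the couples: a positive correlation suggests homophily and a negative one heterophily. The sketch below is only my illustration of that idea; the function names and toy data are hypothetical and have nothing to do with the actual program.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation in the answers, so no information
    return cov / sqrt(var_x * var_y)


def question_homophily(couples):
    """Correlate partner answers per question across all couples."""
    n_questions = len(couples[0][0])
    return [
        pearson([a[q] for a, _ in couples], [b[q] for _, b in couples])
        for q in range(n_questions)
    ]


# Toy data: question 0 is answered alike, question 1 oppositely.
couples = [
    ([1, 1], [1, 4]),
    ([2, 2], [2, 3]),
    ([3, 3], [3, 2]),
    ([4, 4], [4, 1]),
]
scores = question_homophily(couples)
```

Questions with a correlation near zero carry little information about who is a couple, which is one way a question such as “Do you have a tv-set?” could turn out to be of little use.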

“And from 2018 onwards, DKK 0.5 million is earmarked for Dansk Sprognævn to buy Retskrivningsordbogen free.”

From Peter Brodersen I hear that the Danish government’s budget for next year allocates funds to Dansk Sprognævn for the release of Retskrivningsordbogen, the official Danish spelling dictionary.

It is mentioned briefly in an announcement from the Ministry of Culture: “Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.” That is, DKK 500,000 is allocated for the release of the dataset.

It is not clear under which conditions it is released. An announcement from Dansk Sprognævn writes “til sprogteknologiske formål” (for language technology purposes). I trust it is not just for natural language processing purposes, but for every purpose!?

If it is to be used in free software/databases then a CC0 or better license is a good idea. We are still waiting for Wikidata for Wiktionary, the as-yet vaporware for a multilingual, collaborative and structured dictionary. This resource is CC0-based. The “old” Wiktionary has surprisingly not been used that much by natural language processing researchers, perhaps because of the anarchistic structure of Wiktionary. Wikidata for Wiktionary could hopefully help us with structuring lexical data and improve the size and the utility of lexical information. With Retskrivningsordbogen as CC0 it could be imported into Wikidata for Wiktionary and extended with multilingual links and semantic markup.