
What does “toxicity” mean?

There is now a range of people using “toxic” and “toxicity” in the context of messages on social media. I have had a problem with these words because I lacked a clear definition of the concept behind them. What is a toxic social media post? Negative sentiment, rudeness, harassment, cyberbullying, trolling, rumour spreading, false news, heated arguments and possibly more may be mixed together.

The English Wiktionary currently lists “Severely negative or harmful” as a gloss for a figurative sense of the word “toxic”.

For social media, a 2015 post in relation to League of Legends, “Doing Something About the ‘Impossible Problem’ of Abuse in Online Games”, mentions “toxicity” along with “online harassment”. They “classified online citizens from negative to positive”, apparently based on language ranging from “trash talk” to “non-extreme but still generally offensive language”. What precisely “trash talk” means in the context of social media is not clear to me. The English Wikipedia describes “Trash-talk” in the context of sports. A related term, “Smack talk”, is defined for Internet behaviour.

There are now a few scholarly papers using the wording.

For instance, “Detecting Toxicity Triggers in Online Discussions” from September 2019 writes “Detecting these toxicity triggers is vital because online toxicity has a contagious nature” and cites our paper “Good Friends, Bad News – Affect and Virality in Twitter”. I think that this citation has some issues. First of all, we do not use the word “toxicity” in our paper. Earlier in their paper the authors seem to equate toxicity with rudeness and harassment, but our paper did not specifically look at that. Our paper focused particularly on “newsness” and sentiment score. A simplified conclusion would be that negative news is more viral. News articles are rarely rude or harassing.

Julian Risch and Ralf Krestel in “Toxic Comment Detection in Online Discussions” write: “A toxic comment is defined as a rude, disrespectful, or unreasonable comment that is likely to make other users leave a discussion”. This phrase seems to originate from the Kaggle competition “Toxic Comment Classification Challenge” from 2018: “…negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion)”. The aspects to be classified in the competition were “threats, obscenity, insults, and identity-based hate”.

Risch and Krestel are the first I have run into with a good discussion of the aspects of what they call toxicity. They seem to be inspired by the work on Wikipedia, citing Ellery Wulczyn et al.’s “Ex Machina: Personal Attacks Seen at Scale”. Wulczyn’s work goes back to 2016 with the Detox research project. This research project may have been spawned by an entry in the 2015 wishlist in the Wikimedia community. “Algorithms and insults: Scaling up our understanding of harassment on Wikipedia” is a blog post on the research project.

The Wulczyn paper describes the construction of a corpus of comments from the article and user talk pages of the English Wikipedia. The labelling described in the paper focuses on “personal attack or harassment”. The authors define a “toxicity level” quantity as the number of personal attacks by a user (in the particular year examined). Why “personal attack level” is not used instead of the word “toxicity” is not clear to me.

It is interesting that the Kaggle competition defines “toxicity” via the likelihood that a comment would “make other users leave a discussion”. I would usually think that heated discussions tend to attract people to the discussion, at least on “discussion social media” such as Reddit, Twitter and Facebook, though I suppose this is an open question. I do not recall seeing any study modelling the relationship between user retention and personal attacks or obscene language.

The paper “Convolutional Neural Networks for Toxic Comment Classification” from 2018 cites a Pew report by Maeve Duggan, “Online Harassment“, in the context “Text arising from online interactive communication hides many hazards such as fake news, online harassment and toxicity”. If you look up the Pew report, the words “fake news” and “toxic” hardly appear (the latter only in a quoted user comment about “toxic masculinity”).

Google’s Perspective API can analyze a text and give back a “toxicity” score.
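
As an illustration, a request might look like the minimal Python sketch below. It assumes the v1alpha1 comments:analyze endpoint with a requested TOXICITY attribute; the API key is a placeholder and the example comment is made up.

# Minimal sketch of requesting a toxicity score from the Perspective API.
# Assumes the v1alpha1 comments:analyze endpoint; the API key is a placeholder.
import requests

API_KEY = 'YOUR_API_KEY'
URL = ('https://commentanalyzer.googleapis.com/v1alpha1/'
       'comments:analyze?key=' + API_KEY)
payload = {
    'comment': {'text': 'You are a complete idiot.'},
    'languages': ['en'],
    'requestedAttributes': {'TOXICITY': {}},
}
response = requests.post(URL, json=payload)
response.raise_for_status()
score = response.json()['attributeScores']['TOXICITY']['summaryScore']['value']
print(score)  # a value between 0 and 1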

The current English Wikipedia article on “toxicity” only describes the chemical sense of the word. The “toxic” disambiguation page has three relevant links: “toxic leader”, “toxic masculinity” and “toxic workplace”.

It still seems to me that “toxicity” and “toxic” are too fuzzy words to be used in serious contexts without a proper definition. It is also not clear to me whether, e.g., the expression of strong negative sentiment, which could potentially be classified as “toxic”, necessarily affects the productivity and health of a community negatively. The 2015 harassment survey from the Wikimedia Foundation examined “Effects of experiencing harassment on participation levels” (Figure 47), and at least here the effect on participation levels in the Wikimedia projects seems to be seriously negative. The word “toxic” was apparently not used in the survey itself, though among the respondents’ example ideas for improvements is listed: “Scoring the toxicity of users and watching toxic users’ actions in a community tool like the anti-vandal software.”

NeurIPS in Wikidata

Co-authors in the NeurIPS 2019 conference based on data in Wikidata. Screenshot based on Scholia at https://tools.wmflabs.org/scholia/event/Q61582928.

The machine learning and neuroinformatics conference NeurIPS 2019 (NIPS 2019) takes place in the middle of December 2019. The conference series has always had a high standing and has grown considerably in reputation in recent years.

All papers from the conference are available online at papers.nips.cc. There is – to my knowledge – little structured metadata associated with the papers, though the website is consistently formatted in HTML, so metadata can relatively easily be scraped. There are no consistent identifiers that I know of that identify the papers on the site: no DOI, no ORCID iD or anything else. A few papers may be indexed here and there on third-party sites.
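
A scraping sketch could look roughly like the following. The proceedings URL and the assumption that papers appear as links containing “/paper/” are my guesses about the page layout; the actual Scholia module does more than this.

# Rough sketch of scraping paper titles and URLs from papers.nips.cc.
# The proceedings URL and the '/paper/' link pattern are assumptions about
# the page layout, not a description of the actual Scholia code.
import requests
from bs4 import BeautifulSoup

PROCEEDINGS_URL = ('https://papers.nips.cc/book/'
                   'advances-in-neural-information-processing-systems-32-2019')

soup = BeautifulSoup(requests.get(PROCEEDINGS_URL).text, 'html.parser')
papers = [
    {'title': link.text.strip(),
     'url': 'https://papers.nips.cc' + link['href']}
    for link in soup.find_all('a', href=True)
    if '/paper/' in link['href']
]
print(len(papers), 'papers found')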

In the Scholia Python package, I have made a module for scraping the papers.nips.cc website and converting the metadata to the Quickstatements format for Magnus Manske’s web application that submits the data to Wikidata. The entry of the basic metadata about the papers from NeurIPS is more or less complete, though a check is needed to see whether everything has been entered. One issue the Python code attempts to handle is the case where a scraped paper has already been entered in Wikidata. Given that there are no identifiers, the matching is somewhat heuristic.
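
To give an idea of the target format, the V1 Quickstatements commands for a single paper could look roughly as below. The property choices (P31 scholarly article, P1476 title, P1433 published in, P577 publication date, P2093 author name string with a P1545 series ordinal) reflect common practice for scientific articles in Wikidata; the function itself is only an illustration, not the actual Scholia code.

# Illustrative conversion of scraped metadata to Quickstatements V1 commands.
# This is a sketch of the idea, not the actual Scholia module.

def paper_to_quickstatements(title, authors, proceedings_qid, date):
    """Return Quickstatements V1 commands creating an item for one paper."""
    lines = [
        'CREATE',
        'LAST\tP31\tQ13442814',                       # instance of: scholarly article
        'LAST\tP1476\ten:"{}"'.format(title),         # title
        'LAST\tP1433\t{}'.format(proceedings_qid),    # published in
        'LAST\tP577\t+{}T00:00:00Z/11'.format(date),  # publication date, day precision
    ]
    for ordinal, author in enumerate(authors, start=1):
        # Authors are entered as name strings (P2093) with a series ordinal;
        # resolving them to items is left to the Author Disambiguator.
        lines.append('LAST\tP2093\t"{}"\tP1545\t"{}"'.format(author, ordinal))
    return '\n'.join(lines)

print(paper_to_quickstatements(
    'An example NeurIPS paper title',
    ['Alice Example', 'Bob Example'],
    'Q68600639',  # NeurIPS 2019 proceedings item, as used elsewhere in this post
    '2019-12-01',
))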

Authors have separate webpages on papers.nips.cc with listings of the papers they have published at the conference. These pages are quite well curated, though I have discovered authors with several associated webpages: the Bayesian Carl Edward Rasmussen appears under http://papers.nips.cc/author/carl-edward-rasmussen-1254, https://papers.nips.cc/author/carl-e-rasmussen-2143 and http://papers.nips.cc/author/carl-edward-rasmussen-6617. Joshua B. Tenenbaum is also split.

Authors are not resolved with the code from Scholia; they are just represented as strings. The Author Disambiguator tool that Arthur P. Smith has built from a tool by Magnus Manske can semi-automatically resolve authors, i.e., associate the author of a paper with a specific Wikidata item representing a human. The Scholia website has particular pages (“missing”) that make contextual links to the Author Disambiguator. For the NeurIPS 2019 proceedings the links can be seen at https://tools.wmflabs.org/scholia/venue/Q68600639/missing. There are currently over 1,400 authors that need to be resolved. Some of these are not easy: multiple authors may share the same name, e.g., some European names such as Andreas Krause, and I have difficulty knowing how unique East Asian names are. So far only 50 authors from the NeurIPS conference have been resolved.
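
One way to get a rough overview of how much resolution remains is to query the Wikidata Query Service for author name strings (P2093) versus resolved authors (P50) on papers in the proceedings item (Q68600639). The query below is my own approximation and not what the Scholia “missing” page actually runs.

# Sketch of counting resolved (P50) versus unresolved (P2093, author name
# string) authorships on papers in the NeurIPS 2019 proceedings. This is my
# own approximation, not the query behind the Scholia "missing" page.
import requests

query = '''
SELECT (COUNT(DISTINCT ?name) AS ?unresolved) (COUNT(DISTINCT ?author) AS ?resolved)
WHERE {
  ?paper wdt:P1433 wd:Q68600639 .
  OPTIONAL { ?paper wdt:P2093 ?name . }
  OPTIONAL { ?paper wdt:P50 ?author . }
}
'''
response = requests.get(
    'https://query.wikidata.org/sparql',
    params={'query': query, 'format': 'json'},
    headers={'User-Agent': 'neurips-author-check/0.1 (example)'},
)
print(response.json()['results']['bindings'])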

There is no citation information when the data is first entered with the Scholia and Quickstatement tools. There are currently no means to automatically enter that information. NeurIPS proceedings are – as far as I know – not available through CrossRef.

Since there is little editorial control of the format of the references, they come in various shapes and may need “interpretation”. For instance, “Semi-Supervised Learning in Gigantic Image Collections” claims a citation to “[3] Y. Bengio, O. Delalleau, N. L. Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. In NIPS, pages 2197–2219, 2004.” But that is unlikely to be a NIPS paper, and the reference should probably go to Neural Computation.

The ontology of Wikidata for annotating what papers are about is not necessarily good. Some concepts in the cognitive sciences, including psychology, machine learning and neuroscience, become split or merged. Reinforcement learning is an example where the English Wikipedia article focuses on the machine learning aspect, while Wikidata also tags neuroscience-oriented articles with the concept. For many papers I find it difficult to link to the ontology, as the topic of the paper is so specialized that it is difficult to identify an appropriate Wikidata item.

With the data in Wikidata, it is possible to see many aspects of the data with the Wikidata Query Service and Scholia. For instance,

  1. Who has the most papers at NeurIPS 2019? A panel on a Scholia page readily shows this to be Sergey Levine, Francis Bach, Pieter Abbeel and Yoshua Bengio (a sketch of the kind of query behind such a panel is shown after this list).
  2. The heuristically computed topic scores on the event page for NeurIPS 2019 show that reinforcement learning, generative adversarial network, deep learning, machine learning and meta-learning are central topics this year. (Here one needs to keep in mind that the annotation in Wikidata is incomplete.)
  3. Which Danish researcher has been listed as an author on the most NeurIPS papers through time? This can be asked with a query to the Wikidata Query Service: https://w.wiki/DTp. It depends upon what is meant by “Danish”; here it is based on employment/affiliation and gives Carl Edward Rasmussen, Ole Winther, Ricardo Henao, Yevgeny Seldin, Lars Kai Hansen and Anders Krogh.
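
The kind of query behind such a per-author panel could look roughly like the following; it can be sent to the Wikidata Query Service with the same request pattern as in the earlier author-count sketch. Only resolved authors (P50) are counted, so the result depends on how far the author disambiguation has come.

# Rough sketch of a papers-per-author query for the NeurIPS 2019 proceedings
# (Q68600639), counting only resolved authors (P50). Send it to the Wikidata
# Query Service as in the earlier sketch.
query = '''
SELECT ?author ?authorLabel (COUNT(?paper) AS ?papers)
WHERE {
  ?paper wdt:P1433 wd:Q68600639 ;
         wdt:P50 ?author .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?author ?authorLabel
ORDER BY DESC(?papers)
LIMIT 10
'''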

Grammatik over det Danske Sprog

Grammatik over det Danske Sprog (GDS) is a three-volume book series attempting to describe the grammar of the Danish language. The first edition was published in 2011, while the second is from 2019. Here are a few notes about the work:

  1. It is unclear to me if there is any difference between the first and second editions. I found no changes in the pages, though I suppose minor changes might have occurred here and there.
  2. If one had thought that it would be straightforward to develop a computer-based grammar checker from the work, then one is wrong. It seems that the book has not been written with computational linguistics in mind. But I should think that the many examples in the book can be used to generate data for training and evaluating computer-based systems.
  3. Interestingly, certain location adverbs are regarded as adverbs with inflection (page 216). Words such as ned, nedad and nede I would regard as independent lexemes, while GDS regards them as inflected forms based on telicity and dynamics/statics. In Wikidata, I have created three lexemes rather than one lexeme with three forms. To me, nedad is a compound of the two words “ned” and “ad”.
  4. Word stems are regarded as a feature of the form rather than of the lexeme (page 165), so that the stem of the adjective smukke is not smuk, but smukk!
  5. A word such as begynde is regarded as a cranberry word with gynde as the cranberry morpheme (page 255). Den Store Danske points to the Middle German beginnen instead. If we take GDS’s approach, then begynde should be said to be composed of the be- prefix and the gynde cranberry morpheme.
  6. From GDS and other works on Danish grammar, I have not yet come to a clear understanding of why certain linguistic elements are regarded as prefixes and when they are regarded as words in compounding. For instance, an- in ankomme is regarded as a prefix in GDS (page 256), but an is also an independent word and can go together with komme (“komme an”).
  7. The concept of “nexual” and “innexual” nouns (Danish: “neksuale/inneksuale substantiver”) is described, but it is not clear to me how words for agents (painter, human, animal) or words such as home or landscape should be annotated with respect to the concept.
  8. Lexemes for cardinal and ordinal numbers are called “kardinaltal” and “ordinaltal”. In Wikidata, I invented the words kardinaltalord and ordinaltalord to distinguish between cardinal numbers and the words that represent cardinal numbers.
  9. There are hardly any inline references. In many cases I am uncertain whether the claims presented are widely established knowledge among linguists or the authors’ sole expert opinion, which may or may not be contested.

Roberta’s +5-fine workshop on natural language processing

Interacting Minds Centre (Scholia) at Aarhus University (Scholia) held a finely organized workshop, NLP workshop @IMC, Fall 2019, in November 2019, where I gave a talk titled Detecting the odd-one-out among Danish words and phrases with word embeddings.

Fritz Günther (Scholia) keynoted from his publication Vector-Space Models of Semantic Representation From a Cognitive Perspective: A Discussion of Common Misconceptions (Scholia). A question is whether the distributional semantic models/vector models we identify from computing on large corpora make sense with respect to cognitive theories: “Although DSMs might be valuable for engineering word meanings, this does not automatically qualify them as plausible psychological models”.

Two Årups displayed their work on SENTIDA “a new tool for sentiment analysis in Danish”. In its current form it is an R package. It has been described in the article SENTIDA: A New Tool for Sentiment Analysis in Danish (Scholia). According to their evaluation, SENTIDA beats my AFINN tool for Danish sentiment analysis.

From our paper Combining embedding methods for a word intrusion task.

My own talk on Detecting the odd-one-out among Danish words and phrases with word embeddings was based on the distributional semantics representation evaluation work together with Lars Kai Hansen (Scholia): our 2017 paper Open semantic analysis: The case of word level semantics in Danish (Scholia) and our newer 2019 paper Combining embedding methods for a word intrusion task (Scholia). The idea is to look at a Danish textual odd-one-out task/word intrusion task and see what models trained on various corpora can do. Our current state of the art is a combination of embedding models, with fastText as the primary one and Wembedder used for proper nouns.
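
The core of the odd-one-out computation can be sketched in a few lines: the intruder is taken to be the word whose embedding has the lowest average cosine similarity to the other words in the set. The three-dimensional toy vectors below are made up purely for illustration; in the papers we use real embedding models such as fastText and Wembedder.

# Toy sketch of the odd-one-out computation: the intruder is the word whose
# vector has the lowest mean cosine similarity to the other words. The
# three-dimensional vectors are made up for illustration only.
import numpy as np

embeddings = {
    'hund': np.array([0.9, 0.1, 0.0]),   # dog
    'kat': np.array([0.8, 0.2, 0.1]),    # cat
    'hest': np.array([0.7, 0.3, 0.0]),   # horse
    'cykel': np.array([0.1, 0.1, 0.9]),  # bicycle, the intended intruder
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def odd_one_out(words, embeddings):
    """Return the word least similar, on average, to the other words."""
    def mean_similarity(word):
        others = [other for other in words if other != word]
        return np.mean([cosine(embeddings[word], embeddings[other])
                        for other in others])
    return min(words, key=mean_similarity)

print(odd_one_out(list(embeddings), embeddings))  # 'cykel'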

Two Aarhus students, Jan Kostkan and Malte Lau Petersen (Scholia), are downloading European Parliament text data and analyzing it. A text corpus from Folketinget, the Danish Parliament, may become available with tens of millions of sentences.

Ulf Berthelsen, with whom I share the Teaching platform for developing and automatically tracking early stage literacy skill (Scholia) research project, spoke on late-stage literacy skills.

Natalie Schluter (Scholia) spoke on glass ceiling effects in the natural language processing field. She has an associated paper The glass ceiling in NLP (Scholia) from EMNLP 2018.

Matthew Wilkens (Scholia) spoke on “Geography and gender in 20.000 British novels”, a large-scale analysis of how geography is used in British novels. This fell much in alignment with some work I did a few years ago on geographically mapping narrative locations of Danish literature with the Littar website and the paper Literature, Geolocation and Wikidata (Scholia) from the Wiki Workshop 2016.

Screenshot from Littar: Narrative locations of Danish literature.

There were a number of other contributions at the workshop.

The second day of the workshop featured hands-on text analysis with, among others, Rebekah Baglini and Matthew Wilkens getting participants to work on prepared Google Colab notebooks.

Coming Scholia, WikiCite, Wikidata and Wikipedia sessions

In the coming months I will give three different talks on Scholia, WikiCite, Wikidata and Wikipedia et al.:

  • 3 October 2018 in DGI-byen, Copenhagen, Denmark as part of the Visuals and Analytics that Matter conference, the concluding conference for the DEFF-sponsored project Research Output & Impact Analyzed and Visualized (ROIAV).
  • 7 November 2018 in Mannheim as part of the Linked Open Citation Database (LOC-DB) 2018 workshop.
  • 13 December 2018 at the library of the Technical University of Denmark as part of Wikipedia – a media for sharing knowledge and research, an event for researchers and students (and still in the planning phase).

In September I presented Scholia as part of the Workshop on Open Citations. The slides, titled Scholia as of September 2018, are available here.

A viewpoint on a viewpoint on Wikipedia’s neutral point of view

I recently looked into what we have of Wikipedia research from Denmark and discovered several papers that I did not know about. I have now added some to Wikidata, so that Scholia can show a list of them.

Among the papers was one by Jens-Erik Mai titled Wikipedian’s knowledge and moral duties. Starting from the English Wikipedia’s Neutral Point of View (NPOV) policy, he stresses a dichotomy between the subjective and the objective and argues for a rewrite of the policy. Mai claims the policy has an absolutistic center and a relativistic edge, corresponding to an absolutistic majority view and relativistic minority views.

As a long-time Wikipedia editor, I find Mai’s exposition too theoretical. I lack good exemplifications: cases where the NPOV fails. I also cannot see in what concrete way the NPOV policy should be changed to accommodate Mai’s critique. I am not sure that Wikipedians distinguish so much between the objective and the subjective; the key dichotomy is verifiability versus non-verifiability, that the statements in Wikipedia are supported by reliable sources. In terms of center and edge, I came to think of events associated with conspiracy theories. Here the “center” view could be the conventional view, while the conspiracy views are the edge. It is difficult for me to accommodate a standpoint that conspiracy theories should be accepted as equal to the conventional view. It is neither clear to me that the center is uncontested and uncontroversial. Wikipedia – like a newspaper – has the ability to represent opposing viewpoints. This is done by attributing a viewpoint to the reliable sources that express it. For instance, central in the description of the evaluation of films are quotations from reviews in major newspapers and by notable reviewers.

I do not see the support for the claim that the NPOV policy assumes a “politically dangerous ethical position”. On the contrary, Wikipedia has now – after the rise of fake news – been called the “last bastion”. The example given in The Atlantic post is the recent social media fuss with respect to Sarah Jeong, where Wikipedians worked towards “shared facts about reality.”

Addressing “addressing age-related bias in sentiment analysis”

Algorithmic bias is one of the hot research topics at the moment. There are observations of trained machine learning models that display sexism. For instance, the paper “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” (Scholia entry) neatly shows one example in its title of bias in word embeddings, shallow machine learning models trained on a large corpus of text.
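
The analogy test alluded to in that title can be reproduced with, for instance, Gensim and pretrained word2vec vectors; the sketch below only illustrates the method, and the exact token names and nearest neighbours depend on the model vocabulary.

# Sketch of the analogy test from the paper title, using Gensim and the
# pretrained Google News word2vec vectors (a large download). The token
# names and the returned neighbours depend on the model vocabulary.
import gensim.downloader

model = gensim.downloader.load('word2vec-google-news-300')

# "man" is to "computer_programmer" as "woman" is to ...?
print(model.most_similar(positive=['woman', 'computer_programmer'],
                         negative=['man'], topn=5))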

A recent report investigated ageism bias in a range of sentiment analysis methods, including my AFINN word list: “Addressing age-related bias in sentiment analysis” (Scholia entry). The researchers scraped sentences from blog posts, extracted those sentences with the word “old” and excluded the sentences where the word did not refer to the age of a person. They then replaced “old” with the word “young” (apparently “older” and “oldest” were also considered somehow). The example sentences they ended up with were, e.g., “It also upsets me when I realize that society expects this from old people” and “It also upsets me when I realize that society expects this from young people”. These sentences (242 in total) were submitted to 15 sentiment analysis tools, and statistics were computed “using multinomial log-linear regressions (via the R package nnet […])”.

I was happy to see that my AFINN was the only one in Table 4 surviving the test for all regression coefficients being non-significant. However, Table 5 with implicit age analysis showed some bias in my word list.

But after a bit of thought I wondered how there could be any kind of bias in my word list. The paper lists an exponentiated intercept coefficient of 0.733 with a 95% confidence interval from 0.468 to 1.149 for AFINN. But if I examine what my afinn Python package reports about the words “old”, “older”, “oldest”, “young”, “younger” and “youngest”, I get all zeros, i.e., these words are not scored as either positive or negative:

>>> from afinn import Afinn
>>> afinn = Afinn()
>>> afinn.score('old')
0.0
>>> afinn.score('older')
0.0
>>> afinn.score('oldest')
0.0
>>> afinn.score('young')
0.0
>>> afinn.score('younger')
0.0
>>> afinn.score('youngest')
0.0

It is thus strange that there can be any form of bias – even a non-significant one. For instance, for the two example sentences “It also upsets me when I realize that society expects this from old people” and “It also upsets me when I realize that society expects this from young people”, my afinn Python package scores both with the sentiment -2. This value comes solely from the word “upsets”. There can be no difference between any of the sentences when the word “old” is exchanged with “young”.
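
Continuing the REPL session above with the two example sentences:

>>> afinn.score('It also upsets me when I realize that society expects this from old people')
-2.0
>>> afinn.score('It also upsets me when I realize that society expects this from young people')
-2.0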

In their implicit analysis of bias, where they use a word embedding, some bias could possibly creep in somewhere with my word list, although it is not clear to me how this happens.

The question is then what happens in the analysis. Does the multinomial log-linear regression give a questionable result? Could it be that I misunderstand a fundamental aspect of the paper? While some data seem to be available here, I cannot identify the specific sentences they used in the analysis.