In the Responsible Business in the Blogosphere project I have, by the sweat of my brow, created a sentiment lexicon of 2477 English words (including a few phrases), each labeled with a sentiment strength and targeted at sentiment analysis of short texts of the kind one finds in social media. It was constructed with the help of word lists maintained by Steve DeRose (Steven J. DeRose) and Greg Siegle.
We have used my word list for sentiment analysis on Twitter in a few studies, the most notable so far being Good Friends, Bad News – Affect and Virality in Twitter. However, we have not been quite sure how well it performs compared to other sentiment lexicons such as ANEW. I have included a number of words frequently used on the Internet that I have not found in ANEW: obscene words and Internet slang acronyms such as LOL (laughing out loud). So do these extra words make my word list better? On the other hand, ANEW was constructed by having multiple persons rate each word and should be much better validated than my list. So maybe that list is better?
In a simple comparison between ANEW and my list I looked at the correlation between the sentiment strengths (valences) that the two lists assign to the words they share. I have previously written about that issue. Such an analysis doesn't really answer how well they work for sentiment analysis.
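In code, that word-level comparison amounts to correlating the valences over the shared words. Here is a minimal sketch, assuming both lists have been put into two-column, tab-separated files beforehand (ANEW is actually distributed in a richer format); the file names are placeholders:

```python
import numpy as np

def load_word_list(filename):
    """Read a two-column, tab-separated word<TAB>valence file into a dict."""
    with open(filename) as f:
        return {word: float(valence) for word, valence
                in (line.rstrip('\n').split('\t') for line in f)}

# My list uses integer scores from -5 to +5; ANEW reports mean valence
# ratings on a 1-9 scale, so only the correlation between the two scales
# is meaningful, not the absolute values.
my_list = load_word_list('AFINN-111.txt')
anew = load_word_list('anew.txt')

# Correlate valences over the words the two lists have in common.
common = sorted(set(my_list) & set(anew))
r = np.corrcoef([my_list[w] for w in common],
                [anew[w] for w in common])[0, 1]
print('Valence correlation over %d shared words: %.2f' % (len(common), r))
```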
A few weeks ago Sune Lehmann mentioned that in their study they had tweets labeled for sentiment strength via Amazon Mechanical Turk (AMT). Their study was the Twittermood study (or "Pulse of the Nation" study) that was much mentioned in the media, e.g., in New Scientist and Scientific American. See also their YouTube video.
Alan Mislove had obtained 1,000 AMT-labeled tweets, each rated by 10 AMT workers on a scale from 1 to 9. Through Sune I got hold of the Mislove data.
With the Mislove data I have now made a more careful study of the performance of the different word lists, written up in the position paper A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. The version on our departmental homepage has the code listing.
When I measured the performance of my application of the word lists with a correlation coefficient (between the AMT "ground truth" and my predictions of the sentiment of each tweet), I found that my list and ANEW were well ahead of the word lists in General Inquirer and OpinionFinder. To be fair to the latter two word lists, I should say that I did not utilize all the information they record for each word, only the sentiment polarity. My list was slightly ahead of ANEW. Whether this difference is statistically significant I don't know, as I didn't get around to performing a statistical test.
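To illustrate the kind of computation involved (this is only a sketch, not the code listing from the paper): score each tweet as the sum of the valences of its matching words and correlate those scores with the mean of the ten AMT ratings. The file name and the example tweets below are placeholders for the actual data:

```python
import re
import numpy as np

# Load the word list: one tab-separated "word<TAB>valence" pair per line.
with open('AFINN-111.txt') as f:
    valences = {word: int(valence) for word, valence
                in (line.rstrip('\n').split('\t') for line in f)}

def tweet_score(text):
    """Sum the valences of the words occurring in the tweet.
    Note: phrases in the list are ignored by this simple tokenization."""
    words = re.findall(r"\w[\w'-]*", text.lower())
    return sum(valences.get(word, 0) for word in words)

# Made-up stand-ins for the Mislove data: (tweet text, ten 1-9 AMT ratings).
tweets = [
    ('What a wonderful, sunny day :)', [8, 9, 8, 7, 9, 8, 8, 9, 7, 8]),
    ('This is a disaster and I hate it', [1, 2, 1, 2, 1, 1, 2, 1, 2, 1]),
]

predictions = [tweet_score(text) for text, ratings in tweets]
ground_truth = [np.mean(ratings) for text, ratings in tweets]
print('Correlation: %.2f' % np.corrcoef(predictions, ground_truth)[0, 1])
```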
I also tried the SentiStrength Web service sentiment analyzer on the 1,000 Mislove tweets. This is not just a simple word list but a program that also handles, e.g., emoticons, negations and spelling. This Web service turned out to be the best, slightly ahead of my list and ANEW.
I have now distributed my 2477-word list from our department server (the zip file link). During the course of the evaluation I found a few embarrassing mistakes in my previous list: I thought it had 1480 words, but it turned out that only 1468 were unique! Words were also sometimes out of alphabetical order. The new list should have no such problems.
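For what it is worth, both mistakes are easy to guard against with a couple of assertions over the distributed file (file name assumed):

```python
# Check that the list has no duplicate words and is alphabetically sorted.
words = [line.split('\t')[0] for line in open('AFINN-111.txt')]
assert len(words) == len(set(words)), 'duplicate entries in the list'
assert words == sorted(words), 'words out of alphabetical order'
```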
(Typo correction: 2011-03-17)
(Update 2015-08-25: If you are a Python programmer you might want to take a look at my afinn Python package available here: https://github.com/fnielsen/afinn)
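A minimal usage example with that package, assuming it has been installed (e.g., with pip install afinn):

```python
from afinn import Afinn

# Score a sentence as the sum of the valences of its matching words.
afinn = Afinn()
print(afinn.score('This is utterly excellent!'))  # 3.0
```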