Month: December 2010

Old ANEW: A sentiment about sentiment analysis and word lists


[Figure: scatterplot of valences, my word list versus ANEW]

Affective Norms for English Words (ANEW) is one affective word list among a number of others. ANEW seems to be regarded as a sort of reference in sentiment analysis research. It records valence, arousal and dominance for 1034 words on a continuous scale between 1 and 9.

A downside with ANEW is the restricted license: you are not allowed to use it in for-profit projects. Another problem is that ANEW was not developed for modern sentiment analysis. Slang words, such as ‘shit’ and ‘wow’, do not occur in ANEW. ANEW also lacks inflection variants: it has ‘annoy’, but not ‘annoyed’, ‘annoys’, ‘annoying’ and ‘annoyingly’. You have to do, e.g., word stemming to match words against ANEW.

I began to construct my own word list when I started to do sentiment analysis of COP15 Twitter messages and “temporal sentiment analysis”. It presently lists 1480 words with their associated valence between -5 and +5, and it includes inflection variants and slang words.
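Applying such a list to a text is straightforward. A minimal sketch, where the handful of entries is a made-up illustration (the real list has about 1480 word-valence pairs loaded from a file):

```python
# Toy excerpt of a valence word list; the values here are
# illustrative, not the actual entries of my list.
valences = {"good": 3, "bad": -3, "wow": 4, "annoying": -2}

def sentiment(text):
    """Sum the valences of the words found in the text."""
    words = text.lower().split()
    return sum(valences.get(word, 0) for word in words)

print(sentiment("wow this is good"))  # 4 + 3 = 7
```

Words not in the list simply contribute zero, which is why coverage of slang and inflection variants matters.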

In the past days I have looked into the discrepancies between my list and ANEW. The figure shows a scatterplot of the valences of my list and ANEW for the words that I can match. I stemmed both ANEW and my list and listed the words that differed in positive/negative valence. These words were: affected, aggression, aggressions, aggressive, applause, alert, alienation, brave, hard, mischief, mischiefs, profiteer, silly.
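The sign comparison can be sketched like this. The valences below are made-up illustrative numbers, not the actual entries of either list; ANEW's 1-9 scale is centered at 5 so that both lists get a positive/negative sign:

```python
# Toy excerpts of the two lists (illustrative values only).
mine = {"brave": -2, "applause": -1, "hard": -1, "calm": 2}   # scale -5..+5
anew = {"brave": 7.5, "applause": 7.5, "hard": 5.2, "calm": 6.9}  # scale 1..9

# A word disagrees in sign if the product of the centered valences is negative.
disagreements = sorted(word for word in mine
                       if word in anew and mine[word] * (anew[word] - 5) < 0)
print(disagreements)  # ['applause', 'brave', 'hard']
```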

‘Brave’ and ‘applause’ I record as negative: a clear sign error in my list that I need to correct.

‘Affected’ I have as negative (it is usually used in a negative sense), while ANEW has ‘affection’ as very positive. My word stemming has a problem here. There is a similar problem with my ‘alienation’ and ‘profiteer’ compared to ANEW’s ‘alien’ and ‘profit’.
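The collision can be illustrated with a crude suffix stripper (a toy stand-in for a real stemmer such as Porter's): ‘affected’ and ‘affection’ collapse to the same stem, so a stemmed lookup cannot keep their different valences apart.

```python
def crude_stem(word):
    """Strip a few common suffixes (a toy stemmer, not Porter's)."""
    for suffix in ("ion", "ed", "s"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(crude_stem("affected"), crude_stem("affection"))  # both give 'affect'
```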

‘Aggression’ and similar words I have as negative, while ANEW has ‘aggressive’ as slightly positive, which I find strange. The same pattern occurs for ‘silly’. That is negative to me. ‘Hard’ I have as slightly negative while ANEW has it slightly positive. In some connotations the word is used positively, but otherwise it seems to be used in mostly negative contexts. ‘Alert’ I also have as negative, while ANEW sets it as positive. On Twitter it seems mostly to be used in a negative sense, e.g.:

Lost Pet Lost Pet Alert, have you seen Jake (link)

or

Russian Armed Forces on High Alert Over North Korea (link)

‘Mischief(s)’ I have as negative while it is slightly positive in ANEW. WordNet has two senses of the word that are clearly negative: “reckless or malicious behavior that causes discomfort or annoyance in others” and “the quality or nature of being harmful or evil”. WordNik reports a lot of definitions, with one somewhat positive: “An inclination or tendency to play pranks or cause embarrassment.” On Twitter it often seems to be used ironically with a basically positive sense.

I need to fix the errors in my word list and extend it, but I think it is a good alternative to ANEW.

By the way, Sune Lehmann Jørgensen has humorously called my sentiment word list AFINN, as a pun on my name and on ANEW. :-)


MongoDB export to a comma-separated values file


MongoDB by default exports to JSON, but I discovered that it can also export to a comma-separated values (CSV) file. You need to specify a field list. An example with data from the streaming API of Twitter with numerous fields:

mongoexport -d twitter -c tweets --csv -f \
text,created_at,in_reply_to_status_id,coordinates,source,englishness,\
place,in_reply_to_user_id,in_reply_to_screen_name,geo,\
id.floatApprox,user.id,user.followers_count,user.location,\
user.listed_count,user.statuses_count,user.description,\
user.friends_count,user.name,user.lang,user.favourites_count,\
user.screen_name,user.url,user.created_at,user.time_zone,\
user.following -o tweets.csv
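The exported file can then be read back with Python's csv module; field names come from the header row that mongoexport writes. The sample line below is made up, standing in for the contents of tweets.csv:

```python
import csv
import io

# Made-up sample of mongoexport's CSV output (header row plus one record);
# a real run would open the exported tweets.csv instead.
sample = ('text,created_at,user.screen_name\n'
          '"Hello world","Mon Dec 13 2010","someuser"\n')

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["user.screen_name"])  # prints "someuser"
```

Note that the dotted MongoDB field names, such as user.screen_name, simply become column names in the CSV.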

For loops, NumPy and Python optimization


For numerical computations the book Python Scripting for Computational Science recommends avoiding explicit loops and using vectorized NumPy expressions instead. The author Hans Petter Langtangen writes that “a speed-up factor of 10 is often gained” (page 427). The same kind of recommendation is also often given for Matlab programs.

With the timeit module the execution time of short statements or expressions can be measured. The module runs a piece of code a large number of times. With it we can measure the speed of a classic for loop, a “list comprehension” for loop and a map implementation of the same small list/array computation:

import timeit
timeit.Timer("""
x2 = []
for x in range(10):
    x2.append(x**2)
""").timeit()

timeit.Timer("""
x2 = [ x**2 for x in range(10) ]
""").timeit()

timeit.Timer("""
x2 = map(lambda x: x**2, range(10))
""").timeit()

On my 2’500 DKK Intel Atom N450 netbook the results are 13.4, 8.7 and 16.6, while our department CIMBI server reports 4.1, 2.8 and 4.5. The middle one has no function evaluations, while the two others have either an ‘append’ or a ‘lambda’ function evaluation. That might be the reason why the middle one is faster.

On small lists there seems to be no advantage in using NumPy:

timeit.Timer("""
from numpy import asarray
x2 = list(asarray(range(10))**2)
""").timeit()

Here the result is 23.6 on the CIMBI server. That code can be optimized a bit:

timeit.Timer("""
x2 = arange(10)**2
""", setup="from numpy import arange").timeit()

It now reports 4.2, still not quite as good as the list comprehension version. But if we increase the size of the array NumPy is far superior:

timeit.Timer("""
x2 = [ x**2 for x in range(1000) ]
""").timeit(number=100000)

timeit.Timer("""
x2 = arange(1000)**2
""", setup="from numpy import arange").timeit(number=100000)

Here the CIMBI server reports 22.6 and 0.7 in NumPy’s favor.

(Correction 13 December 2010)