Month: March 2011

WikiSym 2011 deadline approaching

Posted on Updated on

Apparently the WikiSym 2011 (International Symposium on Wikis and Open Collaboration) research paper deadline is approaching: It is April 1st 2011 (Now how did that date come so quickly?). This year the symposium will be held at Microsoft Research Silicon Valley in Mountain View, California, which might enable you to catch a glimpse of Leslie Lamport.

I have previously been to two WikiSym meetings: one in Odense, Denmark in 2006, the other last year, when WikiSym was co-located with Wikimania (the Wikipedia meeting) in Gdansk. Both WikiSym meetings featured so-called Open Spaces, a type of formal informal meeting (to state it clearly) or “structured coffee-break”, where issues pertaining to the topic of the meeting can be discussed on the fly. It creates a much more interactive environment than the usual unidirectional scientific meeting where one person lectures in front of an ocean of silent listeners. After the first WikiSym I found “ordinary” scientific meetings somewhat alienating. At the Human Brain Mapping conference Tom Nichols has sometimes (outside the formal schedule) organized poster walk-arounds with many-to-many interaction, which helps, but WikiSym still creates a forum with quite good interaction. I was surprised to find that Wikimania does not schedule Open Spaces.

WikiSym also manages to attract a diverse set of researchers. It is not just engineers sitting with their heads in PHP and R code. You will also find business school researchers and, e.g., researchers examining how wikis can be used in teaching.


AFINN: A new word list for sentiment analysis on Twitter


In the Responsible Business in the Blogosphere project I have, by the sweat of my own brow, created a sentiment lexicon with 2477 English words (including a few phrases), each labeled with a sentiment strength and targeted towards sentiment analysis on short texts such as those found in social media. It has been constructed with the help of word lists maintained by Steve DeRose (Steven J. DeRose) and Greg Siegle.
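A minimal sketch of how a word list of this kind might be applied to a short text: tokenize, look each word up, and sum the sentiment strengths. The toy lexicon, its values, and the function name `sentiment_score` are made up for illustration and are not taken from the actual list; summing is only one of several possible ways to combine the word scores.

```python
import re

# Hypothetical toy lexicon: both the words and the valences are invented
# for illustration and are NOT entries copied from the real word list.
TOY_LEXICON = {"good": 3, "bad": -3, "lol": 2, "awful": -3}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the valences of all lexicon words found in the text.

    Words that are not in the lexicon contribute 0.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


print(sentiment_score("LOL that was good"))
print(sentiment_score("bad, awful news"))
```

Phrases, negations and emoticons are ignored in this sketch; handling them would need more than a plain word lookup.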

We have used my word list for sentiment analysis on Twitter in a few studies, the most notable so far being Good Friends, Bad News – Affect and Virality in Twitter. However, we have not been quite sure how well it performs compared to other sentiment lexicons such as ANEW. I have included a number of words frequently used on the Internet that I have not found in ANEW: obscene words and Internet slang acronyms such as LOL (laughing out loud). So do these extra words make my word list better? On the other hand, ANEW is constructed by multiple persons rating each word and should be much better validated than my list. So maybe ANEW is better?

In a simple comparison between ANEW and my list I looked at the correlation between the sentiment strengths (valences) that the two lists assign to each word. I have previously written about that issue. Such an analysis doesn’t really answer how good the lists are for sentiment analysis.

A few weeks ago Sune Lehmann mentioned that in their study they got tweets labeled for sentiment strength via Amazon Mechanical Turk (AMT). Their study was the Twittermood study (or “Pulse of the Nation” study) that was much mentioned in the media, e.g., The New Scientist and Scientific American. See also their YouTube video.

Alan Mislove had obtained 1,000 AMT-labeled tweets, each labeled by 10 AMT workers and rated from 1 to 9. Through Sune I got hold of the Mislove data.

With the Mislove data I have now made a more careful study of the performance of the different word lists and this study is now written up in the position paper A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. The version on our departmental homepage has the code listing.

When I measured the performance of my application of the word lists with a correlation coefficient (between the AMT “ground truth” and my predictions for the sentiment of the tweet) I found that my list and ANEW were clearly ahead of the word lists in General Inquirer and OpinionFinder. To be fair to the two latter word lists I should say that I did not utilize all their information for each word, only the strength polarity. My list was slightly ahead of ANEW. Whether this is statistically significant I don’t know, as I didn’t get around to performing a statistical test.
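A sketch of such an evaluation, assuming the plain Pearson correlation coefficient is used (the exact coefficient is not stated here) and that the 10 worker ratings of each tweet have been averaged into one “ground truth” value. The function name `pearson` is only illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up numbers for illustration only, not the Mislove data:
amt_mean_ratings = [7.2, 2.1, 5.0, 8.3]   # mean of the worker ratings (1 to 9)
word_list_scores = [3, -4, 0, 5]          # summed valences per tweet
print(pearson(amt_mean_ratings, word_list_scores))
```

The two scales do not need to match, since the correlation coefficient is invariant to shifting and positive rescaling of either variable.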

I also tried the SentiStrength Web service sentiment analyzer on the 1,000 Mislove tweets. This is not just a simple word list but a program that handles, e.g., emoticons, negations and spelling. This Web service proved to be the best, slightly ahead of my list and ANEW.

I have now distributed my 2477-word list from our department server (the zip file link). During the course of evaluation I found a few embarrassing mistakes in my previous list: I thought it had 1480 words but it turned out that only 1468 were unique! Words were also sometimes in alphabetic disorder. The new list should have no such problems.
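Mistakes of this kind are easy to catch automatically. The following is a small sketch, assuming the words have already been read into a Python list in file order; the function name `check_word_list` and the example words are hypothetical.

```python
from collections import Counter


def check_word_list(words):
    """Return (duplicates, out_of_order) for a list of words in file order.

    duplicates   : sorted list of words occurring more than once
    out_of_order : words that sort before the word preceding them
    """
    counts = Counter(words)
    duplicates = sorted(word for word, count in counts.items() if count > 1)
    out_of_order = [b for a, b in zip(words, words[1:]) if b < a]
    return duplicates, out_of_order


# Example with invented entries:
duplicates, out_of_order = check_word_list(["abandon", "bad", "bad", "awful"])
print(duplicates, out_of_order)
```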

(Typo correction: 2011-03-17)

(Update 2015-08-25: If you are a Python programmer you might want to take a look at my afinn Python package available here: https://github.com/fnielsen/afinn)

LaTeX and BIBTeX are insecure


Please note that some of the LaTeX code was not translated correctly when I moved this blog post to WordPress.com from Posterous.com.

Have you ever cursed the obscurities of the LaTeX document preparation system? Strange error messages, float placements, wrapping text around figures (with “wrapfig”), LaTeX styles, bibliography styles, not being able to remember specific commands, etc. The wrapfigure environment in particular appears unpredictable to me. It ain’t “what you write is what you get” to me. I suppose that if you understood the internal intricacies of TeX you would float in heaven, but there is less hope for those of us who are still unsure what the meaning of this is:

\hb@xt@\hsize{\hfil\box\@tempboxa\hfil}%

or this:

\sfcode`\.\@m}

One issue I ran into recently was escaping in BIBTeX files. Suppose you have something like this in the bibliography database:

URL = {http://www.cse.ohio-state.edu/~agrawal/788-au10/Papers/Oct28/google-fusion-socc10.pdf},

DOI = {10.1007/978-3-540-76298-0_52},

Before you find a solution you may need to try out combinations of \htmladdnormallink, \href and some different escapes:


URL = {http://www.cse.ohio-state.edu/\~{}agrawal/788-au10/Papers/Oct28/google-fusion-socc10.pdf},

URL = {http://www.cse.ohio-state.edu/%7Eagrawal/788-au10/Papers/Oct28/google-fusion-socc10.pdf},

DOI = {10.1007/978-3-540-76298-0\_52},

My BIBTeX style file contains:


FUNCTION {format.doi}
{ doi missing$
    { "" }
    { ". DOI:~\htmladdnormallink{" doi *
      "}{http://dx.doi.org/" *
      doi *
      "}" *
    }
  if$
}

FUNCTION {format.url}
{ url missing$
    { "" }
    { ". \htmladdnormallink{Link}{" url * "}" *
    }
  if$
}

Another problem is line breaking with bibtex. The bibtex program wants to break long lines from the .bib file. Sometimes this gets you into trouble with HTML linking; something like the following may be problematic when compiling with latex:


\htmladdnormallink{\verb!10.1126/science.1199305!}{http://dx.doi.org/10%
.1126/science.1199305}.
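One possible workaround is to post-process the .bbl file before running latex and glue such lines back together. The sketch below assumes that every unescaped % at the end of a line was inserted by bibtex when it wrapped a long line; a deliberate end-of-line comment would be merged too, so treat it as a starting point rather than a robust tool. The function name `rejoin_wrapped_lines` is hypothetical.

```python
def rejoin_wrapped_lines(bbl_text):
    """Join each line ending in an unescaped '%' with the line after it.

    Assumption: a trailing '%' marks a line break inserted by bibtex.
    Lines ending in an escaped percent sign (backslash followed by '%')
    are left alone.
    """
    joined = []
    pending = ""
    for line in bbl_text.split("\n"):
        if line.endswith("%") and not line.endswith("\\%"):
            pending += line[:-1]      # drop the '%' and keep collecting
        else:
            joined.append(pending + line)
            pending = ""
    if pending:                       # text ended on a wrapped line
        joined.append(pending)
    return "\n".join(joined)


wrapped = ("\\htmladdnormallink{\\verb!10.1126/science.1199305!}"
           "{http://dx.doi.org/10%\n.1126/science.1199305}.")
print(rejoin_wrapped_lines(wrapped))
```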

I have occasionally wondered about the security of LaTeX, as I have seen something like “\@openbib@code” in style files. If you “egrep -r” your file system and search the Internet you will be able to find some commands that read and write files, and you can construct a small ‘program’:

\documentclass{article}
\begin{document}
\newwrite\myout
\immediate\openout\myout=output.txt
\immediate\write\myout{Hello, World}
\immediate\closeout\myout
\end{document}

With this content in a .tex file that is run through latex, you end up with a file called ‘output.txt’ containing ‘Hello, World’. Presumably you could instead write funny things to a ~/.bashrc file. Not good :-(

There hasn’t been much talk about the security of the LaTeX system. Only recently have I seen a paper exploring the issue. The authors describe a DoS attack via the \loop command and manage to build a small virus that nicely infects all your LaTeX files. With a bit more payload it could also make your computer part of a botnet.

The issue resembles the good old macro viruses known from Microsoft Word.

I don’t think we will see a security fix for this issue. It is likely to be regarded as a feature rather than a bug. PostScript seems to have the same problem (i.e., being able to manipulate files) as far as I understand, but ghostscript has the -dSAFER option to handle unsafe documents. MediaWiki’s texvc (for the math rendering) seems not to be susceptible to the problem, but the Checkoway paper shows a number of tools having issues.

Malicious LaTeX code can also be hidden in BIBTeX files. My BIBTeX file is over 80,000 lines long and it would be difficult to check it, especially considering that LaTeX commands can be written in various ways. The LaTeX code can, for example, be converted by a small Python program to a series of characters in TeX’s ^^ hexadecimal notation:


>>> s = r"""\newwrite\myout
  \immediate\openout\myout=output.txt
  \immediate\write\myout{Hello, World}
  \immediate\closeout\myout"""
>>> "".join(map(lambda i: "^^%x" % ord(i), s))

So this LaTeX file will also write a file (don’t latex this code):

\documentclass{article}
\begin{document}
^^5c^^6e^^65^^77^^77^^72^^69^^74^^65^^5c^^6d^^79^^6f^^75^^74^^a^^20^^20^^5c^^69^^6d^^6d^^65^^64^^69^^61^^74^^65^^5c^^6f^^70^^65^^6e^^6f^^75^^74^^5c^^6d^^79^^6f^^75^^74^^3d^^6f^^75^^74^^70^^75^^74^^2e^^74^^78^^74^^a^^20^^20^^5c^^69^^6d^^6d^^65^^64^^69^^61^^74^^65^^5c^^77^^72^^69^^74^^65^^5c^^6d^^79^^6f^^75^^74^^7b^^48^^65^^6c^^6c^^6f^^2c^^20^^57^^6f^^72^^6c^^64^^7d^^a^^20^^20^^5c^^69^^6d^^6d^^65^^64^^69^^61^^74^^65^^5c^^63^^6c^^6f^^73^^65^^6f^^75^^74^^5c^^6d^^79^^6f^^75^^74
\end{document}
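A rough sketch of how one might look for such hidden commands in a large .tex or .bib file: first undo the ^^ notation, then search the decoded text for file-related primitives. The list of primitives is an illustrative assumption and far from complete, the function names are hypothetical, and the decoder makes only a single pass, whereas TeX itself can apply the ^^ rule repeatedly. The encoder uses two hex digits per character ("%02x"), since TeX reads ^^ followed by two lowercase hex digits as a character code and otherwise shifts the next character by 64.

```python
import re

# Illustrative, incomplete list of primitives worth a closer look.
SUSPICIOUS = (r"\openout", r"\openin", r"\write", r"\read", r"\catcode")

# ^^ followed by two lowercase hex digits, or by one ASCII character.
_CARET = re.compile(r"\^\^(?:([0-9a-f]{2})|([\x00-\x7f]))")


def encode_carets(text):
    """Write every (ASCII) character in two-digit ^^xx notation."""
    return "".join("^^%02x" % ord(char) for char in text)


def decode_carets(text):
    """Undo ^^ notation in a single pass."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        # Single-character form: the code is shifted by 64 (XOR 0x40).
        return chr(ord(match.group(2)) ^ 0x40)
    return _CARET.sub(replace, text)


def find_suspicious(text):
    """Return the listed primitives that occur in the decoded text."""
    decoded = decode_carets(text)
    return [name for name in SUSPICIOUS if name in decoded]


hidden = encode_carets(r"\immediate\openout\myout=output.txt")
print(find_suspicious(hidden))
```

A plain substring search like this will also flag harmless uses, and a determined attacker has other ways to disguise a command, so a hit is a reason to read the surrounding lines, not proof of malice.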