Month: November 2011
In my effort to beat the SentiStrength text sentiment analysis algorithm by Mike Thelwall I came up with a low-hanging-fruit killer approach, or so I thought. Using the standard movie review data set of Bo Pang available in NLTK (used in research papers as a benchmark data set) I would train an NLTK classifier, compare it with my valence-labeled word list AFINN and readjust the weights of the words a little.
What I found, however, was that for a great number of words the sentiment valence in my AFINN word list and the classifier probability trained on the movie reviews were in disagreement. A word such as ‘distrustful’ I have as a quite negative word. However, the classifier reports the probability for ‘positive’ to be 0.87, i.e., quite positive. I examined where the word ‘distrustful’ occurred in the movie review data set:
$ egrep -ir "\bdistrustful\b" ~/nltk_data/corpora/movie_reviews/
The word ‘distrustful’ appears 3 times, in all cases in a ‘positive’ movie review. The word is used to describe elements of the narrative or an outside reference rather than the quality of the movie itself. Another word that I have as negative is ‘criticized’. It is used 10 times in the positive movie reviews (and never in the negative ones). I find one negation (‘the casting cannot be criticized’), but mostly the word appears in a context where the reviewer criticizes the critique of others, e.g., ‘many people have criticized fincher’s filming […], but i enjoy and relish in the portrayal’.
The top 15 ‘misaligned’ words using my ad hoc metric are listed here:
It seems that reviewers are interested in movies that have a certain amount of ‘melancholy’, ‘anger’, distrustfulness and (further down the list) scandal, apathy, hoax, struggle, hopelessness and hindrance. Whereas smile, amusement, peacefulness and gratefulness are associated with negative reviews. So are movie reviewers unempathetic schadefreudians entertained by the characters’ misfortune? Hmmm…? It reminds me of journalism where they say “a good story is a bad story”.
So much for philosophy, back to reality:
The words (such as ‘hurrah’) that have a classifier probability of 0.25 or 0.75 typically occur only once each in the corpus. In this application of the classifier I should perhaps have used a stronger prior so that ‘hurrah’ with 0.25 would end up near the middle of the scale, around 0.5. I haven’t checked whether it is possible to readjust the prior in the NLTK naïve Bayes classifier.
The conclusion on my Thelwallizer is not good. A straightforward application of the classifier on the movie reviews gets you features that reflect the summary of the narrative rather than the movie per se, so this simple approach is not particularly helpful for readjusting the weights.
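The values 0.25 and 0.75 are what additive smoothing with a pseudocount of 0.5 gives for a word seen once in one class and never in the other. Below is a minimal sketch in plain Python of how a larger pseudocount would pull such rare words toward 0.5. It is not the NLTK API: the function name and the `alpha` parameter are made up for illustration, and equally many positive and negative reviews with equal class priors are assumed.

```python
def smoothed_positive_probability(n_pos, n_neg, alpha=0.5):
    """P(positive | word) from per-class document counts with add-alpha smoothing.

    Assumes equally many positive and negative reviews and equal class
    priors, so that the class sizes cancel out of the posterior.
    """
    return (n_pos + alpha) / (n_pos + n_neg + 2 * alpha)


# A word seen in a single negative review only, e.g. 'hurrah'.
# A larger alpha acts as a stronger prior toward 0.5.
for alpha in (0.5, 2.0, 10.0):
    print(alpha, round(smoothed_positive_probability(0, 1, alpha), 3))
```

With `alpha=0.5` the single negative occurrence gives 0.25; larger values move the estimate closer to the middle of the scale.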
However, there is another way the trained classifier can be used. Examining the most informative features I can ask if they exist in my AFINN list. The first few missing words are: slip, ludicrous, fascination, 3000, hudson, thematic, seamless, hatred, accessible, conveys, addresses, annual, incoherent, stupidity, … I cannot use ‘hudson’ in my word list, but words such as ludicrous, seamless and incoherent are surely missing.
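The lookup for missing words amounts to a filtered list. A small sketch with made-up toy data (the real inputs would be the classifier's most informative features and the AFINN words; the function name is hypothetical):

```python
def missing_from_wordlist(informative_words, wordlist):
    """Return the informative words absent from the word list, keeping their order."""
    known = {word.lower() for word in wordlist}
    return [word for word in informative_words if word.lower() not in known]


# Hypothetical toy data, not the actual AFINN content or classifier output.
afinn_words = ["outstanding", "wasted", "ridiculous"]
informative = ["outstanding", "ludicrous", "hudson", "wasted", "seamless"]
print(missing_from_wordlist(informative, afinn_words))
```

Names such as ‘hudson’ still have to be weeded out by hand afterwards.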
(28 January 2012: Look out in the code below! The way the features are constructed for the classifier is troublesome. In NLTK you should not only mark the words that appear in the text with ‘True’; normally you should also explicitly mark the words that do not appear in the text with ‘False’. Not mentioning words in the feature dictionary might be bad depending on the application.)
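A sketch of what such an explicit feature dictionary could look like. The function name, the ‘contains(…)’ key format and the tiny vocabulary are illustrative choices, not the code of the original post:

```python
def document_features(document_words, vocabulary):
    """Map every vocabulary word to True or False depending on its presence."""
    present = set(document_words)
    return {"contains(%s)" % word: (word in present) for word in vocabulary}


vocabulary = ["good", "bad", "seamless"]  # hypothetical, tiny vocabulary
features = document_features("a good and seamless film".split(), vocabulary)
print(features)
```

Here ‘bad’ gets an explicit False entry instead of being left out of the dictionary.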
Our Ingemar Cox asked whether I knew how to download a post from blogspot.com. It was somewhat difficult I found out.
get nostalgic when seeing a simple static HTML Web page.
Have you written an email to me? If you have, please resend it. I might have deleted it… I am sorry for the trouble.
Email is such a central part of office life that when we as users get into email trouble it may affect our working life a good deal. I will here tell you about my experience with a seemingly inconspicuous change at my work: switching the email system from a Unix-based system to Microsoft Exchange. The sysadmin part went fine, with all my old email (3 gigabytes) copied, so the rest was just up to me!
I have previously used a simple program called Pine, logged in via ssh. People were laughing at me for the old approach. But I knew my emails were stored at something like /var/mail/fn and /home/fn/Mail/. Now I don’t know where they are. They just come to me. I have started to use Microsoft Office Outlook Web Access (OWA), Mozilla Thunderbird and Evolution and have felt like someone needing a basic email course. I am still learning. Still trying.
Whereas there was only one place to set up the email before (Pine running on the email server) I now need to set up the email program on every computer I use. And since I am presently switching between two computers (home and work) and have tried OWA, Thunderbird and Evolution, I have several different systems to set up.
- Servers: First you need to set up the incoming and outgoing servers. Solved (at least I think so).
- Sent folder name: Pine and the three other email clients do not seem to agree on what the different folders should be called. I got ‘Sent Items’, ‘Sent’, ‘sent-mail’ and ‘Kladder’ and ‘Drafts’. My old Pine would save sent mail in ‘sent-mail’ so I imagine I could continue with that. You can change the sent folder name in non-web MS Outlook, but what about OWA? It always seems to want “Sent Items”. In Thunderbird you go Edit->Account Settings…->Copies & Folders->Other. This is after you have right-clicked on the account name in the left side panel and subscribed to the relevant folder. After changing the Evolution folder it now seems to work. Solved.
- Folder visibility: Even though (I believe) I have set up the ‘sent-mail’ folder correctly in Evolution I cannot see it. The emails are stored all right, as I can see them in Thunderbird. After some back-and-forth configuration the folder showed up. But then it was gone again… In one case I emailed a person twice because I got no indication that the email was sent. Possibly solved.
- Undelete: Now I accidentally hit (what was it?) the delete button, I think, and gone is an unknown email. I do think the deleted emails go to a trash folder, but I do not know which email it is. I got an ‘Undelete message’ menu item in Evolution but it is not there when I need it. Also, why do I have two ‘Trash’ folders in Evolution? Is it at all possible to have two folders with the same name in the same directory? Apparently it is on my system. At one point I discovered that some of the deleted messages are actually available when you unset Menu – View – Hide Deleted Messages. In Thunderbird there is only one Trash shown and emails that get deleted apparently end up there. In OWA you have ‘Deleted Items’. Not clear to me.
- Synchronization: Synchronization between email server and client is not clear. I am now deleting spam mails for the second time (no, the third time). Perhaps Evolution and Thunderbird didn’t update the server? Not clear what happened.
- OWA reply format: OWA does not format reply mail the way I want it. Seemingly it does no indentation of the answered mail in text mode and it puts no ‘>’. Sometimes I manually add ‘>’… How do people cope with this issue? To me it really seems to be a showstopper. Unsolved.
- Evolution reply format: In Evolution ‘>’ works all right, but sometimes the answered text is not rendered… It helps to reopen the email and then press reply again. Also the ‘>’ paragraph seems to be handled specially by the editor.
- Calendar and contacts: Calendar and contacts are not available in Evolution. It might have been possible with old Microsoft Exchange servers, but I don’t think it is possible with the new version. Evolution and Google Calendar almost work together, though updates in Google Calendar seem not to show up in Evolution. Unsolved. I mostly use a paper calendar.
- Evolution address book not shown. Sometimes when I am composing a new message in Evolution my contacts do not show. I close the compose email window and start composing a new email, – and then it might show up.
- Window title in Evolution. Sometimes the window title does not change right away. So in the message window you will have a new email shown, but with the title of the previous email that I just deleted. This is a problem when you are deleting spam and basing the deletion on the title.
- Link opening: Good old plain nuisance: In Evolution links in emails are opened in a tab in Firefox rather than in a new window. I have googled the Internet empty and don’t recall finding a solution, but now it might work on one of the computers.
- Spam: I was surprised to find that after our email system switch I still received a lot of spam. I would have thought the Microsoft email system would collect spam mails and apply machine learning algorithms to better classify spam and non-spam email. That seemed not to be the case: I do not see a spam button in the OWA interface. There are also spam issues for Thunderbird. You can mark email ‘As Junk’ with the ‘J’ or ‘junk’ button. But how on earth does that help the user? Later I found that the email system switch might not have been fully implemented at that point, and after a while the spam detection became much better.
- In Microsoft Office OWA the delete “button” moves its position in the menu depending on whether you are in list mode or reading the individual email.
- Address book: My Pine address book seems not to have come along. In OWA I both got ‘Contacts’ and ‘Kontaktpersoner’. In Evolution I have managed to get hold on Google Contacts.
- The address book of OWA is not available to Evolution. The address book of Evolution (using Gmail) is not available to OWA.
- OWA keyboard does not work: When I press ‘delete’ nothing happens.
- OWA redirects. OWA redirects external URLs to its own redirect service. I suppose that is for spam and virus monitoring. That is also what Twitter says about t.co. Of course it can also be used to monitor the users.
- Evolution hangs: Suddenly Evolution hangs in “retrieving message”. It turned out that it was because I tried MAPI for communication with the Exchange server. After I changed it back some IMAP settings are still set while others are lost, as far as I can determine. I think it is solved.
- Evolution crash and slowness: Sometimes when I close an Evolution window the entire program seems to either close or crash. No indication why.
- Evolution slow start. When I start it, it takes a suspiciously long time. It gives no indication why. Does it spend the time reading all the email folders? Or communicating via IMAP? My network monitor is not that active. Evolution gives no indication of what it actually does during the starting process.
- Evolution recovered mails: Sometimes when I start Evolution I get a message box asking me whether I want to recover messages. In at least one case I pressed ‘recover’ and sent the message that was displayed afterwards. Later I was told by the recipient that he had already received the message once. At other times it seems that Evolution did not send the email in the first place. This is a serious problem: I sent an important email on Friday, then booted up the computer on Monday and found that the email had not been sent (I thought, until I got an email saying that it was received).
- When you reply to your own mails it is usually because you want to comment on an email you sent and send the new email to that same recipient. OWA suggests your own email address and at the same time scraps the original recipient’s email address. The same goes for forward. Evolution also has this problem. Annoyance.
- I got an “The specified request cannot be executed from current Application Pool” error message in OWA. After some googling it turned out to be because I had “owa/” appended too much on the URL. I think. Annoyance.
- Signature: I would like a conditional signature: sometimes no signature for people I email regularly, and a short or a long signature for other people. This is possible in Pine. I didn’t think it would be possible in Evolution until I discovered the button in the upper right corner. However, Evolution may erase part of my email if I change the signature. Annoyance.
- Addresses: In Pine I would start at the address line, type a nickname, press enter and Pine would find the appropriate email address of the wanted recipient. In Evolution you press Alt+T, type the nickname, tab, and press enter twice. In OWA I have a very large address book from the university where just those email addresses starting with ‘A’ fill several pages. Finding an email address may take 2 or 4 mouse clicks.
- Line shifts: I have been writing blog posts by composing them in Pine (or in emacs and copy-pasting them into Pine) and sending the email to Posterous. Now with Evolution the line shifts in my email are no longer interpreted the way that I want. When the post has been saved in the Posterous blog system I need to go there and edit the line shifts.
On the positive side:
- OWA is fast compared to the web-based email systems I have tried before. Probably as fast as Pine.
- Attachments are far easier to handle in Evolution than in Pine.
- The Pine editor had some oddities, e.g., Ctrl+K would delete the entire sentence rather than the last part from the cursor.
My issues are not specific to the system we have at our work. The problems I have are the same that quite a number of others presumably could have with the combination of Microsoft Exchange Server and Linux clients. I remember (perhaps incorrectly) a usability expert (was it Rolf Mölich?) saying that in some cases users preferred graphical windows-based systems to terminal-based systems, even though the terminal-based system could be shown to perform better in usability studies. Perhaps because people would think the windows-based system looked fancier. Surely terminal-based Pine looks old-fashioned.
Any conclusion? Well, maybe I should just switch all my email to Gmail? Or give Thunderbird a better test. They may be better. I guess people on Microsoft Windows with Outlook may be better off. At one point I thought I got more used to the idiosyncrasies of Microsoft OWA and Evolution.
Note that I did get the email from Joseph Ho from Societe Generale Corporate & Investment Banking (Asia Pacific), who has a good deal. Furthermore, Pierre@europe-hire.net and Mauricio@europe-hire.net, you with the vacant position (something with ‘estate property’): I also got yours. You do not need to resend your email.
Some days ago the world press was abuzz with the study on the Facebook friend graph that found the average distance between active Facebook users to be 4.74, i.e., almost 5, meaning that there are on average 4 Facebook-linked friends separating one Facebook user from another. See also the brief summary on the Brede Wiki.
There are standard off-the-shelf algorithms to compute the distances for small graphs, but because the Facebook graph is so huge special algorithms are needed. First author Lars Backstrom, employed at Facebook (and who gave a keynote at the 8th Extended Semantic Web Conference about the Facebook friend suggester), had the Facebook data and got hold of an algorithm from Milano researchers that could handle the 721 million active Facebook users and their 69 billion friendship links. In a previous study the Milano researchers examined the “spid”, i.e., the variance-to-mean ratio of the distances. They claim that “spid larger than one are to be considered ‘web-like’, whereas networks with a spid smaller than one are to be considered ‘properly social’” and demonstrated that on a number of social and web networks. The Facebook study found a spid of 0.08.
I am somewhat confused by the notion of six degrees of separation. Firstly, does “degrees of separation” mean the number of persons (nodes) or the number of friendships (edges) between a source person and a target person? Backstrom & Co. “will assume that ‘degree of separation’ is the same as ‘distance minus one’.”, that is, we are counting the persons (nodes) between source person and target person. Another problem is whether the “six” refers to
- the average distance between all pairs,
- the maximum of the average distance for each person,
- the maximum distance between all pairs (the diameter), or
- the average eccentricity; the eccentricity being the maximum distance for each person to any other person.
If you look at the first sentence of the present version of the Wikipedia article I think it alludes to the first interpretation. Playwright John Guare’s six degrees seem rather to be the third interpretation.
With the co-authorship graph from the Brede Wiki I can compute these different distances. The co-author graph is not fully connected, but the largest connected component has 665 names, which resolve to somewhat fewer than 665 people (I have uncorrected problems with, e.g., “Saffron A. Willis-Owen”/“Saffron A. G. Willis-Owen”). On this graph I find the mean distance to be 5.65, the mean eccentricity to be 9.37 and the diameter to be 12. Computing the spid I get 0.73, i.e., a “social network” according to the Milano criterion.
I wonder why the average Facebook distance is so low. Jon Kleinberg mentions “weak ties”. Some of my Facebook friends are linked to public figures in Denmark. Could it be that Facebook users tend to connect with famous persons and that these famous people tend to act as hubs? Another phenomenon that I think I have noticed on Facebook is that when people travel abroad and make a cursory acquaintance they tend to become friends on Facebook, perhaps as a kind of token and reminder. Are such brief encounters actually there and important for the low average distance?
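The different distance measures can be sketched in plain Python with a breadth-first search from every node. This is an illustration on a made-up toy graph, not the code used for the Brede Wiki numbers; the function names are hypothetical, the graph is assumed to be connected and undirected, and the spid uses the population variance over all ordered pairs of distinct nodes.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def distance_statistics(graph):
    """Mean distance, mean eccentricity, diameter and spid of a connected graph."""
    distances = []        # one entry per ordered pair of distinct nodes
    eccentricities = []   # largest distance seen from each node
    for node in graph:
        d = [v for n, v in bfs_distances(graph, node).items() if n != node]
        distances.extend(d)
        eccentricities.append(max(d))
    mean = sum(distances) / len(distances)
    variance = sum((x - mean) ** 2 for x in distances) / len(distances)
    return {
        "mean_distance": mean,
        "mean_eccentricity": sum(eccentricities) / len(eccentricities),
        "diameter": max(eccentricities),
        "spid": variance / mean,  # variance-to-mean ratio of the distances
    }


# Hypothetical toy graph: a path a - b - c - d
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
stats = distance_statistics(path_graph)
print(stats)
```

Running a search from every node is fine for a graph with a few hundred authors, but it is exactly what does not scale to the Facebook graph, hence the need for the special algorithm.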
(2012-01-16: Language correction)
While debugging my topic mining web service I came across a strange error message in the Apache error log (
[Wed Nov 23 15:31:58 2011] [error] [client 127.0.0.1] File does not exist: /var/www/ga.js, referer: http://twitter.com/
So Twitter is using Google Analytics on my Web server…? Are
something about proxy. However searching broader I found one comment stating “Hey…! I Found the solutions…. […] a host file […]”. I then suddenly came to remember that I added a line to the host file (
127.0.0.1 www.google-analytics.com
I put that line in to keep Google, Twitter and all the rest that use Google Analytics from data mining/spying too much on my browsing habits. It redirects Google Analytics requests to my own computer. It also means that when Twitter reports that they have 100 million active users they are actually wrong. It should be 100000001 active users.
I have previously written about network mining in a co-author graph in connection with the actress Natalie Portman and her NeuroImage article, and recently about co-author mining with the data in the Brede Wiki. Now with the data from the Brede Wiki and NetworkX it is quite easy to find the shortest path between authors once the co-author data is represented in a NetworkX object. It is just one line of Python:
path = nx.shortest_path(g, u"Finn Årup Nielsen", "Natalie Herslag")
Once printed with “for author in path: print(author)” you get:
Finn Årup Nielsen
Bruce M. Cohen
Abigail A. Baird
I presently have to misspell Natalie Portman’s surname because I entered it incorrectly in the Brede Wiki for some reason.
With the SQLite file generated from the Brede Wiki it is relatively easy to perform some simple co-author mining. First one needs to download the SQLite file from the Brede Wiki download site. Here with the unix program ‘wget’:
For a start let’s find the author listed with the most papers in the Brede Wiki. Starting the sqlite3 client program:
After setup (sqlite> .mode column, sqlite> .width 25) finding the most frequently mentioned authors is one line of SQL:
sqlite> SELECT value, COUNT(*) AS c FROM brede WHERE (template='paper' OR template='conference_paper') AND field = 'author' GROUP BY value ORDER BY c DESC LIMIT 10;
Finn Årup Nielsen 26
Gitte Moos Knudsen 23
Lars Kai Hansen 18
Claus Svarer 15
Olaf B. Paulson 15
Vibe Gedsø Frøkjær 13
Russell A. Poldrack 11
David Erritzøe 10
Richard S. J. Frackowiak 10
William F. C. Baaré 9
That is, e.g., ‘Russ Poldrack’ is presently listed with 11 papers in the Brede Wiki. To visualize the co-authors one can query the SQLite database from within Python, first getting the pages with a ‘paper’ or ‘conference paper’ template, then querying for the authors in each of these pages, adding the co-authors to a NetworkX graph and drawing the graph via GraphViz. The image shows the fourth connected component with 33 authors centered around Jeffrey A. Lieberman. The first connected component has 1490 authors. This number is much higher than the number of researchers in the Brede Wiki that each have a page of their own (520), see the Researcher category.
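The steps can be sketched without NetworkX, using only the standard library. Everything below is a stand-in under stated assumptions: the table is a tiny in-memory one with invented author names, the column `tid` is assumed to identify one template instance (i.e., one paper), and the function names are hypothetical. Only the column names `template`, `field` and `value` are taken from the SQL above.

```python
import sqlite3
from itertools import combinations

# Hypothetical stand-in for the downloaded SQLite file.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE brede (tid INTEGER, template TEXT, field TEXT, value TEXT)")
connection.executemany(
    "INSERT INTO brede VALUES (?, ?, ?, ?)",
    [
        (1, "paper", "author", "Author A"),
        (1, "paper", "author", "Author B"),
        (1, "paper", "title", "Some title"),
        (2, "conference_paper", "author", "Author B"),
        (2, "conference_paper", "author", "Author C"),
        (3, "paper", "author", "Author D"),
    ],
)


def coauthor_graph(connection):
    """Build an adjacency dict linking authors that appear on the same paper."""
    rows = connection.execute(
        "SELECT tid, value FROM brede "
        "WHERE template IN ('paper', 'conference_paper') AND field = 'author'")
    authors_by_paper = {}
    for tid, author in rows:
        authors_by_paper.setdefault(tid, set()).add(author)
    graph = {}
    for authors in authors_by_paper.values():
        for author in authors:
            graph.setdefault(author, set())  # keep single-author papers too
        for a, b in combinations(sorted(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the components as sets of authors, largest first."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


graph = coauthor_graph(connection)
components = connected_components(graph)
print([sorted(c) for c in components])
```

In the toy data Author B links the two first papers, so A, B and C end up in one component while Author D forms a component alone.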
Also PageRank computation on the co-author graph is straightforward once the data is in the NetworkX graph:
for a, p in sorted(nx.pagerank(g).items(), key=operator.itemgetter(1))[:-11:-1]: print('%.5f %s' % (p, a))
0.00181 Gitte Moos Knudsen
0.00176 Richard S. J. Frackowiak
0.00151 Russell A. Poldrack
0.00142 Finn Årup Nielsen
0.00137 Edward T. Bullmore
0.00134 Klaus-Peter Lesch
0.00134 Karl J. Friston
0.00130 Olaf B. Paulson
0.00128 Thomas E. Nichols
0.00121 Peter T. Fox
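For readers curious about what such a PageRank computation does, here is a minimal power-iteration sketch in plain Python. It is not NetworkX's implementation: the function name, the fixed number of iterations and the toy graph are illustrative, the damping factor 0.85 is merely the commonly used value, and every node is assumed to have at least one neighbor (true within a co-author component with more than one author).

```python
def pagerank_sketch(graph, damping=0.85, iterations=100):
    """Power iteration on an undirected graph given as an adjacency dict."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets the same small 'teleport' share ...
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        # ... plus an equal part of the damped rank of each of its neighbors.
        for node in nodes:
            share = damping * rank[node] / len(graph[node])
            for neighbor in graph[node]:
                new_rank[neighbor] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: one hub author with three co-authors.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ranks = pagerank_sketch(star)
for author, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print('%.5f %s' % (score, author))
```

The scores sum to one, which is why the values in a graph with well over a thousand authors are as small as those listed above.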
(Note 2011-11-18: There is an error as ‘pid’ should have been ‘tid’, i.e., “SELECT DISTINCT tid FROM…”. Using ‘pid’ instead of ‘tid’ will find all authors on a wikipage so also counting those that are cited within the ‘cite journal’ template)
(2012-01-16: Language correction)