Month: June 2011

Lone "The Adjectived" Aburas, The Second


I suppose that the highest aspiration of an author is to get a phone call early one morning from an English-speaking person with a heavy Swedish accent. The second highest may be to become an adjective, such as “Shakespearian”. Now the young suburban observer Lone Aburas (the final s of her name does not indicate a genitive) has managed to become an adjective just after her second novel. Congratulations!

The novel is called Den svære toer (“The Difficult Second”, i.e., the second book).

A collective novel with modern social realism detailing the depressing everyday life of suburbanites. Not a winner? Well, the book is loaded with enough Danish humor and irony that we manage well. One blogger writes that he has difficulty seeing the humor in the novel. Sorry for him. Lone Aburas has clearly stated that she uses irony and that the meta-commentary is meant humorously. Even the title is humorous: “[…] I think it was funny […] it is mostly an ironic title […]”. The ironic meta-commentary at the beginning and the end points most clearly in this direction. The end sets up tasks for the reader. The reader may, e.g., “analyze the beginning of the novel” and find examples of where the author breaks the rules that she sets up. This is meant ironically, so the reader should not necessarily do that. However, the reader may already have analyzed the beginning while reading it and found out that it (the beginning) was ironic. As the rules were set up ironically, they were not really set up at all, so we expect the rules to be broken, meaning that the meta-rule is that the rules are to be broken. (The obvious next step for me here would be to come up with some meta-humoristic irony in a comment on the meta-humoristic irony of Aburas. I will not do that, though.)

Humor with irony sits at the center of Danish popular art: Hans Christian Andersen’s fairy tales, the double entendres of Aqua’s Barbie Girl, the humorous lyrics of the long-time popular Danish band Shubidua (which has sold more records than the population of Denmark). Most of the best-selling Danish films in Denmark over the last 40 years are comedies: the Olsen Gang films, Den eneste ene, Italiensk for begyndere and Klovn: The Movie. Even Albert Speer admirer Lars von Trier’s most popular work in Denmark is humorous.

But apart from the irony, what does the novel want? It is not clear. Lone Aburas leaves her poor characters to their own fate, with a divorce and a dog training course. In the Danish hit comedy Italiensk for begyndere we also follow Copenhagen suburbanites through a course. But that course in the Italian language ends successfully with a romantic trip to Italy, while Lone Aburas’s dog training course ends with the participants being cheated out of the course fee they paid up front. Not nice.

On the negative side I also find that the novel lacks an index. The punctuation I find OK, though.

Advice for Lone Aburas for her third novel? Well, more structure, I would say. And action! Most modern literature involves one or possibly a connected series of murders: a case to solve. A revised second edition could, e.g., consider replacing the police stop on page 126 with a dramatic car chase. Also, the car crash on page 134 could be described in detail. Another issue is what she herself acknowledges on page 137 with the words: “Actually I do not like to describe two humans having sex”, which is a problem as she further writes “[…] you are not a real writer if you are not capable of writing about erotics”. She needs to work on that bit. Include murder and sex. Possibly also international crime and the revolution in Egypt.

Online topic mining with sentiment analysis


I have now updated the Brede topic mining web-service with sentiment analysis using the AFINN word list.

In the example seen in the images I have a few posts from a recent query on Pfizer on Twitter. The sentiment analysis has a problem with the tweet “Pfizer’s Remoxy Fails to Win FDA Approval”: both “win” and “approval” are positive words, but in this context “fails” negates them, which the simple sentiment analyzer does not detect.
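To see the failure mode concretely, here is a minimal sketch of such a word-list scorer applied to that tweet. This is not the Brede web-service code, and the word scores are illustrative stand-ins rather than the exact AFINN values:

# Minimal word-list sentiment scorer illustrating the negation problem.
# The scores are illustrative; they are not necessarily the AFINN-111 values.
scores = {'win': 4, 'approval': 2, 'fails': -2}

def simple_sentiment(text):
    # Sum the word scores; words not in the list count as zero.
    return sum(scores.get(word, 0) for word in text.lower().split())

headline = "Pfizer's Remoxy Fails to Win FDA Approval"
# 4 + 2 - 2 = 4: a positive total, although the news is clearly negative.
print("%d %s" % (simple_sentiment(headline), headline))

The positive words outweigh “fails”, so a simple additive scorer misses the negation entirely.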

(correction: 20:20)


Self-citation and the Milena Penkowa and Peter Riisager case


I have previously blogged about the Milena Penkowa case that has entertained the Danish research community in the first half of 2011. If you want an English update there is an overview in the April article Penkowa for dummies.

One of the latest to jump on the Penkowa-bashing bandwagon is geologist Peter Riisager. Back in March he looked at the self-citations of Penkowa and reported on them on his blog. He found that 54% of Penkowa’s citations were her own. The story was picked up a couple of weeks ago by the university newspaper (in Danish and English) as well as by a Danish science website. Finding that Penkowa has over 50% self-citations, Riisager links to a Nature blogger who claims that “Bad guys have > 50% self-citations” and “good guys have self-citations as < 50% of total cites (I [Brian Derby] am at 25%)”. QED: Penkowa is a bad guy.

But is Riisager (and blogger Brian Derby) right? I cannot find out which method he used, and 50% self-citations sounds like rather a lot.

How can we investigate this further? Well, here is my methodology: I use ISI Web of Science, search on an author and press “Create Citation Report” to get the number of articles the author has written (“Results found”) and the number of citations (“Sum of the Times Cited”). For the number of non-self citations I press “View without self-citations” and read off “Result: ” in the upper left corner of the web page. Is that an OK procedure? Nah. I think the problem is that “Sum of the Times Cited” refers to the number of citations, while “View without self-citations” refers to the number of papers with citations once self-citations are excluded. What we should (also) do is get the number of papers with citations (“View Citing Articles”). The problem is that a single citing paper may contain multiple citations. What we would also like to have is the number of citations without self-citations, but I don’t know how to get that number from ISI Web of Science.

Below I have attempted a count for Milena Penkowa, Peter Riisager, myself and big-shot neuroimaging analyst Karl J. Friston. The “self-citation rate (A)” is computed in what I believe is the wrong way, (citations − papers with non-self citations)/citations, while the “self-citation rate (B)” is computed from the numbers of citing papers, (papers with citations − papers with non-self citations)/(papers with citations).

Author       Papers   Citations   Papers with citations   Papers with non-self citations   Self-citation rate (A)   Self-citation rate (B)
Penkowa M       108        2482                    1261                             1179                      52%                     6.5%
Riisager P       32         372                     273                              254                      31%                     7.0%
Nielsen FA       34         649                     549                              533                      18%                     2.9%
Friston KJ      459       47381                   26663                            26285                      46%                     1.4%
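To make the two rate definitions concrete, here is a small snippet computing them from the four raw counts in the table:

# Recompute the two self-citation rates from the raw ISI Web of Science counts.
# Row format: author, papers, citations, papers with citations,
# papers with non-self citations.
rows = [('Penkowa M', 108, 2482, 1261, 1179),
        ('Riisager P', 32, 372, 273, 254),
        ('Nielsen FA', 34, 649, 549, 533),
        ('Friston KJ', 459, 47381, 26663, 26285)]
for author, papers, cites, citing, nonself in rows:
    rate_a = float(cites - nonself) / cites    # (A): mixes papers and citations
    rate_b = float(citing - nonself) / citing  # (B): citing papers only
    print('%-12s A: %5.1f%%   B: %4.1f%%' % (author, 100 * rate_a, 100 * rate_b))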

In his blog post from 8 March 2011 Riisager writes that Penkowa has a total of 2,401 citations of which 1,296 are self-citations. With my “wrong” methodology I get 2482 − 1179 = 1303 self-citations, pretty close to Riisager’s numbers. So is Riisager mixing up the units, papers and citations? Or how did he get his numbers?

The “wrong” (A) way of computing the self-citation rate seems way off. If you take the (A) self-citation rate of Friston you get 46%. This seems to be an outrageous rate: surely 46% of Friston’s many citations are not generated by himself. That would put him near Brian Derby’s “bad guy”… As long as we do not have the number of citations without self-citations, only the number of papers with citations without self-citations, we can only use the latter. And if we now look at Penkowa’s self-citation rate it is not over 50% but rather 6.5%. That value is actually lower than the self-citation rate I compute for Peter Riisager! So who is laughing now?

I must admit I am not completely sure about my methodology. To investigate the issue fully one may need to download all the papers and count the citations so we can understand the ISI Web of Science values. My (B) method gives me a self-citation rate of 2.9%. I think I have a higher number of self-citations on Google Scholar, as Google Scholar indexes all my slides. As I tend to reference myself on the slides, my number of citations gets boosted, and it may partially explain why my Google Scholar h-index is higher than my ISI Web of Science h-index.
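For reference, the h-index is the largest number h such that h of one’s papers each have at least h citations, so extra citations on papers near the threshold can raise it. A minimal sketch of the computation:

def h_index(citation_counts):
    # Largest h such that h papers have at least h citations each.
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Boosting a few citation counts can push papers over the threshold:
print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([10, 8, 6, 5, 5]))  # 5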

(2012-03-07: language correction)

Simplest sentiment analysis in Python with AFINN


I have previously blogged about sentiment analysis. Code for simple sentiment analysis with my AFINN sentiment word list is also available from the appendix in the paper A new ANEW: Evaluation of a word list for sentiment analysis in microblogs as well as ready for download. It might be a little difficult to navigate the code, so here I have made the simplest example in Python of sentiment analysis with AFINN that I could think of.


#!/usr/bin/python
#
# (originally entered at https://gist.github.com/1035399)
#
# License: GPLv3
#
# To download the AFINN word list do:
# wget http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6010/zip/imm6010.zip
# unzip imm6010.zip
#
# Note that for pedagogic reasons there is a UNICODE/UTF-8 error in the code.

import math
import re
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

# AFINN-111 is as of June 2011 the most recent version of AFINN
filenameAFINN = 'AFINN/AFINN-111.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [
    ws.strip().split('\t') for ws in open(filenameAFINN) ]))

# Word splitter pattern
pattern_split = re.compile(r"\W+")

def sentiment(text):
    """
    Returns a float for sentiment strength based on the input text.
    Positive values are positive valence, negative values are negative valence.
    """
    words = pattern_split.split(text.lower())
    sentiments = map(lambda word: afinn.get(word, 0), words)
    if sentiments:
        # How should you weight the individual word sentiments?
        # You could do N, sqrt(N) or 1 for example. Here I use sqrt(N)
        sentiment = float(sum(sentiments))/math.sqrt(len(sentiments))
    else:
        sentiment = 0
    return sentiment

if __name__ == '__main__':
    # Single sentence example:
    text = "Finn is stupid and idiotic"
    print("%6.2f %s" % (sentiment(text), text))

    # No negation and booster words handled in this approach:
    text = "Finn is only a tiny bit stupid and not idiotic"
    print("%6.2f %s" % (sentiment(text), text))


(2012-12-01: Updated link to new gist at github)

Social media paranoia


In my effort to stay updated on and investigate social media, I got an account on one of the large Facebook-wannabe websites with social network facilities. I knew of no one on the site and was therefore fairly surprised when its friend suggester came up with a person that I knew, and only that person! How was the website able to know I was connected to that person? There is exceedingly little information on the public Internet to connect me with that person. The person lives in another place, is of another age and is in another business. If I google the public web I find no pages that mention me and the person together. The way I logged into the social website was independent of other social websites: I didn’t explicitly tell the website about my other accounts on Twitter, Facebook, MySpace, LinkedIn, Xing, Tumblr or Posterous. Thus it could not get access to my social network through me, so the website must have gotten this relatively private information from somewhere. How?

I will come back to that. First a bit on other issues of privacy.

I recently went to the Extended Semantic Web Conference where Abe Hsuan provided one of the fine keynote talks. He focused on privacy on the Web and the “Data Valdez”. Among the topics he addressed were:

  • The Dog Poop Girl from Seoul, who was photographed by an anonymous subway passenger. The girl’s dog had shit on the floor of the Seoul subway train and the girl was so embarrassed that she left it there. As the photo was released, an Internet storm arose against the poor girl, and her identity and personal details were revealed.
  • In the AOL Data Valdez the company released 20 million Web search queries in 2006, and with a bit of compiling journalists could reveal the identities of individual users, e.g., a 62-year-old woman and her search queries for “60 single men” and other personal searches.
  • The Netflix Prize, where researchers could break the anonymization in the video rental company’s data by correlating it with IMDb.
  • Pandora’s Android app, which appears to send users’ birth dates, gender and GPS information to advertising companies, according to an analysis by Veracode.

Hsuan also pointed to a whole series of companies that specialize in correlating information across and beyond the Web: BlueCava, BlueKai, Epsilon/Abacus, TargusInfo, brilig.com, Sense Networks, Ingenix (prescription drug history, therapeutic outcomes and billing information) and face.com (facial recognition technology). In April 2011 researchers reported that Apple devices stored lists of locations with timestamps without the user’s acknowledgment. According to Apple, this is just to help the user get faster geolocation through wifi and mobile phone tower data rather than slow GPS. However, with access to the unencrypted backup of the device one can observe the travels of the device’s user.

Google got itself into a lawsuit after collecting and transmitting location data on the Android platform, see here.

Revealing too much about your location in public may give thieves a good opportunity, and a Danish insurance company advises users to remove the Facebook Places application. There is an asymmetry in knowledge: the thieves know when you are away from your house, but the thieves are not willing to reveal that they are in your house.

The interesting website Social Clusters by Morten Barklund allows you to make intelligent visualizations of your friend network from Facebook. To enable that you have to reveal your friend network to the Web service. Though eager to try it out, I was too reluctant to reveal my network. Regardless, I am probably already in the database, as some people in my friend network on Facebook have signed up, i.e., I am not among the presently 102 registered users but very likely among the presently 28,513 connected persons.

Registration may not even be necessary to reveal your friend network. If one among your Facebook friends has an open profile, some information about you is revealed even if you have a closed profile. According to a study, a third of Danes on social networks regularly upload photos of people other than themselves, and among these a fourth have an open profile. That makes roughly one friend in 1/3 × 1/4 = 1/12 a potential leak, so if you have more than 12 friends there is a fair chance/risk that an image of you is accessible to non-friends, even if you have a closed profile and never uploaded images of yourself (see the sketch below).
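A back-of-the-envelope calculation of that chance, assuming the survey figures can be read as one friend in twelve exposing photos publicly and that friends act independently (a rough sketch, not the study’s method):

# Chance that at least one of n friends both uploads photos of others
# (a third of users) and has an open profile (a fourth of those).
p = (1.0 / 3) * (1.0 / 4)  # roughly 1/12 per friend
for n in (12, 25, 50):
    at_least_one = 1 - (1 - p) ** n
    print('%2d friends: %2.0f%% chance of a publicly visible photo' %
          (n, 100 * at_least_one))

With 12 friends the chance is already about 65%, and it approaches certainty as the friend count grows.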

Now back to the social website that guessed right with its friend suggester. How did it do it? Here are some suggestions:

  • Facebook could have revealed my friend network to the website. This option is unlikely given that Facebook and the website are competitors.
  • The website could have obtained the information from the public parts of Facebook. I think this is also unlikely: Facebook would not allow a competitor to crawl its website to acquire the friend network.
  • Before I understood that Facebook applications are actually third parties and do not just keep the data within Facebook, I added a few applications. One of them was the Friend Wheel. I do not know what the Friend Wheel does with my data, but I don’t think it has gotten to the social website.
  • A likely path for the data is that the other person logged in via Facebook so that the other social website could get hold of the Facebook friend network. As I was in this network and my name is pretty unique, the social website could match up my name with the name in the friend network.

Who knows Who knows? You can now play a Facebook application while doing research



At the 8th Extended Semantic Web Conference researchers from Potsdam showed a Facebook application. It is a quiz game called WhoKnows?

The special thing about it is that the questions are automatically generated from Wikipedia via DBpedia. As users’ interactions with the game are recorded, the results may be used to improve the ranking of triple data in Semantic Web applications as well as to find errors in Wikipedia/DBpedia.
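As a rough illustration of the idea (not the authors’ actual pipeline), a DBpedia-style triple can be turned into a multiple-choice question along these lines; the triple and distractors are made-up examples:

# Turn an RDF-style triple into a multiple-choice quiz question.
triple = ('Copenhagen', 'capital of', 'Denmark')
distractors = ['Sweden', 'Norway', 'Germany']

subject, predicate, obj = triple
print('%s is the %s ...?' % (subject, predicate))
for number, option in enumerate(sorted([obj] + distractors), start=1):
    print('%d) %s' % (number, option))

Recording which alternatives players pick, and how often they answer correctly, is what lets the game rank triples and flag suspect ones.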

The background scientific paper is WhoKnows? – Evaluating Linked Data Heuristics with a Quiz that Cleans Up DBpedia. Last author Harald Sack is presently at the top of the high-score list.

Another of their Facebook quiz applications is Risq.