Month: May 2011
Stop. Stop. I am trying to prepare a talk about sentiment analysis in the blogosphere and am being continuously interrupted by sentiment in the blogosphere! The talk is about A new ANEW: Evaluation of a word list for sentiment analysis in microblogs, where I have developed a word list used to gauge what people feel about companies and products when they write in social media. But alas, poor Finn. Yesterday I was interrupted by the Marmite viral story: A product I had never heard of was suddenly a hot topic in social media. Brits were puzzled/annoyed/outraged by Denmark not allowing a small Copenhagen-based store to sell Marmite. The problem was that the product was a fortified food item, and strict Danish law for functional foods required approval of such items. Usually you hear people defending “natural food”, but in this case you would hear ordinary Brits affectionately defending a product with supplements.

Now today I am interrupted by another product/company story in the social media. The Danish grocery store chain Irma has gotten into trouble with an ad campaign targeted towards males (such as I, I suppose). It reads (as a fake note from a mother to a father concerning their son Emil):
Hi darling, Alarm!!! Emil wants to go to ballet! We must have done something wrong. Can’t you do some male-stuff. Marching. Make a fire. Should we have a barbecue this evening? When you get to Irma, please take their magazine “Krydderiet”. Kiss.

Two male ballet dancers, Nikolaj Hübbe and Alexander Kølpin, were not happy when interviewed by the tabloid B.T. And soon Irma’s Facebook page was hot with comments calling for a withdrawal. Monday Irma withdrew the commercial, and “Gitte”, boss of marketing, deeply regretted their error on Irma’s Facebook page. They did not want to make a politically incorrect campaign. Two female communication experts later called out: Homophobia! Sexual discrimination! Their title was “Antifeministic antiballet-gay-rubbish” and they called the ad “just stupid”, “shockingly retarded” and “discount communication”. The two also argued that Irma is “more than a brand” and invented the notions of “Irmanism” and “Irmanist”: a choosy city dweller with an emotional attachment to the store. They write “Irma […] is not just a store. Irma is an institution in the Danish society.” I usually get my groceries in Irma and (oops!) I thought Irma was just a store. I guess I am not an Irmanist.

Marketing boss Gitte’s withdrawal backfired a bit: A few people on Facebook found the ad humorous and disliked the withdrawal, and blogger Anne Sophia Hermansen noted that humoristic analphabetism was alive and well in Denmark. Previous cases of controversial “male” commercials which played on sex are the old controversial Carlsberg poster and the JBS male underwear ads with lightly dressed women. These had indeed racy pictures. Like Irma, JBS stopped their campaign, but not before the pictures and the brand had been widely exposed in media discussing the case. Now, Thursday, the Danish social media mining site overskrift.dk reports “Irma” as the top trend. So it seems that the JBS trick has worked well for Irma.
“OMG, Python doesn’t have a sign function”, I was almost about to say. I am fairly surprised there. It has surprised others too. My suggestion is:

    def sign(x):
        if x > 0:
            return 1.
        elif x < 0:
            return -1.
        elif x == 0:
            return 0.
        else:
            return x
One suggestion from others uses math.copysign:

    import math

    def sign(x):
        return math.copysign(1, x)
However, this implementation returns funny things for NaN values – not the NaN that I would expect. Another suggestion, in my opinion, also gives the “wrong” value for NaN (and here zero has an issue as well):

    def sign(x):
        return 1 if x >= 0 else -1
“elif x == 0: return 0.” is necessary if you want to return 0.0 from an input of -0.0.
My implementation seems to be in alignment with Matlab and Octave, though they handle string input differently. Here are some example calls to the function with numerical input:

    >>> sign(-np.nan)
    nan
    >>> sign(np.nan)
    nan
    >>> sign(np.inf)
    1.0
    >>> sign(-np.inf)
    -1.0
    >>> sign(-2)
    -1.0
    >>> sign(+2)
    1.0
    >>> sign(+0.0)
    0.0
    >>> sign(-0.0)
    0.0
    >>> sign(-np.inf)
    -1.0
    >>> sign(np.inf)
    1.0
Update 20 May 2011: One should think that such a simple function couldn’t go wrong, but it did: For an empty list, sign([]), and for None, sign(None), the function produces inappropriate results. Furthermore, an appropriate ‘sign’ function is defined in Numpy, e.g., np.sign(np.nan) works nicely.
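One way to avoid the problem with None and the empty list is to reject non-numeric input outright. The following is only a sketch of that idea, not the implementation above: the name sign_strict and the decision to raise a TypeError (rather than return some value) are assumptions made for illustration.

```python
from numbers import Real


def sign_strict(x):
    """Return -1.0, 0.0 or 1.0 for a real number; NaN is passed through as NaN."""
    # Assumption: non-numeric input (None, [], strings) is an error, not a value.
    if not isinstance(x, Real):
        raise TypeError("sign_strict() needs a real number, got %r" % (x,))
    # NaN is the only value that is not equal to itself.
    if x != x:
        return float("nan")
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    # Covers both +0.0 and -0.0.
    return 0.0
```

With this variant sign_strict(None) and sign_strict([]) raise an exception instead of silently returning a number, while the numerical cases behave as in the examples above.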
At one point in life I acquired a CD with famous works of Edward Elgar: Pomp and Circumstance, Nimrod, Sospiri, the Cello and all that. I found the recording fairly good. The cover stated that the conductor was George Richter handling the London Symphony Orchestra. Googling my way on George Richter I couldn’t say that I found much. I found several references to CDs but not much about the man. London Symphony Orchestra has had a Richter as conductor: Hans Richter. Perhaps George was related? But I could not find any information about that.

I then on 6 December 2009 added George Richter to Wikipedia in the hope that someone would seek more sources. But on 16 May 2011 a fierce Wikipedia deletionist came by the article threatening to kill poor George due to lack of references – a cardinal sin for articles about living persons on Wikipedia. Interestingly though, the deletionist had diligently discovered one single reference through Google Books: Jonathan Brown’s (who is he?) book from 2000, “Tristan und Isolde on Record. A comprehensive discography of Wagner’s music drama with a critical introduction to the recordings.” And now comes the spooky part. On page xiii Brown describes our George as an “apocryphal conductor”. So what does that mean? That George didn’t get to join in on the Bible along with Mary, Moses and the rest of the band? No. Further on, on page 215, Brown states that one Wagnerian Richter recording is actually by Heinrich Hollreiser and that the recording is not, as stated, with London Symphony Orchestra but rather with Bamberg Symphony Orchestra. Brown – the Wagnerian discographer – had timed the different recordings of Wagner and found that Hollreiser’s recording has been issued under a number of other names: Heinrich Heller, Hans Burg, Ralph deCross, Otto Friedlich, Karl Ritter, Leon Szalar and our George Richter. It seems likely that the Elgar recording is also not by George Richter but by another, as yet unidentified, conductor, perhaps Hollreiser?
Apparently George Richter is an invented name. Why? My Elgar CD brands itself as an “Apollo Classics” and the company issuing the CD in 1995 is “Wisepack Ltd.” Tracking this company I find that Business Directory North Central London records Wisepack Ltd.’s address as “Unit 12. Latimer Road. London. W10 6RG” with a “PIN Tel.” 0904 049 8229. Their business is “production of records tapes and CDS”. So we are getting closer. The UK company register Companies House lists two entries for Wisepack: “Wisepack (1992) Limited”, company number 02680885, incorporated 24 January 1992 and dissolved 2 April 1996, with an unknown nature of business. The other entry is “Wisepack Limited”, company number 02245831, incorporated 1988 and dissolved 29 September 1998, at the Latimer Road address. Their nature of business is stated as “Publishing of sound recordings”, “Reproduction of sound recording” and “Wholesale electric household goods”. Peeking with Google Street View we might get a glance at the Unit 12 address as it looks today. As far as I can identify, the Unit 12 address today is the one with the company “city electrical factors ltd”, who describe themselves as “electrical wholesalers, suppliers of electrical equipment”.

Why would the Wisepack company invent a name and not attribute the recording to the right conductor? Jonathan Brown may give us a hint. Google Books does not show page 216 of the book, but another page on the Internet may have listed the information from the Brown book. It reads: “The absence of copyright restrictions may explain why this recording has been issued under so many fictitious names”. My take on this is: The recordings with Wagner and Elgar are in the public domain and several record publishers have taken advantage of this and reissued the recordings. They change the attribution to hide the source of the original recording, thereby inventing the ghost of George Richter. We are dealing with some copyright hanky-panky.
Whether this hypothesis is correct I do not know. In support I would say that the Wisepack logo looks like something done in DrawPerfect – not a logo from a highly esteemed company. Brown has some reservations about the identification of apocryphal conductors “because there remain a number of recordings attributed to conductors about whom very little, if anything is known”. Still I say we are dealing with a ghost. And more ghosts to come. The Elgar cello concerto has a Veronique Desbois on the solo instrument. She is likely a ghost too. The ghost of George Richter has also conducted the “Royal Danish Symphony Orchestra” in works by Smetana and Rimsky-Korsakov as well as an overture by Gioacchino Rossini. There are two major orchestras in Copenhagen. In English they are called Danish National Symphony Orchestra and The Royal Danish Orchestra, so which one has the ghost conducted? The Rimsky-Korsakov piece is issued by the Sine Qua Non label that belongs at One Charles Street, Providence, RI, USA. A version of Elgar is apparently also issued by the GR8 label under the brand “Spectacular Classics” (wow, what a name!). George Richter continues to issue CDs. As recently as 2005 a Beethoven CD was published. Here Richter conducted London Symphony Orchestra. You can buy a Richter-Smetana CD at Amazon. This CD also has the work of conductor Henry Adolph – another ghost according to an Anton Bruckner site.

Strange things are going on in classical music. One may begin by reading the Wikipedia article about British record producer William Barrington-Coupe, who according to a judge was involved in “blatant and impertinent frauds, carried out in my opinion rather clumsily.” One of his schemes, exposed in 2007, involved unauthorised copies of commercial recordings. These were rereleased under his wife’s name, Joyce Hatto – and highly acclaimed. Barrington-Coupe and Hatto are real people – non-ghosts – though one of them is dead.
The conductor in the fraud scheme is holocaust survivor René Köhler. He is likely a ghost – an invention of Barrington-Coupe – and died in 2002. The death of George Richter has not been announced, so we may continue to hear recordings from this undead ghost – if he is a ghost. ;-)
(2011-09-21. Minor change: spell correction)
In our Responsible Business in the Blogosphere project we are mining the blogosphere. So far we have mostly considered the microblogosphere represented by Twitter. We have two research articles on that topic: Good Friends, Bad News – Affect and Virality in Twitter and A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.
One of the reasons why we focus on Twitter is that the data we can get is structured. You get a structure in the JSON format back from the Twitter web service that is easy to handle. General blogs are somewhat more difficult to handle: First you need to find the blogs and then you need to extract the relevant data from the webpage. This is not particularly easy, though some interesting tools exist for these tasks.
There are some blogsites that provide relatively easy structured information. The blogsite that I use, Posterous, provides an API that lets users and programmers download information. There are actually two versions: The old one provides information in the XML format, the newer one in JSON.
In my initial effort to make something useful I looked at my own blog using the old API. You need to call a URL with something like:
XML is returned. I did not manage to parse the XML in a structured way (using standard Python libraries) but used an ad hoc approach to turn the XML into a JSON-like structure with the numerical fields converted to numbers and the ‘body’ field with the actual text maintained as HTML. Apart from the postings themselves there are substructures for comments and media files that you might want to handle.
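For comparison, a structured parse with the standard library could look roughly like the sketch below. The XML sample is made up for illustration: the element names (rsp, post, title, date, views, body) are assumptions chosen to match the fields used later in the plotting code, not the documented Posterous schema, and the function name parse_posts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up response for illustration only; the real API output may differ.
SAMPLE_XML = """
<rsp>
  <post>
    <title>First post</title>
    <date>Mon, 02 May 2011 10:00:00 +0000</date>
    <views>120</views>
    <body>&lt;p&gt;Hello&lt;/p&gt;</body>
  </post>
  <post>
    <title>Second post</title>
    <date>Tue, 03 May 2011 10:00:00 +0000</date>
    <views>45</views>
    <body>&lt;p&gt;World&lt;/p&gt;</body>
  </post>
</rsp>
"""


def parse_posts(xml_text):
    """Turn each <post> element into a dictionary (a JSON-like structure)."""
    root = ET.fromstring(xml_text)
    posts = []
    for post_element in root.iter("post"):
        post = {}
        for child in post_element:
            text = child.text or ""
            # Keep 'body' as HTML text; convert purely numeric fields to int.
            if child.tag != "body" and text.strip().isdigit():
                post[child.tag] = int(text)
            else:
                post[child.tag] = text
        posts.append(post)
    return posts


posterous = parse_posts(SAMPLE_XML)
```

Nested substructures such as comments and media files would need their own handling; this sketch only covers the flat fields of each post.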
In this first application I managed to plot the number of views of each blog post as a function of the date. The two articles that got the most views are one about the Milena Penkowa case and another about Natalie Portman, who was in the news due to her recent film Black Swan. Earlier articles that received a substantial number of views – more than normal – were more nerdish accounts of my problems with Ubuntu. My most recent articles have a fairly low number of views. I have several theories why that is.
How easy it is to crawl all Posterous blogs I do not yet know. Compared to Twitter the data you get are less social. In Twitter you have loads of retweets and direct messages between users that you can analyze. In Posterous you do have what corresponds to friends and followers by what is called ‘my subscriptions’. You also have comments.
The Python code that does the plotting is here:
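    import datetime
    import dateutil.parser
    import pytz
    from pylab import plot, text, title, xlabel, ylabel

    # 'posterous' is the list of post dictionaries built from the API response
    views = [p['views'] for p in posterous]
    now = datetime.datetime.now(pytz.utc)
    dates = [dateutil.parser.parse(p['date']) for p in posterous]
    since = [(now - date).days for date in dates]

    plot(since, views, 'yo-')
    ylabel('Views')
    xlabel('Days since now')
    title('Views on Posterous')

    # Label only the posts lying above a hand-tuned line in the plot
    popular = [(d, v, p) for d, v, p in zip(since, views, posterous)
               if v > -6.64 * d + 5000]
    for (d, v, p), a in zip(popular, ['left', 'left', 'center', 'left']):
        text(d, v, p['title'][:31] + '...', horizontalalignment=a)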
The last for-loop is at least PG-13 rated and should not be attempted at home.
For last year’s Eurovision song contest I predicted that Lena Meyer-Landrut would win, – and indeed she did. For this year I am not so sure.
I jumped in on the second semifinal of the Eurovision 2011 song contest with Austria and was not impressed. Neither did The Netherlands make an impression. Belgian a cappella doesn’t win. The Slovakian twins did not quite hit the tone, and the Ukrainian sand painting (sic!) was interesting, but the song was not. Moldova was an interesting oddity – but not particularly good. Pop-nation Sweden’s energetic electropopish Eric Saade had an alright song – but his stage singing sounded out of breath. Cyprus started out folkish, turning rockish, but … Bulgaria could have benefitted from Avril Lavigne on the mic. Macedonia was not good. The most amazing thing about that act was the large screen on the stage background put up by the German hosts. Male-turned-female Israeli former Eurovision winner Dana International does not reach the heights of her former effective powerpop winner. Slovenia was not Anastacia. Nice dress though. Romania had a promising verse and an ok chorus, a classic Eurovision shuffle, but probably not a winner – best so far. The favorite, Estonia, presented an interesting act. Live singing could have been better. Nice orchestration in the verse. “I love Belarus” was run-of-the-mill. Latvia brought an electric guitar and sang ok. The Danish “New Tomorrow” was ok, but as with other acts the stage singing was not precise enough – the front man spends his time running around on the stage instead of concentrating on the singing. And singing over a backing track is silly when the musicians pretend to play their instruments. Ireland’s Lipstick is the winner of this semifinal. Good simplistic song. Yeah. However, the YouTube videos I could find had a muddy sound. During the reprise I heard Bosnia and Herzegovina and that was not a winner.
I could put my money on the classical French tenor Amaury Vassili’s Sognu. He is sufficiently different to stand out. Remember that being different has previously gotten Norway and Finland to win with folk and hard rock. Part of the piece unfortunately leans too popish.
Like last year Google has a forecast and puts Unser Star für Düsseldorf Lena again ahead together with Irish Jedward’s Lipstick. While Lena’s Taken by a stranger is an interesting production with a non-Eurovisionistic sound, the chorus doesn’t really fly high enough. It has gained quite a number of YouTube views. On the other hand one should think that the intimacy of the song disappears on the big live stage. The live performance in connection with the local German Eurovision contest shows a bit of this problem.
Danish TV station DR manages a website with voting and finds Sweden ahead with 20% of the votes, followed by Ireland scoring 12%. Great Britain’s (to me anonymous) boy band gets 11% of the votes, Finland 7% and Bosnia and Herzegovina 5%. France gets only 4%. Lena also gets only 4%. Sweden is way down in Google’s list and the top position on DR must be due to Danes voting for their geographic neighbor, though in terms of YouTube views Sweden has a high number – indeed it surpasses Lena. Finland’s sympathetic song is probably not strong enough to reach the top.
Bookmakers put France at odds 2.5, Ireland at 6 and Lena at 22. What a discrepancy between these odds and Google’s predictions. At odds 22 Lena really seems a bargain. Azerbaijan holds the third best odds and 5 on Google.
Concluding: The predictions of Danish online voting, Google, and the bookmakers are not aligned. The strangest aspect is Google’s and the bookmakers’ different opinions on Lena. You can get even more confused if you start comparing YouTube views. So who should we put most weight on? Initially I thought the French tenor would carry it home easily, not quite having heard all songs, but now I am leaning towards the Irish madness. Ireland would also be the choice if we simply aggregate the three independent predictions: Consensus inference is good.
We usually think of the journalist as the essential part of a newspaper. Indeed they are, but other professional groups are important in the production of a newspaper. In former years typesetters were essential for the production of a newspaper. Now a new type of profession pops up in the news business: the data analytic computer nerd.

In Denmark the company called Kaas & Mulvad brands itself as being specialists in finding news and patterns in complex data. The two guys behind the company have a background in journalism, but in one of their articles they are not afraid of mentioning Python, the web framework Django, Google Fusion and Google Chart. They run a course: “Django for journalists”! Kaas & Mulvad points to a couple of computer-supported journalism (“computer-assisted reporting”) efforts, e.g., the controversial Tampa Bay Mug Shots showing the faces and names of people booked in the last 24 hours in a few counties. The website is associated with St. Petersburg Times and extracts data from public information (county sheriff’s website). In Denmark, the newspaper Information has been at the forefront in data journalism with web developer Johannes Wehner working – not in Django – but in Drupal. Information was the only Danish media outlet to receive the Wikileaks War Logs corpus of 391,832 documents. They write (with my poor translation):

To find a path in the enormous amount of information we first and foremost constructed a searchable database, where it was possible to search in a large number of different ways, both on individual words in the text, on certain dates, on the type of report, on topics and geographical coordinates, regions, etc.

Information has also published material from the Afghanistan leak. Wehner publishes analyses of the different material on the datablog with plots and maps. For data analysts he has also published a comma-separated values file with the threat reports from Afghanistan. My plot displays a simple histogram of the Afghanistan threat reports data (somewhat similar to one of Wehner’s plots). This plot shows an unfortunate increase in the number of threats through the years (until 2009). The Danish foreign ministry has a website giving an overview of Danish achievements in Afghanistan. This is mostly positive, e.g., five million returning refugees, landmine clearing, two million girls in school. I suppose that this is not a Danish achievement alone(!), but a result of the effort of a number of countries, United States, United Kingdom, The Netherlands etc., as well as Afghanistan itself. Comparing the threat reports with the information from the Foreign Ministry there seems to be a discrepancy between negative and positive news from the different sources. Some of the discrepancy can be explained by one type of threat: The threats of the Taliban against schools. As schools for girls become more widespread the nasty Taliban has wider opportunity to target schools. But whether these threats form a major part of the total number of threats I do not know. Information only shows around ten.
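A histogram of threat reports per year, like the one described for the comma-separated values file, can be sketched with the standard library alone. The rows below are made up as a stand-in for the real file, and the column name "date" with ISO-formatted dates is an assumption; the actual file may be laid out differently.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; not data from the real threat reports file.
SAMPLE_CSV = """date,summary
2006-03-14,placeholder
2007-08-02,placeholder
2009-01-20,placeholder
2009-06-11,placeholder
2009-11-30,placeholder
"""


def reports_per_year(csv_file, date_column="date"):
    """Count rows per year, assuming dates start with a four-digit year."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        year = int(row[date_column][:4])
        counts[year] += 1
    # Return the years in ascending order.
    return dict(sorted(counts.items()))


counts = reports_per_year(io.StringIO(SAMPLE_CSV))

# Crude text histogram: one '#' per report.
for year, number in counts.items():
    print(year, "#" * number)
```

For the real file one would pass an opened file object instead of the io.StringIO wrapper and hand the counts to a plotting library.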