Month: May 2011

Stop homosexuals in Danish grocery stores now! (Beware: irony in the title)

Stop. Stop. I am trying to prepare a talk about sentiment analysis in the blogosphere and I am continuously being interrupted by sentiment in the blogosphere!

The talk is about A new ANEW: Evaluation of a word list for sentiment analysis in microblogs where I have developed a word list used to gauge what people feel about companies and products when they write in social media.

But alas, poor Finn. Yesterday I was interrupted by the Marmite viral story: A product I had never heard of was suddenly a hot topic in social media. Brits were puzzled/annoyed/outraged by Denmark not allowing a small Copenhagen-based store to sell Marmite. The problem was that the product was a fortified food item and strict Danish law for functional foods required approval of such items. Usually you hear people defending “natural food”, but in this case you would hear ordinary Brits affectionately defending a product with supplements.

Now today I am interrupted by another product/company story in social media. The Danish grocery store chain Irma has gotten into trouble with an ad campaign targeted towards males (such as I, I suppose). It reads (as a fake note from a mother to a father concerning their son Emil):

Hi darling, Alarm!!! Emil wants to go to ballet! We must have done something wrong. Can’t you do some male-stuff. Marching. Make a fire. Should we have a barbecue this evening? When you get to Irma, please take their magazine “Krydderiet”. Kiss.

Two male ballet dancers, Nikolaj Hübbe and Alexander Kølpin, were not happy when interviewed by the tabloid B.T.

And soon Irma’s Facebook page was hot with comments calling for a withdrawal.

On Monday Irma withdrew the commercial, and “Gitte”, head of marketing, deeply regretted their error on Irma’s Facebook page. They did not want to make a politically incorrect campaign.

Two female communication experts later called out: Homophobia! Sexual discrimination! Their title was “Antifeministic antiballet-gay-rubbish” and they called the ad “just stupid”, “shockingly retarded” and “discount communication”.

The two also argued that Irma is “more than a brand” and invented the notions of “Irmanism” and “Irmanist”: a choosy city dweller with an emotional attachment to the store. They write “Irma […] is not just a store. Irma is an institution in Danish society.”

I usually get my groceries in Irma and (oops!) I thought Irma was just a store. I guess I am not an Irmanist.

Marketing boss Gitte’s withdrawal backfired a bit: A few people on Facebook found the ad humorous and disliked the withdrawal, and blogger Anne Sophia Hermansen noted that humor illiteracy was alive and well in Denmark.

Previous controversial “male” commercials that played on sex are the old Carlsberg poster and the JBS men’s underwear ads with lightly dressed women. These did indeed have racy pictures. Like Irma, JBS stopped their campaign, but not before the pictures and the brand had been widely exposed in media discussing the case.

Here on Thursday a Danish social media mining site reports “Irma” as the top trend. So it seems that the JBS trick has worked well for Irma.

Marmite banned in Denmark viral story

Have you ever heard of Marmite? I never had. Apparently (some?) British people are very fond of it.

On 24 May 2011 the “Denmark bans Marmite” story broke. I think it started with a small store in Copenhagen, “Abigail’s”, selling British products. They were banned from selling Marmite as they did not have an approval. They started a ‘Bring Back Marmite’ campaign. The story was reported on a British expat website by Jason Heppenstall. From there it must have gone further to The Telegraph and The Guardian. And from there it went viral in newspaper comment sections and on Twitter.

Denmark in general quickly got bad comments from angry Brits. Brits suggested banning Denmark, Danish Blue, bacon, Lego, Carlsberg, IKEA [sic], Danish pastries and Hans Christian Andersen, and extraditing Sandi Toksvig.

The next day the Danish Embassy issued an urgent press release trying to catch up on the viral story. They stated that Marmite was not banned but just needed an approval as Marmite was a fortified food. No one had sought an approval. You can get an approval for 6,100 DKK according to the Danish food regulations. The Embassy press release didn’t stop CNN from reporting on the story and they furthermore cited Marianne Ørum, the store owner of Abigail’s:

“You can apply for permission to sell products such as Marmite, but this costs a lot of money and even then the government will probably say no.”

Denmark has traditionally had a “tough stance on functional foods”, see a 2004 article. It has hit products from Kellogg. The Guardian reported on that story back in 2004. The First Post went on to actually ask a scientist at the British Nutrition Foundation about the Marmite “ban”, quoting her as saying “There would certainly be no rationale for a ban in the UK as seen in Denmark.”

Can we compare Kellogg and Marmite? When Kellogg was banned in 2004 there was no strong social media. Now we have 306 Facebook ‘shares’ and 15 tweets on a faraway New Zealand newspaper website reporting on the Marmite story. Brits seem to be much more emotionally attached to Marmite than to Kellogg’s Special K.

Update 2011-05-25 17:00 CEST: I should note that Kellogg now sells Special K in Denmark. If you look at the ingredients you’ll see no iron, so it might be a variant. What is stated on Wikipedia at the moment regarding Special K and Denmark seems not to be correct (Special K is not outlawed). To get Marmite into Denmark the company behind Marmite could stop adding the extra ingredients, or the Danish import company could pay the 6,100 DKK and hope for an approval. I guess bringing the product to Denmark is no problem; it is only an issue if you sell functional food in Denmark.

I first heard of the Marmite story through Ben Goldacre. I am interested in hearing what he has to say about functional food.


Where is the sign function in Python?

“OMG, Python doesn’t have a sign function”, I was almost about to say. I am fairly surprised by that. It has surprised others too. My suggestion is:

def sign(x):
    if x > 0:
        return 1.
    elif x < 0:
        return -1.
    elif x == 0:
        return 0.
    else:
        # NaN fails all three comparisons and is returned unchanged
        return x

Almost the same as KonradVoelkel’s suggestion. Others suggest “just” using math.copysign available from Python 2.6:

def sign(x): return math.copysign(1, x)

However, this implementation returns funny things for NaN values, not NaN as I would expect. Another suggestion also, in my opinion, gives the “wrong” value for NaN (and here zero also has an issue):

def sign(x): return 1 if x >= 0 else -1

“elif x == 0: return 0.” is necessary if you want to return 0.0 from an input of -0.0.
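The three variants discussed here can be compared side by side on the two awkward inputs, NaN and negative zero. This is only a sketch: the function names `sign_branches`, `sign_copysign` and `sign_ternary` are made up for the comparison, and the sign that `math.copysign` reports for a NaN depends on the NaN's sign bit, so no particular value is claimed for it.

```python
import math


def sign_branches(x):
    """Branching version: NaN fails every comparison and is returned as is."""
    if x > 0:
        return 1.0
    elif x < 0:
        return -1.0
    elif x == 0:
        return 0.0  # also maps -0.0 to 0.0
    else:
        return x


def sign_copysign(x):
    """copysign version: copies the sign bit, so it never returns 0 or NaN."""
    return math.copysign(1, x)


def sign_ternary(x):
    """Ternary version: everything that is not >= 0 becomes -1, NaN included."""
    return 1 if x >= 0 else -1


nan = float("nan")
for name, func in [("branches", sign_branches),
                   ("copysign", sign_copysign),
                   ("ternary", sign_ternary)]:
    print(name, func(nan), func(-0.0))
```

Only the branching version propagates NaN and returns a zero for a zero input; the other two always answer with plus or minus one.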

My implementation seems to be in alignment with Matlab and Octave, though they handle string input differently. Here are some example calls to the function with numerical input:

>>> sign(-np.nan) 
>>> sign(np.nan) 
>>> sign(np.inf) 
>>> sign(-np.inf) 
>>> sign(-2) 
>>> sign(+2) 
>>> sign(+0.0) 
>>> sign(-0.0) 
>>> sign(-np.inf) 
>>> sign(np.inf) 

Update 20 May 2011: One would think that such a simple function couldn’t go wrong, but it did: For an empty list, sign([]), and for None, sign(None), the function produces inappropriate results. Furthermore, an appropriate ‘sign’ function is defined in NumPy, e.g., np.sign(np.nan) works nicely.
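One possible way to guard against such inputs is to reject anything that is not a real number before comparing. The sketch below is an assumption about what "appropriate" could mean here, not the behavior of NumPy's function: it raises `TypeError` for `None`, lists and (as a design choice) booleans.

```python
import math
from numbers import Real


def sign(x):
    """Return -1.0, 0.0 or 1.0 for real numbers and NaN for NaN.

    Anything that is not a real number (None, lists, strings, ...) raises
    TypeError instead of silently producing a result.
    """
    if isinstance(x, bool) or not isinstance(x, Real):
        raise TypeError("sign() expects a real number, got %r" % (x,))
    if math.isnan(x):
        return float("nan")
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


print(sign(-2), sign(0.0), sign(float("inf")))
```

Raising an exception is one option; returning NaN for non-numbers would be another, depending on how the function is used downstream.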

Hunting down the undead ghost of classical conductor George Richter

At one point in life I acquired a CD with famous works of Edward Elgar: Pomp and Circumstance, Nimrod, Sospiri, the Cello Concerto and all that. I found the recording fairly good. The cover stated that the conductor was George Richter, handling the London Symphony Orchestra. Googling George Richter I can’t say that I found much. I found several references to CDs but not much about the man. The London Symphony Orchestra has had a Richter as conductor: Hans Richter. Perhaps George was related? But I could not find any information about that.

Then on 6 December 2009 I added George Richter to Wikipedia in the hope that someone would find more sources. But on 16 May 2011 a fierce Wikipedia deletionist came by the article threatening to kill poor George due to lack of references, a cardinal sin for articles about living persons on Wikipedia. Interestingly though, the deletionist had diligently discovered one single reference through Google Books: Jonathan Brown’s (who is he?) book from 2000, “Tristan und Isolde on Record. A comprehensive discography of Wagner’s music drama with a critical introduction to the recordings.”

And now comes the spooky part.

On page xiii Brown describes our George as an “apocryphal conductor”. So what does that mean? That George didn’t get to join in on the Bible along with Mary, Moses and the rest of the band? No. Further on, on page 215, Brown states that one Wagnerian Richter recording is actually by Heinrich Hollreiser, and that the recording is not, as stated, with the London Symphony Orchestra but rather with the Bamberg Symphony Orchestra. Brown, the Wagnerian discographer, had timed the different recordings of Wagner and found that Hollreiser’s recording has been issued under a number of other names: Heinrich Heller, Hans Burg, Ralph deCross, Otto Friedlich, Karl Ritter, Leon Szalar and our George Richter. It seems likely that the Elgar recording is also not by George Richter but by another, yet unidentified conductor, perhaps Hollreiser?

Apparently George Richter is an invented name. Why?

My Elgar CD brands itself as “Apollo Classics” and the company issuing the CD, in 1995, is “Wisepack Ltd.”. Tracking this company I find that Business Directory North Central London records Wisepack Ltd.’s address as “Unit 12. Latimer Road. London. W10 6RG” with a “PIN Tel.” 0904 049 8229. Their business is “production of records tapes and CDS”. So we are getting closer. The UK company register Companies House lists two entries for Wisepack: “Wisepack (1992) Limited”, incorporated 24 January 1992 and dissolved 2 April 1996, company number 02680885, with an unknown nature of business. The other entry is “Wisepack Limited”, incorporated 1988 and dissolved 29 September 1998, with company number 02245831 and at the Latimer Road address. Their nature of business is stated as “Publishing of sound recordings”, “Reproduction of sound recording” and “Wholesale electric household goods”. Peeking with Google Street View we might get a glance at the Unit 12 address as it looks today. As far as I can identify, the Unit 12 address today is the one with the company “city electrical factors ltd”, who describe themselves as “electrical wholesalers, suppliers of electrical equipment”.

Why would the Wisepack company invent a name and not attribute the recording to the right conductor?

Jonathan Brown may give us a hint. Google Books does not show page 216 of the book, but another page on the Internet may have listed the information from the Brown book. It reads: “The absence of copyright restrictions may explain why this recording has been issued under so many fictitious names”. My take on this is: The recordings with Wagner and Elgar are in the public domain and several record publishers have taken advantage of this and reissued the recordings. They change the attribution to hide the source of the original recording and thereby invent the ghost of George Richter. We are dealing with some copyright hanky-panky.

Whether this hypothesis is correct I do not know. In support I would say that the Wisepack logo looks like something done in DrawPerfect, not a logo from a highly esteemed company. Brown has some reservations about the identification of apocryphal conductors “because there remain a number of recordings attributed to conductors about whom very little, if anything is known”. Still I say we are dealing with a ghost. And more ghosts are to come. The Elgar cello concerto has a Veronique Desbois at the solo instrument. She is likely a ghost too.

The ghost of George Richter has also conducted the “Royal Danish Symphony Orchestra” in works by Smetana and Rimsky-Korsakov as well as an overture by Gioacchino Rossini. There are two major orchestras in Copenhagen. In English they are called the Danish National Symphony Orchestra and The Royal Danish Orchestra, so which one has the ghost conducted? The Rimsky-Korsakov piece is issued by the Sine Qua Non label, located at One Charles Street, Providence, RI, USA. A version of Elgar is apparently also issued by the GR8 label under the brand “Spectacular Classics” (wow, what a name!). George Richter continues to issue CDs. As recently as 2005 a Beethoven CD was published. Here Richter conducted the London Symphony Orchestra. You can buy a Richter-Smetana CD at Amazon. This CD also has the work of conductor Henry Adolph, another ghost according to an Anton Bruckner site.

Strange things are going on in classical music. One may begin by reading the Wikipedia article about British record producer William Barrington-Coupe, who according to a judge was involved in “blatant and impertinent frauds, carried out in my opinion rather clumsily.” One of his schemes, exposed in 2007, involved unauthorised copies of commercial recordings. These were rereleased under his wife’s name, Joyce Hatto, and highly acclaimed. Barrington-Coupe and Hatto are real people (non-ghosts), though one of them is dead. The conductor in the fraud scheme is holocaust survivor René Köhler. He is likely a ghost, an invention of Barrington-Coupe, and died in 2002. The death of George Richter has not been announced, so we may continue to hear recordings from this undead ghost, if he is a ghost. ;-)

(2011-09-21. Minor change: spell correction)

Mining my Posterous blog: API, XML and plot

In our Responsible Business in the Blogosphere project we are mining the blogosphere. So far we have mostly considered the microblogosphere represented by Twitter. We have two research articles on that topic: Good Friends, Bad News – Affect and Virality in Twitter and A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.

One of the reasons why we focus on Twitter is that the data we can get is structured. You get a structure in the JSON format back from the Twitter web service that is easy to handle. General blogs are somewhat more difficult to handle: First you need to find the blogs and then you need to extract the relevant data from the webpage. This is not particularly easy, though some interesting tools exist for these tasks.

There are some blog sites that provide structured information relatively easily. The blog site that I use, Posterous, provides an API that lets users and programmers download information. There are actually two versions: The old one provides information in the XML format, the newer one in JSON.

In my initial effort to make something useful I looked at my own blog using the old API. You need to call a URL with something like:

XML is returned. I did not manage to parse the XML in a structured way (using standard Python libraries) but used an ad hoc approach to turn the XML into a JSON-like structure with the numerical fields converted to numbers and the ‘body’ field with the actual text maintained as HTML. Apart from the postings themselves there are substructures for comments and media files that you might want to handle.
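For comparison, a structured parse with the standard library's `xml.etree.ElementTree` could look roughly like the sketch below. The sample XML and its element names (`rsp`, `post`, `views`, and so on) are assumptions made up for illustration and are not taken from the Posterous documentation; nested substructures such as comments and media are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; element names are assumed, not the documented schema.
SAMPLE_XML = """<rsp stat="ok">
  <post>
    <id>1</id>
    <title>First post</title>
    <views>120</views>
    <date>2011-05-01 10:00:00</date>
    <body>&lt;p&gt;Hello&lt;/p&gt;</body>
  </post>
  <post>
    <id>2</id>
    <title>Second post</title>
    <views>45</views>
    <date>2011-05-02 09:30:00</date>
    <body>&lt;p&gt;World&lt;/p&gt;</body>
  </post>
</rsp>"""


def convert(text):
    """Turn purely numeric strings into ints; leave everything else as text."""
    text = (text or "").strip()
    return int(text) if text.isdigit() else text


def parse_posts(xml_string):
    """Return one dictionary per <post> element, keyed by child tag name."""
    root = ET.fromstring(xml_string)
    return [{child.tag: convert(child.text) for child in post}
            for post in root.iter("post")]


posterous = parse_posts(SAMPLE_XML)
print(posterous[0]["title"], posterous[0]["views"])
```

The result is a list of dictionaries with numeric fields as integers and the escaped `body` field decoded back to HTML, which is the kind of JSON-like structure described above.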

In this first application I managed to plot the number of views of each blog post as a function of the date. The two articles that got the most views are one about the Milena Penkowa case and one about Natalie Portman, who was in the news due to her recent film Black Swan. Earlier articles that received a substantial number of views, more than normal, were more nerdish accounts of my problems with Ubuntu. My most recent articles have a fairly low number of views. I have several theories why that is.

How easy it is to crawl all Posterous blogs I do not yet know. Compared to Twitter the data you get are less social. In Twitter you have loads of retweets and direct messages between users that you can analyze. In Posterous you do have what corresponds to friends and followers by what is called ‘my subscriptions’. You also have comments.

The Python code that does the plotting is here:

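import datetime
import dateutil.parser
from matplotlib.pyplot import plot, xlabel, title, text

# 'posterous' is the list of post dictionaries built from the XML
views = [ p['views'] for p in posterous ]
# Assumed value: the current time (requires the parsed dates to be timezone-naive)
now = datetime.datetime.now()
dates = [ dateutil.parser.parse(p['date']) for p in posterous ]
since = [ (now - date).days for date in dates ]

plot(since, views, 'yo-')
xlabel('Days since now')
title('Views on Posterous')

# Label only the posts lying above an ad hoc line in the (days, views) plane
for (d, v, p), a in zip(
        filter(lambda t: t[1] > -6.64 * t[0] + 5000, zip(since, views, posterous)),
        ['left', 'left', 'center', 'left']):
    text(d, v, p['title'][:31] + '...', horizontalalignment=a)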

The last for-loop is at least PG-13 rated and should not be attempted at home.

Eurovision 2011 prediction

For last year’s Eurovision song contest I predicted that Lena Meyer-Landrut would win, – and indeed she did. For this year I am not so sure.

I jumped in on the second semifinal of the Eurovision 2011 song contest with Austria and was not impressed. Neither did The Netherlands make an impression. Belgian a cappella doesn’t win. The Slovakian twins did not quite hit the notes, and the Ukrainian sand painting (sic!) was interesting, but the song was not. Moldova was an interesting oddity, but not particularly good. Pop-nation Sweden’s energetic electropoppish Eric Saade had an alright song, but his stage singing sounded out of breath. Cyprus started out folkish, turning rockish, but … Bulgaria could have benefitted from Avril Lavigne on the mic. Macedonia was not good. The most amazing thing about that act was the large screen on the stage background put up by the German hosts. Male-turned-female Israeli former Eurovision winner Dana International does not reach the heights of her former effective powerpop winner. Slovenia was not Anastacia. Nice dress though. Romania had a promising verse and an ok chorus, a classic Eurovision shuffle, but probably not a winner, though the best so far. Favorite Estonia presented an interesting act. The live singing could have been better. Nice orchestration in the verse. “I love Belarus” was run-of-the-mill. Latvia brought an electric guitar and sang ok. The Danish “New Tomorrow” was ok, but as with other acts the stage singing was not precise enough: the frontman spends his time running around on the stage instead of concentrating on the singing. And a backing track is silly when the musicians pretend to play their instruments. Ireland’s Lipstick is the winner of this semifinal. Good simplistic song. Yeah. However, the YouTube videos I could find had a muddy sound. During the reprise I heard Bosnia and Herzegovina and that was not a winner.

I didn’t hear the first semifinal, but I heard Norway had great hopes for Haba Haba. Not I. It was too conventional. Better then is Haba Haba Sut Sut :-)

I could put my money on the classical French tenor Amaury Vassili’s Sognu. He is sufficiently different to stand out. Remember that being different has previously gotten Norway and Finland to win with folk and hard rock. Part of the piece is unfortunately a bit too poppish.

Like last year Google has a forecast and again puts Unser Star Für Düsseldorf Lena ahead, together with Irish Jedward’s Lipstick. While Lena’s Taken by a Stranger is an interesting production with a non-Eurovisionistic sound, the chorus doesn’t really fly high enough. It has gained quite a number of YouTube views. On the other hand one should think that the intimacy of the song disappears on the big live stage. The live performance in connection with the local German Eurovision contest shows a bit of this problem.

Danish TV station DR manages a website with voting and finds Sweden ahead with 20% of the votes, followed by Ireland with 12%. Great Britain’s (to me anonymous) boy band gets 11% of the votes, Finland 7% and Bosnia and Herzegovina 5%. France gets only 4%. Lena also gets only 4%. Sweden is way down in Google’s list, and its top position on DR must be due to Danes voting for their geographic neighbor, though in terms of YouTube views Sweden has a high number and indeed surpasses Lena. Finland’s sympathetic song is probably not strong enough to reach the top.

Bookmakers put France at odds of 2.5, Ireland at 6 and Lena at 22. What a discrepancy between these odds and Google’s predictions. At odds of 22 Lena really seems a bargain. Azerbaijan holds the third best odds and is number 5 on Google.

Concluding: The predictions of Danish online voting, Google, and the bookmakers are not aligned. The strangest aspect is Google’s and the bookmakers’ differing opinions on Lena. You can get even more confused if you start comparing YouTube views. So who should we put most weight on? Initially, not quite having heard all the songs, I thought the French tenor would carry it home easily, but now I am leaning towards the Irish madness. Ireland would also be the choice if we simply aggregate the three independent predictions: Consensus inference is good.
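A very crude form of such an aggregation is to let each source "vote" for its top two entries and count the votes. The sketch below uses only the top-two placements mentioned in this post; treating the two entries Google puts ahead as its top two is my reading, and the approach ignores how far apart the entries are within each source.

```python
from collections import Counter

# Top two entries per source, as described in the text of this post.
top_two = {
    "DR online voting": ["Sweden", "Ireland"],
    "Google": ["Germany", "Ireland"],
    "Bookmakers": ["France", "Ireland"],
}

# One vote per appearance in a top-two list.
votes = Counter(country for picks in top_two.values() for country in picks)
consensus, count = votes.most_common(1)[0]
print(consensus, count)
```

Ireland is the only entry that appears in all three lists, which is what makes it the consensus pick here.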

Two Danes, Lars Halvor Jensen and Martin Michael Larsson, are behind the Irish bubblegum dance. Last year a Danish composer also won by supplying Lena with the song Satellite.

Datajournalism: so has your newspaper published comma-separated values files recently?

We usually think of the journalist as the essential part of a newspaper. Indeed they are, but other professional groups are important in the production of a newspaper. In former years typesetters were essential for the production of a newspaper. Now a new type of profession pops up in the news business: the data-analyzing computer nerd.

In Denmark the company called Kaas & Mulvad brands itself as a specialist in finding news and patterns in complex data. The two guys behind the company have a background in journalism, but in one of their articles they are not afraid of mentioning Python, the web framework Django, Google Fusion and Google Chart. They run a course: “Django for journalists”!

Kaas & Mulvad points to a couple of computer-supported journalism (“computer-assisted reporting”) efforts, e.g., the controversial Tampa Bay Mug Shots showing the faces and names of people booked in the last 24 hours in a few counties. The website is associated with the St. Petersburg Times and extracts data from public information (county sheriff’s website).

In Denmark, the newspaper Information has been at the forefront of data journalism, with web developer Johannes Wehner working not in Django but in Drupal. Information was the only Danish media outlet to receive the 391,832-document WikiLeaks War Logs corpus. They write (with my poor translation):

To find a path in the enormous amount of information we first and foremost constructed a searchable database, where it was possible to search in a large number of different ways, both on individual words in the text, on certain dates, on the type of report, on topics and geographical coordinates, regions, etc.

Information has also published material from the Afghanistan leak. Wehner publishes analyses of the different material on the datablog with plots and maps. For data analysts he also has published a comma-separated values file with the threat reports from Afghanistan.

My plot displays a simple histogram of the Afghanistan threat reports data (somewhat similar to one of Wehner’s plots). This plot shows an unfortunate increase in the number of threats through the years (until 2009). The Danish foreign ministry has a website giving an overview of Danish achievements in Afghanistan. This is mostly positive, e.g., five million returning refugees, landmine clearing, two million girls in school. I suppose that this is not a Danish achievement alone(!), but a result of the effort of a number of countries, the United States, the United Kingdom, The Netherlands etc., as well as Afghanistan itself. Comparing the threat reports with the information from the Foreign Ministry there seems to be a discrepancy between negative and positive news from the different sources. Some of the discrepancy can be explained by one type of threat: the threats of the Taliban against schools. As schools for girls become more widespread the nasty Taliban has wider opportunity to target schools. But whether these threats form a major part of the total number of threats I do not know. Information only shows around ten.
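A histogram of this kind needs little more than counting rows per year. The sketch below reads a comma-separated values file with the standard `csv` module; the column name `date`, the ISO date format and the sample rows are invented placeholders, since the layout of the published file is not described here.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows; the real file's columns and contents may differ.
SAMPLE_CSV = """date,type
2006-03-14,Threat Report
2007-07-02,Threat Report
2007-11-20,Threat Report
2008-01-05,Threat Report
2008-06-17,Threat Report
2008-09-30,Threat Report
"""


def reports_per_year(handle, date_column="date"):
    """Count rows per calendar year, assuming dates start with YYYY."""
    reader = csv.DictReader(handle)
    return Counter(row[date_column][:4] for row in reader)


counts = reports_per_year(io.StringIO(SAMPLE_CSV))
for year in sorted(counts):
    # A text "bar" per year stands in for the plotted histogram.
    print(year, "#" * counts[year])
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, and the counts could be passed to a plotting library instead of printed.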

Danish computational humor (including the European Parliament)

Last year, in 2010, I looked a bit closer at Danish text mining. The text mining I have done so far has mostly been in English (see, e.g., Mining the posterior cingulate: Segregation between memory and pain components), so stop word lists and sentiment word lists are in English. I had done a bit of text mining on fairy tale writer Hans Christian Andersen’s The Ugly Duckling, so far with few interesting results.

To have a bit of fun I started looking at Danish humor. Researchers have done humor text mining for some time now, e.g., Rada Mihalcea has written a few papers. One is Characterizing Humour: An Exploration of Features in Humorous Texts. The simple approach is to assemble a data set of jokes, e.g., one-liners, and contrast it with a non-humorous data set using a machine learning classifier. Mihalcea used Reuters news, proverbs and “British National Corpus” sentences.

Following the Mihalcea approach I gathered a small data set of just 497 jokes. Mihalcea collected 16,000 one-liners! To contrast the jokes I found Danish sentences from the European Parliament available in NLTK as well as sentences from The Ugly Duckling. I then used the naïve Bayes classifier in NLTK in a straightforward manner on the three classes of texts.
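To illustrate the idea without depending on NLTK, here is a minimal hand-rolled multinomial naive Bayes classifier on bag-of-words features. It is not NLTK's implementation and not the code used for the results in this post; the class name `NaiveBayes`, the add-one smoothing and the tiny two-class training set (sentences quoted elsewhere in this post) are choices made for the sketch.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    words = (w.strip(".,:;!?()\"'") for w in text.lower().split())
    return [w for w in words if w]


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of texts
        self.vocabulary = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def log_scores(self, text):
        """Log prior plus summed log likelihoods, with add-one smoothing."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


classifier = NaiveBayes()
classifier.train([
    ("Hvorfor sømmer man låget fast på en kiste?", "joke"),
    ("Hvordan smider man en affaldscontainer væk?", "joke"),
    ("Den var meget lille", "parliament"),
    ("De 15 er åbenbart ikke nok", "parliament"),
    ("Fagforeningerne kommer, industrien kommer", "parliament"),
])
print(classifier.classify("Hvorfor smider man låget væk?"))
```

With only five training sentences the classifier mostly memorizes vocabulary, but the mechanics are the same as with hundreds of jokes: each class gets a word distribution, and a new sentence goes to the class under which its words are most probable.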

Mihalcea reports that among funny features are human-centric vocabulary (you, I, woman, man, etc.), negation, negative orientation (failure, illegal, etc.), professional communities (those poor lawyers) and human “weakness” (stupidity, alcohol, steal, lie).

Running the “show most informative features” of NLTK I find that some of the important words for jokes are: mand (man), manden (the man), hjem (home), sidder (sits), laver (makes), ældre (older), hedder (is called), hvorfor (why), hus (house), pludselig (suddenly), gave (present), bor (lives), dør (dies), hvornår (when), tog (train/took), spørger (asks), hvem (who). Further down the list I find advokat (lawyer). “man” is human-centric, but why are “home” and “sits” prevalent in jokes?

What is on the word list depends much on what you contrast with, e.g., du (you) and gik (went) appear as important words for the fairytale. For the European Parliament contrasted with jokes, words such as hr (Mr.), Europa, fru (Ms.), denne (this), støtte (support), disse (these) and også (also) are important.
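The dependence on the contrast set can be seen in a small sketch that ranks words by how much more often they occur in one set of texts than in another. This is a simplified stand-in, not NLTK's own routine: the function name `informative_words`, the smoothing constants and the toy sentences (taken from examples in this post, with punctuation removed) are assumptions for illustration.

```python
from collections import Counter


def informative_words(docs_a, docs_b, top=5):
    """Rank words by smoothed ratio of document frequency in docs_a vs docs_b."""
    def presence(docs):
        counts = Counter()
        for doc in docs:
            counts.update(set(doc.lower().split()))  # count each word once per doc
        return counts

    count_a, count_b = presence(docs_a), presence(docs_b)
    ratios = {}
    for word in set(count_a) | set(count_b):
        p_a = (count_a[word] + 0.5) / (len(docs_a) + 1)
        p_b = (count_b[word] + 0.5) / (len(docs_b) + 1)
        ratios[word] = p_a / p_b
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(ratios, key=lambda w: (-ratios[w], w))[:top]


jokes = ["hvorfor sømmer man låget fast på en kiste",
         "hvordan smider man en affaldscontainer væk"]
parliament = ["den var meget lille",
              "de 15 er åbenbart ikke nok"]

print(informative_words(jokes, parliament, top=2))
print(informative_words(parliament, jokes, top=2))
```

Swapping the two arguments gives a different list, which is the point made above: a word is only "informative" relative to what it is contrasted with.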

Mihalcea uses one-liners while I use general jokes. Jokes are often formed as a question, which is why I find “why” and “when” as important joke words. The jokes scoring high with the joke classifier are also mostly questions. Some examples:

  • Hvorfor sømmer man låget fast på en kiste? (Why do they nail the lid on a coffin?)
  • Hvordan smider man en affaldscontainer væk? (How do you throw away a garbage bin?)
  • Hvis man spiser pasta og antipasta – er man så stadig lige sulten? (If you eat pasta and antipasta – are you then still just as hungry?)

Non-question jokes examples are:

  • Godt: Hed udendørs sex. Dårligt: Du bliver anholdt. Værre: Af din mand. (Good: Hot outdoor sex. Bad: You get arrested. Worse: By your husband)
  • Og så var der fragtskibet, der var lastet med yoyoer. Det sank 50 gange. (And then there was the story about the ship that carried yo-yos. It sank 50 times)

Both of these follow a joke scheme: “Good, bad, worse” or “And then there was the story about”.

Among jokes classified as not a joke is the following verbose account:

“Selv om man kun måtte køre 50 km/t gennem den lille by, kørte de fleste stærkere. Man satte skilte op med tekster som ‘legende børn’, ‘vis hensyn’, og ‘skole’, men intet hjalp. Lige indtil man satte et skilt op hvor der stod: ‘nudistlejr'”

translated to:

“Even though you were only allowed to drive 50 km/h through the small town, most drivers drove faster. They put up signs with texts such as ‘children playing’, ‘show consideration’ and ‘school’, but nothing helped. Not until they put up a sign saying ‘nudist camp’.”

It is funny to look at the sentences from the European Parliament corpus that (erroneously) get a relatively high probability of being a joke. Here are some daring jokes from the European Parliament picked from the top 40:

  • Den var meget lille (It was very small)
  • De 15 er åbenbart ikke nok (The 15 are apparently not enough)
  • Fagforeningerne kommer, industrien kommer (The unions come, the industry comes)
  • Det er der heller ingen, der kan forstå (That is something no one can understand)

Ubuntu upgrading: Sometimes reinstallation is necessary

I got an awful vaccination against Ubuntu 11.04. I attempted an upgrade from Ubuntu 10.10 on an Acer Aspire One netbook. In the middle of the process I suspect the screen saver may have crashed (sundancer, was that you?) and some other things also happened. I had to do a hard reset. This left me in a state where the system wouldn’t complete the boot. After some fsck.ext4 I could get to “mount -o remount,rw” and “sudo dpkg --configure -a”, which seemed to go through alright. However, there was still something fishy. The SMART harddisk diagnostics showed some bad things. It wasn’t clear to me whether this was a bad harddisk issue or something Linux and BIOS related, e.g., with ACPI. With some parameters changed for the booting (noacpi?) I got different error messages in my log file.

Among the error messages I got were “failed command: READ FPDMA QUEUED” and “ata1.00 … status: { DRDY ERR } … error { UNC }”. From what I could gather my central package database (/var/lib/dpkg/status) was affected. “sudo cp /var/lib/dpkg/status-old /var/lib/dpkg/status” could not help as the two files seemed to be the same…

Perhaps it wasn’t an Ubuntu 11.04 fault but a harddisk error triggered by the upgrade. After several hours I came to the conclusion that it was a harddisk issue. At one point I had bought a wrong harddisk, so I actually had one spare harddisk and could replace the bad one. With that one in place I reinstalled Ubuntu 10.10.

Setting up the computer I ran into my classic multiple workspace/edge flip problem. I cannot find the configuration for Compiz, and Metacity might have a “row” bug in the workspace switcher. Brightside seems to work alright for edge flipping with Metacity.

Does the frequency of computer errors ever converge towards zero? It is interesting to watch the plot of Debian’s release-critical bugs through time. The Debian folks can squeeze the relevant release-critical bugs fairly low around the time of release, but otherwise the frequency of bugs tends to increase…

For my comfort, Open Source blogger Peter Toft reports his sentiment about an Ubuntu 11.04 upgrade. He too has run into trouble, and the people commenting on his blog share the sentiment.

An occasional harddisk crash is good. All the rubbish you accumulate gets wiped out and you are able to start afresh. Unfortunately, I was able to make a backup and did. So now I have the problem of restoring the backup to the new harddisk.


(Update 5 May 2011: For ppl searching on error messages here are a couple: “Current Pending Sector Count” was bad in the smart detector that I could run with a Live CD. “The disk drive for / is not ready or not present.” and “init: udevtrigger main process (453) terminated with status 1” were among the first error messages I could see.)

(Correction: 12 July: Added a fairly important ‘not’ in the sentence “sudo cp /var/lib/dpkg/status-old /var/lib/dpkg/status” could not help as the two files seemed to be the same)

Denial of Service crawl on the Brede Wiki?

Just as I was about to download a meta-analytic comma-separated values file from the Brede Wiki my server with the wiki got in deep trouble. Though there was some response it was really slow. I had to do a hard reset. When I looked in the log files I could see something like “trx0undo.c … Mutex at … created file trx0rseg.c” and “InnoDB: Warning: a long semaphore wait”. I had a similar problem yesterday.

I was afraid that this might be a harddisk issue, but the harddisk utility command “smartctl -a /dev/hda1” reported nothing.

If one googles the error message a few bugs and questions show up, but apparently nothing that could help me.

When I then looked in the Apache log (/var/log/apache2/access.log) I could see aggressive downloading from a specific foreign university computer, with several requests made per second at around the time when the server got into trouble. So it might be that MediaWiki/MySQL has a problem there, not being able to handle that number of requests. I wrote the following email to the university department:
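Spotting such a client does not have to be done by eye. The sketch below counts requests per client and second in an access log and reports clients whose peak rate exceeds a limit. It assumes the log lines start in the Common Log Format (address, two fields, bracketed timestamp); the sample lines, the addresses from the documentation range 192.0.2.0/24 and the limit of two requests per second are all invented for the example.

```python
import re
from collections import Counter

# Start of a Common Log Format line: client address, identity, user, [timestamp]
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def busy_clients(lines, max_per_second=2):
    """Return {address: peak requests within one second} for clients above the limit."""
    per_second = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            per_second[match.groups()] += 1  # key: (address, timestamp)
    peaks = {}
    for (address, _), n in per_second.items():
        peaks[address] = max(peaks.get(address, 0), n)
    return {address: n for address, n in peaks.items() if n > max_per_second}


# Invented sample lines; not taken from the actual server log.
sample = [
    '192.0.2.10 - - [06/May/2011:14:03:21 +0200] "GET /wiki/A HTTP/1.1" 200 5120 "-" "firefox 3.0"',
    '192.0.2.10 - - [06/May/2011:14:03:21 +0200] "GET /wiki/B HTTP/1.1" 200 4096 "-" "firefox 3.0"',
    '192.0.2.10 - - [06/May/2011:14:03:21 +0200] "GET /wiki/C HTTP/1.1" 200 2048 "-" "firefox 3.0"',
    '192.0.2.20 - - [06/May/2011:14:03:22 +0200] "GET /wiki/A HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
print(busy_clients(sample))
```

For a real log the list would be replaced by an open file handle, since the function only iterates over lines.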

Dear … of Computer Science,

I am recording aggressive downloads on my Web server from 999.999.999.999 which resolves to …, so it must be a computer at your site.

The amount of downloads unfortunately makes my server stall; it is a rather old computer that cannot handle much load. It is probably a bot (perhaps constructed by one of your students) that has been set up to crawl my site. I hope you can contact the person who is responsible for the bot and ask him to moderate the download rate. At the moment I am getting several requests per second from the 999.999.999.999 computer.

The person behind the bot has set the agent field wrong. At the moment it displays “firefox 3.0”, which I very much doubt.

If it is not possible for you to contact the person I might have to set up a firewall rule blocking the University of … from accessing my Web server.


I have now also added “Crawl-delay: 3” to the robots.txt file. I do not know how well different crawlers implement that directive.
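As one data point on how a well-behaved crawler might read the directive, Python's standard `urllib.robotparser` exposes it through `crawl_delay()`. In the sketch below the `User-agent: *` line is an assumption (only the `Crawl-delay` line is mentioned above) and `ExampleBot` is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents: only the Crawl-delay line is known from the post.
ROBOTS_TXT = """User-agent: *
Crawl-delay: 3
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical agent; it falls under the wildcard entry.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```

A crawler built on this parser would still have to sleep for that many seconds between requests itself; the directive is advisory and nothing enforces it on the server side.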

If it is the case that the request rate has caused the problem I am a bit puzzled that MediaWiki/MySQL cannot handle that rate. It is a fairly old computer, but it should fail gracefully. Maybe I need to go over the configuration. I suppose the issue might be around “$wgDisableCounters” that I believe must require a write during the reading process. It is nice to have the download statistics but not essential.