Danish computational humor (including the European Parliament)

Last year, in 2010, I took a closer look at Danish text mining. The text mining I have done so far has mostly been in English (see, e.g., Mining the posterior cingulate: Segregation between memory and pain components), so my stop word lists and sentiment word lists are in English. I had done a bit of text mining on fairy tale writer Hans Christian Andersen’s The Ugly Duckling, so far with few interesting results.

To have a bit of fun I started looking at Danish humor. Researchers have done humor text mining for some time now, e.g., Rada Mihalcea has written a few papers. One is Characterizing Humour: An Exploration of Features in Humorous Texts. The simple approach is to assemble a data set of jokes, e.g., one-liners, and contrast it with a non-humorous data set using a machine learning classifier. Mihalcea used Reuters news, proverbs and “British National Corpus” sentences.

Following the Mihalcea approach I gathered a small data set of just 497 jokes. Mihalcea collected 16,000 one-liners! To contrast with the jokes I used Danish sentences from the European Parliament corpus available in NLTK as well as sentences from The Ugly Duckling. I then used the naïve Bayes classifier in NLTK in a straightforward manner on the three classes of texts.
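A minimal sketch of such a three-class setup, not the actual code behind this post: it assumes simple bag-of-words presence features and a handful of toy sentences instead of the real data. The names `features`, `corpus` and the three labels are made up for illustration. The jokes are quoted from this post, while the parliament-like and fairy-tale-like sentences are invented placeholders. With NLTK's corpus data installed, the real parliament sentences should be reachable through `nltk.corpus.europarl_raw.danish.sents()`.

```python
import re

from nltk.classify import NaiveBayesClassifier


def features(text):
    """Bag-of-words presence features: {word: True} for each distinct token."""
    return {word: True for word in re.findall(r"\w+", text.lower())}


# Toy stand-ins for the three corpora (assumption: the real data sets are
# far larger). Only the jokes are quoted from the post; the rest is invented.
corpus = {
    "joke": [
        "Hvorfor sømmer man låget fast på en kiste?",
        "Hvordan smider man en affaldscontainer væk?",
        "Og så var der fragtskibet, der var lastet med yoyoer. Det sank 50 gange.",
    ],
    "parliament": [
        "Hr. formand, vi støtter også denne betænkning.",
        "Fru kommissær, Europa har brug for disse reformer.",
        "Hr. formand, Europa må også støtte disse lande.",
    ],
    "fairytale": [
        "Ællingen gik ud i den store verden.",
        "Du er så grim, sagde anden, og ællingen gik sin vej.",
        "Så gik den lille ælling ned til vandet.",
    ],
}

# NLTK expects a list of (featureset, label) pairs.
train_set = [
    (features(sentence), label)
    for label, sentences in corpus.items()
    for sentence in sentences
]
classifier = NaiveBayesClassifier.train(train_set)

# Classify two unseen sentences built from words in the toy training data.
print(classifier.classify(features("Hvorfor smider man låget væk?")))
print(classifier.classify(features("Hr. formand, Europa støtter disse lande.")))
```

Words never seen during training are ignored by NLTK's naïve Bayes classifier, so with a data set this small only sentences that reuse the training vocabulary get a meaningful label.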

Mihalcea reports that among funny features are human-centric vocabulary (you, I, woman, man, etc.), negation, negative orientation (failure, illegal, etc.), professional communities (those poor lawyers) and human “weakness” (stupidity, alcohol, steal, lie).

Running the “show most informative features” function of NLTK I find some of the important words for jokes to be: mand (man), manden (the man), hjem (home), sidder (sits), laver (makes), ældre (older), hedder (is called), hvorfor (why), hus (house), pludselig (suddenly), gave (present), bor (lives), dør (dies), hvornår (when), tog (train/took), spørger (asks), hvem (who). Further down the list I find advokat (lawyer). “man” is human-centric, but why are “home” and “sits” prevalent in jokes?

What is on the word list depends much on what you contrast with, e.g., du (you) and gik (went) appear as important words for the fairy tale. For the European Parliament contrasted with jokes, words such as hr (Mr.), Europa, fru (Ms.), denne (this), støtte (support), disse (these) and også (also) are important.
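The dependence on the contrast class can be sketched by training the same jokes against two different backgrounds and asking NLTK for the most informative features each time. This is only an illustration under assumptions: the helper `informative_words`, the label names and the toy sentences are made up (only the jokes are quoted from this post), and on a data set this small the ranking itself carries no meaning.

```python
import re

from nltk.classify import NaiveBayesClassifier


def features(text):
    """Bag-of-words presence features: {word: True} for each distinct token."""
    return {word: True for word in re.findall(r"\w+", text.lower())}


jokes = [
    "Hvorfor sømmer man låget fast på en kiste?",
    "Hvordan smider man en affaldscontainer væk?",
    "Og så var der fragtskibet, der var lastet med yoyoer. Det sank 50 gange.",
]
# Invented placeholder sentences, not taken from the real corpora.
parliament = [
    "Hr. formand, vi støtter også denne betænkning.",
    "Fru kommissær, Europa har brug for disse reformer.",
    "Hr. formand, Europa må også støtte disse lande.",
]
fairytale = [
    "Ællingen gik ud i den store verden.",
    "Du er så grim, sagde anden, og ællingen gik sin vej.",
    "Så gik den lille ælling ned til vandet.",
]


def informative_words(positive, negative, n=10):
    """Train a joke-vs-other classifier and return its top-n (word, value) features."""
    train_set = [(features(s), "joke") for s in positive]
    train_set += [(features(s), "other") for s in negative]
    classifier = NaiveBayesClassifier.train(train_set)
    return classifier.most_informative_features(n)


vs_parliament = informative_words(jokes, parliament)
vs_fairytale = informative_words(jokes, fairytale)

# A value of None means "word absent from the sentence".
print("jokes vs parliament:", vs_parliament)
print("jokes vs fairy tale:", vs_fairytale)
```

The two lists draw on different vocabularies: parliament words such as hr can only show up in the first contrast and fairy tale words such as gik only in the second. On a trained classifier, `show_most_informative_features` prints the same ranking together with the likelihood ratios between the classes.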

Mihalcea uses one-liners while I use general jokes. Jokes are often formed as a question, which is why I find “why” and “when” among the important joke words. The jokes scoring high with the joke classifier are also mostly questions. Some examples:

  • Hvorfor sømmer man låget fast på en kiste? (Why do they nail the lid on a coffin?)
  • Hvordan smider man en affaldscontainer væk? (How do you throw away a garbage bin?)
  • Hvis man spiser pasta og antipasta – er man så stadig lige sulten? (If you eat pasta and antipasta – are you then still just as hungry?)

Examples of non-question jokes are:

  • Godt: Hed udendørs sex. Dårligt: Du bliver anholdt. Værre: Af din mand. (Good: Hot outdoor sex. Bad: You get arrested. Worse: By your husband)
  • Og så var der fragtskibet, der var lastet med yoyoer. Det sank 50 gange. (And then there was the story about the ship that carried yo-yos. It sank 50 times)

Both of these follow a joke scheme: “Good, bad, worse” or “And then there was the story about”.

Among jokes classified as not a joke is the following verbose account:

“Selv om man kun måtte køre 50 km/t gennem den lille by, kørte de fleste stærkere. Man satte skilte op med tekster som ‘legende børn’, ‘vis hensyn’, og ‘skole’, men intet hjalp. Lige indtil man satte et skilt op hvor der stod: ‘nudistlejr’”

translated to:

“Even though you were only allowed to drive 50 km/h through the small town, most drivers drove faster. They put up signs with the texts ‘children playing’, ‘show consideration’ and ‘school’, but nothing helped. That is, until they put up a sign saying ‘nudist camp’.”

It is fun to look at the sentences from the European Parliament corpus that (erroneously) get a relatively high probability of being a joke. Here are some daring jokes from the European Parliament picked from the top 40:

  • Den var meget lille (It was very small)
  • De 15 er åbenbart ikke nok (The 15 are apparently not enough)
  • Fagforeningerne kommer, industrien kommer (The unions come. The industry comes)
  • Det er der heller ingen, der kan forstå (That is something no one can understand either)
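Such a ranking can be sketched by asking the classifier for the joke probability of each non-joke sentence and sorting on it. This is a hedged illustration, not the original analysis: the classifier is trained on a few toy sentences (the jokes are quoted from this post, the parliament-like training sentences are invented), and `joke_probability` is a made-up helper name. The four candidates are the parliament sentences listed above.

```python
import re

from nltk.classify import NaiveBayesClassifier


def features(text):
    """Bag-of-words presence features: {word: True} for each distinct token."""
    return {word: True for word in re.findall(r"\w+", text.lower())}


jokes = [
    "Hvorfor sømmer man låget fast på en kiste?",
    "Hvordan smider man en affaldscontainer væk?",
    "Og så var der fragtskibet, der var lastet med yoyoer. Det sank 50 gange.",
]
# Invented placeholder training sentences, not taken from the real corpus.
parliament_train = [
    "Hr. formand, vi støtter også denne betænkning.",
    "Fru kommissær, Europa har brug for disse reformer.",
    "Hr. formand, Europa må også støtte disse lande.",
]
# Held-out parliament sentences to be scored.
candidates = [
    "Den var meget lille",
    "De 15 er åbenbart ikke nok",
    "Fagforeningerne kommer, industrien kommer",
    "Det er der heller ingen, der kan forstå",
]

train_set = [(features(s), "joke") for s in jokes]
train_set += [(features(s), "parliament") for s in parliament_train]
classifier = NaiveBayesClassifier.train(train_set)


def joke_probability(sentence):
    """Posterior probability of the 'joke' label for one sentence."""
    return classifier.prob_classify(features(sentence)).prob("joke")


# Highest joke probability first.
ranked = sorted(((joke_probability(s), s) for s in candidates), reverse=True)
for probability, sentence in ranked:
    print(f"{probability:.2f}  {sentence}")
```

Short sentences that happen to share a few common words with the jokes float to the top, while sentences made up entirely of unseen words fall back to the class prior.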
