Guess which occupation is NOT the most frequent among persons from the Panama Papers


POLITICIAN! Politician is not a very frequent occupation among people in the Panama Papers. This may come as a surprise to those who have studied a bubble chart in a post on my blog. A sizeable portion of blog readers, tweeters and probably also Facebook users seem to have seriously misunderstood it. The crucial problem with the chart is that it is made from data in Wikidata, which only contains a very limited selection of persons from the Panama Papers. Let me give some background and detail the problem:

  1. Open Knowledge Foundation Danmark hosted a two-hour meetup in Cafe Nutid, organized by Niels Erik Kaaber Rasmussen, the day after the release of the Panama Papers. We were around 10 data nerds sitting with our laptops, and with the provided links most if not all of us started downloading the Panama Papers data files with the names and company information. Some tried installing the Neo4j database, which may help in querying the data.
  2. I originally spent most of my time at the cafe looking through the data by simple means. I used something like “egrep -i denmark” on the officers.csv file. This quick command will likely pull out most of the Danish people in the released Panama Papers. The result of the command is a small, manageable list of no more than 70 listings. Among the names I recognized NO politician, neither Danish nor international.
  3. The Danish broadcasting company DR has had priority access to the data. It is likely they have examined the more complete data in detail. It is also likely that if there had been a Danish politician in the Panama Papers, DR would have focused on that, breaking the story. NO such story came. Thus I think it is unlikely that there are any Danish politicians in the more complete Panama Papers dataset.
  4. Among the Danish listings in the officers.csv file from the released Panama Papers we found a couple of recognizable names. Among them was the name Knud Foldschack. Already on Monday, the day of the release, a Danish news site had run a story about that name. One Knud Foldschack is a lawyer who has involved himself in cases for left-wing causes. Having such a lawyer mentioned in the Panama Papers was a too-good-to-be-true media story, – and it was. It turned out that Knud Foldschack has both a father and a brother with the same name, and the news site may now look forward to meeting one of the Foldschacks in court, as he wants compensation for being wrongly smeared. His brother seems to be some sort of businessman. René Bruun Lauritsen is another name within the Danish part of the Panama Papers. A person bearing that name has had unfavourable mention in Danish media. One of the stories was his scheme of selling semen to women in need of a pregnancy. His unauthorized handling of semen with hand delivery got him a bit of a sentence. Another scheme involved outrageous stock trading. Whether Panama-Lauritsen is the same as Semen-Lauritsen I do not know, but one would be disappointed if such an unethical businessman was not in the Panama Papers. A third listing shares a fairly unique name with a Danish artist. To my knowledge Danish media have not run any story on that name. The overall conclusion from the small sample investigated is that politicians are not present, but that the names may be related to business persons and possibly an artist.
  5. Wikidata is a site in the Wikipedia family of sites. Though not well-known, the Wikidata site is one of the most interesting projects related to Wikipedia and, in terms of main namespace pages, far larger than the English Wikipedia. Wikidata may be characterized as the structured cousin of Wikipedia. Rather than editing free-form natural language as you do in Wikipedia, in Wikidata you only edit predefined fields. Several thousand types of fields exist. To describe a person you may use fields such as date of birth, occupation, authority identifiers (such as VIAF), homepage and sex/gender.
  6. So what is in Wikidata? Items corresponding to almost all Wikipedia articles appear in Wikidata – not just the articles in the English Wikipedia, but also those in every other language version of Wikipedia. Apart from these items, which can be linked to Wikipedia articles, Wikidata also has a considerable number of other items. For instance, one Dutch user has created items for a great number of paintings in the National Gallery of Denmark, – paintings which for the most part have no Wikipedia article in any language. Although Wikidata records an impressive number of items, it does not record everything. The number of persons in Wikidata is only 3,276,363 at the time of writing, and it rarely includes persons who have not made their mark in the media. The typical listing in the Panama Papers is a relatively unknown man. He is unlikely to appear in Wikidata. And no one adds such a person just because s/he is listed in the Panama Papers. Obviously Wikidata has an extraordinary bias towards famous persons: politicians, nobility, sports people, artists, performers of any kind, etc.
  7. Items for persons in Wikidata who also appear in the Panama Papers can indicate a link to the Panama Papers. There is no dedicated way to do this, but the ‘key event’ property has been used for that. It is apparently the noted Wikimedian Gerard Meijssen who has made most of these edits. How complete it is with respect to persons in Wikidata I do not know, but Meijssen also added two Danish football players who I believe were only mentioned in Danish media. He could have relied on the English Wikipedia, which had an overview of Panama Papers-listed people.
  8. When we have data in Wikidata, there are various ways to query the data and present them. One way is to use wiki whizkid Magnus Manske’s Listeria service with a query on any Wikipedia. Manske’s tool automagically builds a table with information. Wikimedia Danmark chairman Ole Palnatoke Andersen apparently had discovered Meijssen’s work on Wikidata, and Palnatoke used Manske’s tool to make a table with all people in Wikidata marked with the ‘key event’ “Panama Papers”. It only generates a fairly small list, as not that many people in Wikidata are actually linked to the Panama Papers. Palnatoke also let Manske’s tool show the occupation for each person.
  9. Back to the Open Knowledge Foundation meeting in Copenhagen Tuesday evening: I was a bit disappointed not to be able to data mine any useful information from the Panama Papers dataset. So after becoming aware of Palnatoke’s table I grabbed (stole) his query statement and modified it to count the number of occupations. The Wikimedia Foundation – the organization that hosts Wikipedia and Wikidata – has set up a so-called SPARQL endpoint and an associated graphical interface. It allows any Web user to make powerful queries across all of Wikidata’s many millions of statements, including the limited number of statements about the Panama Papers. The service is under continuous development and has in the past been somewhat unstable, but it is nevertheless a very interesting service. Frontend developer Jonas Kress has in 2016 implemented several ways to display the query result. Initially there was just a plain table view, but the service now shows results on a map, if the query result includes geocoordinates, and as a bubble chart, if the query result includes numerical data. Other, later implemented forms of output are timelines, multiview and networks. Making a bubble chart with counts of occupations with the SPARQL service takes nothing more than a couple of lines in the SPARQL language and a push on the “Run” button. So the Panama Papers occupation bubble chart should be seen as a demonstration of the capabilities of Wikidata and its associated services for quick queries and visualizations rather than as a faithful representation of the occupations of people mentioned in the released Panama Papers.
  10. A sizeable portion of people misunderstood the plot and regarded it as evidence of the dark deeds of politicians. Lacking a good understanding of the technical details of Wikidata, people used their preconceived opinions about politicians to interpret the bubble chart. They were helped along the way by a, in my opinion, misleading title (“Panama Papers bubble chart shows politicians are most mentioned in document leak database”) and an incomplete explanation in an article in The Independent. On the other hand, Le Monde had a good critical article.
  11. I believe my own blog, where I published the plot, was not to blame. It does include a SPARQL command so any knowledgeable person can see and modify the results himself/herself. Perhaps some people were confused by my blog describing me as a researcher, – and thought that this was a research result on the Panama Papers.
  12. My blog has in its several years of existence had 20,000 views. The single post with the Panama Papers bubble chart yielded a 10-fold increase in the number of views over the course of a few days, – my first experience with a viral post. Most referrals were from Facebook. The referral does not indicate which page on Facebook it comes from, so it is impossible to join the discussion and clarify any misunderstanding. A portion of the referrals also came from Twitter and Reddit, where I joined the discussion. I also tried to engage with social media users who used the WordPress comment feature on my blog. On Reddit I felt there was a good response, while for Facebook I felt it was irresponsible. Facebook boosts misconceptions and does not let me join the discussion and engage to correct them.

    The plot of a viral post: Views on my blog around the time of the Panama Papers bubble chart publication.
  13. Is there anything I could have done? I could have erased my two tweets and modified my blog post to introduce a warning with a stronger explanation.
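The quick case-insensitive search from item 2 can also be sketched in Python. This is only an illustration: the sample rows are hypothetical stand-ins, and the search runs over whole lines (as egrep does), so nothing is assumed about the real column layout of officers.csv.

```python
import re

def grep_lines(lines, pattern):
    """Return the lines that match pattern, ignoring case (like egrep -i)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if regex.search(line)]

# Hypothetical stand-in rows; the real officers.csv layout is not assumed here.
sample = [
    "Example Person A,Denmark",
    "Example Person B,PANAMA",
    "Example Person C,denmark",
]

matches = grep_lines(sample, "denmark")
print(len(matches))
```

For the real file, an open file object can be passed in place of the sample list, since the function only iterates over lines.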

Summing up my experience with the release of the Panama Papers and the subsequent viral post, I find that our politicians appear not to be corrupt and do not deal with shady companies – except for a few cases. Rather, it seems that loads of people have preconceived opinions about their politicians and are willing to spread their ill-founded beliefs to the rest of the world. They have little technical understanding and do not question data provenance. The problems may be amplified by Facebook.

And here is the now infamous plot:


The Facebook emotion contagion experiment and sentiment analysis


The Facebook emotion contagion experiment, Experimental evidence of massive-scale emotional contagion through social networks, has caused quite a stir.

I have commented and collected a bit on the study on the Brede Wiki, where there are pointers to news and blog articles as well as related research and critique.

During the brouhaha I was contacted by a Wired writer, Marcus Wohlsen. Apparently he had run into Jonty Waering, who had made a Chrome browser extension as a sarcastic comment on the claims in the study. Waering had used my sentiment analysis word list for his browser extension. I discovered Wohlsen’s email too late for the deadline of his article, but Waering and his browser extension put the claims and the experiment well down to earth.

I made a few notes to Wohlsen, here slightly edited:

Simple sentiment analysis looks at individual words and surely does not necessarily capture the ‘true’ expressed emotion, as it ignores, say, context, negation and sarcasm. Nevertheless, many studies now show that this simple word-based approach can to some degree determine the sentiment of a text. There has been some effort towards more sophisticated sentiment analysis that handles emoticons, negation and booster words (such as VERY happy), as well as combining the many sentiment-labeled word lists that now exist. Such methods generally increase the performance of the sentiment analysis. If you have access to a data set where texts are labeled for sentiment, machine learning can boost the performance further.
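A minimal sketch of such a word-based scorer with negation and booster handling could look as follows. The tiny lexicon, the negation words and the booster weights are all hypothetical values chosen for illustration; they are not taken from any particular published word list.

```python
# Hypothetical mini-lexicon; a real word list has thousands of scored entries.
LEXICON = {"happy": 3, "good": 2, "sad": -2, "terrible": -3}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 1.5, "really": 1.5}

def sentiment(text):
    """Sum word valences, flipping after a negation and scaling after a booster."""
    score = 0.0
    negate = False
    boost = 1.0
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            negate = True
        elif word in BOOSTERS:
            boost *= BOOSTERS[word]
        elif word in LEXICON:
            value = LEXICON[word] * boost
            score += -value if negate else value
            negate, boost = False, 1.0
        else:
            # Modifiers only affect the word that directly follows them.
            negate, boost = False, 1.0
    return score

print(sentiment("I am VERY happy!"))
print(sentiment("I am not very happy"))
```

Context and sarcasm remain out of reach for this kind of scorer, which is exactly the limitation described above.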

It should be noted that a text does not necessarily have a definite sentiment. On short text, such as Twitter messages, even humans may not agree on sentiment.

A remaining question is how well the expressed emotion in a status message as determined by sentiment analysis corresponds to the internal emotional state of the writer. Some researchers take a one-to-one correspondence for granted and argue that you can measure ‘happiness’ [1]. But is it so? I think we are on more shaky ground. There are research projects on suicide prediction by social media monitoring. The Durkheim Project [2] focuses on predicting military and veteran suicide risk and, by the way, works with Facebook for recruitment and has full IRB approval and opt-in. I see no results reported yet.

If we look outside social media, a pair of researchers from Montclair State University collected song lyrics from non-suicide and suicide lyricists and created a classifier for predicting whether a song was associated with a suicide lyricist [3]. With my word-based sentiment analysis as one of the features, the researchers make a reasonably good prediction. So text analytics seems to be able to predict ‘real’ emotions, – yet again with the condition ‘to some degree’ and certainly not with certainty.

I would also like to note a problem with the Facebook study as I see it. It is the issue of word burstiness. Texts are correlated. To me it is unclear if the study just shows the contagiousness of words – regardless of the emotionality.

[1] See e.g., Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter

[2] http://www.durkheimproject.org/our-project/

[3] Suicidal tendencies: the automatic classification of suicidal and non-suicidal lyricists using NLP

Facebook contagion study contagion


No less than 4 different people alerted me on my Facebook feed to the study on sentiment analysis and Facebook feeds, Experimental evidence of massive-scale emotional contagion through social networks. There are quite a number of media reports on it, and researchers like Tal Yarkoni and Brian Keegan have commented on it. I have put some pointers on the Brede Wiki.

Lars Kai Hansen states that he “doubt[s] emotional filtering is routine at [Facebook].”

I think that Facebook is actually doing emotional filtering. Here is a comment I made on a thread in Facebook: Doesn’t the brouhaha show how naïve Facebook users are? Such A/B tests occur all the time. Twitter and Facebook are continuously optimizing the news feed ordering (at least I should think so). Having sentiment analysis as a component is just one feature. If they didn’t do that the data scientists would not be worth their job. If you as a user care about privacy and not being manipulated you should stop using these social media. I suppose that for the scientists involved and possibly PNAS it is a bit of an issue if the IRB is not there. Susan Fiske seems not to be quite clear on that point.

Crowdsourcing medical diagnosis via Facebook


A study I am surprised hasn’t generated headlines is Laypersons can seek help from their Facebook friends regarding medical diagnosis (Danish: Lægfolk kan bruge deres Facebook-venner til at få hjælp vedrørende medicinske diagnoser). Hype is there alright: crowdsourcing and Facebook. Let your Facebook friends diagnose your disease.

The authors of the study – 4 Danish medical doctors – found 8 willing subjects who would use their Facebook wall to post one of 6 short medical cases (see below) selected from an English textbook and translated (I suppose) into Danish. The friends of the subjects could then propose a diagnosis by commenting on the post. For 6 of the 8 Facebook users (5 of the 6 stories) a correct diagnosis was suggested. The number of answers to each posted case was from 1 to 14, – the authors had thought that more Facebook friends would participate. Only one of the correct diagnoses was suggested by a medically trained person.

The authors report the median response time to correct diagnosis as 9.5 minutes. However, this is among the people that got a correct answer! If you add the people that did not get a correct diagnosis at all, you get a median time to correct diagnosis of 21 minutes, calculated as median([75, 9, 8, inf, 3, 11, 31, inf]). But even 21 minutes might be quite quick compared to the ordinary Danish health service. For one case that – I believe – could require surgery within hours, the time to the correct answer was 3 minutes.
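The 21-minute figure can be checked with a few lines of Python. The values are the ones listed above; representing "never received a correct diagnosis" as infinity is the modelling choice that makes the two unanswered cases count.

```python
import math
from statistics import median

# Minutes to the first correct diagnosis per Facebook user, as listed above;
# math.inf marks the two users who never received a correct diagnosis.
times = [75, 9, 8, math.inf, 3, 11, 31, math.inf]

median_all = median(times)
answered = [t for t in times if math.isfinite(t)]

print(median_all)      # median over all eight users
print(len(answered))   # number of users with a correct diagnosis
```

With eight values the median is the mean of the fourth and fifth sorted values, so the infinities push the result up without making it infinite.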

The authors note the number of “acceptable answers”. In the information retrieval context of precision/recall, the number of acceptable answers addresses the precision: It doesn’t help that the correct diagnosis is posted if it is overwhelmed by a large number of wrong diagnoses (false positives). The authors accepted differential diagnoses as acceptable and found rates between 14% and 100% of acceptable answers, i.e., in one case only 2 out of 14 suggestions (14%) were acceptable. One critique of this measure is that the authors regard obviously humorous diagnoses as “wrong” answers (AFAI read), e.g., one suggestion for the cause of the disease of a girl was that she was depressed because a specific football club did not sign a proper goalkeeper for the season.

The study is small but nevertheless interesting. It doesn’t show that a collective of non-experts is better than an expert as does, e.g., Extracting collective probabilistic forecasts from web games. However, I think that the diagnoses were surprisingly quick.
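As a small worked example of the precision reading, the lowest rate mentioned above follows directly from the two counts. The function name is just an illustrative choice.

```python
def precision(acceptable, total):
    """Fraction of the suggested diagnoses that were acceptable."""
    if total == 0:
        raise ValueError("no suggestions were posted")
    return acceptable / total

# The case with 2 acceptable suggestions out of 14.
worst_case = precision(2, 14)
print(f"{worst_case:.0%}")
```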

For the broader picture the study gives an idea how sociotechnical systems may help in a welfare state.

Here are all the six medical cases translated from Danish:

  1. A 62 year old man is coughing and has had fever since he came home from India two months ago. Now there has begun to be a bit of blood with the coughing. What is he suffering from?
  2. What disease comes to mind when you read: A 38 year old guy has swollen fingers, swollen hand joints and ankles. The joints are sore and swollen and stiff for over an hour every morning?
  3. If you have pain down the right side of the stomach below the belly button (navel), what’s wrong?
  4. A 35-year-old woman has a burning sensation in the diaphragm after eating, even if she only eats very little. She can no longer eat spicy food, drink coffee or chew chewing gum, what’s wrong?
  5. What do you think is wrong? A girl of 26 years has lost 6 kg in weight, feels restless and has occasional palpitations. She also has a slight swelling on the neck.
  6. An elderly gentleman has a terrible pain in the big toe base joint: It is completely white and he cannot even have a blanket resting over his foot, what do you think he suffers from?

You are not allowed to look at the solution from the medical journal. Google searching is allowed. You can put your suggestion in the comment field. Bonus task is to suggest treatments.

Nine, five, six, twelve or eight degrees of separation in Brede Wiki and four in Facebook


Some days ago the world press was abuzz with the study on the Facebook friend graph that found the average distance between active Facebook users to be 4.74, i.e., almost 5, meaning that there are on average 4 Facebook linked friends separating one Facebook user from another. See also the brief summary on the Brede Wiki.

There are standard algorithms on the shelf to compute the distance for small graphs, but because the Facebook graph is so huge you/they need special algorithms. First author Lars Backstrom, employed at Facebook (and who gave a keynote at the 8th Extended Semantic Web Conference about the Facebook friend suggester), had the Facebook data and got hold of an algorithm from Milano researchers that could handle the 721 million active Facebook users and their 69 billion friendship links.

In a previous study the Milano researchers examined the “spid”, i.e., the variance-to-mean ratio of the distances. They claim that “spid larger than one are to be considered ‘web-like’, whereas networks with a spid smaller than one are to be considered ‘properly social’” and demonstrated that on a number of social and web networks. The Facebook study found a spid of 0.08.

I am confused somewhat by the notion of six degrees of separation. Firstly, does “degrees of separation” mean the number of persons (nodes) or the number of friendships (edges) between a source person and a target person? Backstrom & Co. “will assume that ‘degree of separation’ is the same as ‘distance minus one’.”, that is, we are counting the persons (nodes) between source person and target person. Another problem is whether the “six” refers to
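As a sketch of the definition, the spid can be computed from a distance distribution as follows. The example distribution is hypothetical, and the population variance is assumed here; the original study may use a different estimator.

```python
def spid(distance_counts):
    """Variance-to-mean ratio of distances given as {distance: number of pairs}."""
    total = sum(distance_counts.values())
    mean = sum(d * n for d, n in distance_counts.items()) / total
    variance = sum(n * (d - mean) ** 2 for d, n in distance_counts.items()) / total
    return variance / mean

# Hypothetical distribution: most pairs are 4 or 5 steps apart.
example = {3: 10, 4: 30, 5: 50, 6: 10}
print(spid(example))
```

A tightly concentrated distribution like this one gives a value well below one, i.e., on the ‘properly social’ side of the Milano criterion.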

  1. the average distance between all pairs,
  2. the maximum of the average distance for each person,
  3. the maximum distance between all pairs (the diameter), or
  4. the average eccentricity; the eccentricity being the maximum distance for each person to any other person.

If you look at the first sentence of the present version of the Wikipedia article, I think it alludes to the first interpretation. Playwright John Guare’s six degrees seem rather to be the third interpretation.


With the co-authorship graph from the Brede Wiki I can compute these different distances. The co-authors are not fully connected, but the largest connected component has 665 names, which resolve to somewhat below 665 people (I have uncorrected problems with, e.g., “Saffron A. Willis-Owen”/“Saffron A. G. Willis-Owen”). On this graph I find the mean distance to be 5.65, the mean eccentricity to be 9.37 and the diameter 12. Computing the spid I get 0.73, i.e., a “social network” according to the Milano criterion.

I wonder why the average Facebook distance is so low. Jon Kleinberg mentions “weak ties”. Some of my Facebook friends are linked to public figures in Denmark. Could it be that Facebook users tend to connect with famous persons and that these famous people tend to act as hubs? Another phenomenon that I think I have noticed on Facebook is that when people travel abroad and have a cursory acquaintanceship they tend to become friends on Facebook, perhaps as a kind of token and reminder. Are such brief encounters actually there and important for the low average distance?
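For a small graph the four candidate interpretations can be computed with plain breadth-first search. The sketch below assumes a connected, undirected graph stored as a dictionary of neighbour sets; the four-node path is a hypothetical toy graph, not the Brede Wiki data. All values are distances, so the “degrees of separation” in the Backstrom sense would be one less.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def distance_summary(graph):
    """Summaries for a connected, undirected graph given as {node: set of neighbours}."""
    per_node_mean, eccentricities, all_distances = [], [], []
    for node in graph:
        d = [v for other, v in bfs_distances(graph, node).items() if other != node]
        all_distances.extend(d)
        per_node_mean.append(sum(d) / len(d))
        eccentricities.append(max(d))
    return {
        "mean_distance": sum(all_distances) / len(all_distances),
        "max_mean_distance": max(per_node_mean),
        "diameter": max(eccentricities),
        "mean_eccentricity": sum(eccentricities) / len(eccentricities),
    }

# Hypothetical toy graph: the path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
summary = distance_summary(path)
print(summary)
```

Running one search per node is fine for a few hundred nodes, but it is exactly what becomes infeasible at Facebook scale, hence the special algorithm.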


(2012-01-16: Language correction)

Social media paranoia


In my effort to be updated on and investigate social media I got an account on one of the large Facebook-wannabe websites with social network facilities. I knew of no one on the site and was therefore fairly surprised when its friend-suggester came up with a person that I knew, – and only that person! How was the website able to know I was connected to that person? There is exceedingly little information on the public Internet to connect me with that person. The person is in another place, is of another age and is in another business. If I google on the public web I find no pages that mention me and the person on the same page. The way that I logged into the social website was independent of other social websites: I didn’t explicitly tell the website about my other accounts on Twitter, Facebook, MySpace, LinkedIn, Xing, Tumblr or Posterous. Thus it could not get access to my social network through me, so the website must have gotten this relatively private information from somewhere. How?

I will come back to that. First a bit on other issues of privacy.

I recently went to the Extended Semantic Web Conference where Abe Hsuan provided one of the fine keynote talks. He focused on privacy on the Web and the “Data Valdez”. Among the topics he addressed were:

  • The Dog Poop Girl from Seoul who was photographed by an anonymous subway passenger. The girl’s dog had shit on the floor of the Seoul subway train and the girl was so embarrassed that she left it there. As the photo was released an Internet storm arose against the poor girl, her identity and personal details being revealed.
  • In the AOL Data Valdez the company released 20 million Web search queries in 2006, and with a bit of compiling journalists could reveal the identity of individual users, e.g., a 62-year old woman and her search queries on “60 single men” and other personal searches.
  • Netflix Personalization Challenge where researchers could break the anonymization in the video rental company’s data by correlating data with IMDb.
  • Pandora’s Android App that appears to send users’ birth date, gender and GPS information to advertising companies according to an analysis by Veracode.

Hsuan also pointed to a whole series of companies that specialize in correlating information across and beyond the Web: bluecava, Blue Kai, Epsilon/Abacus, TargusInfo, brilig.com, Sense Networks, Ingenix (prescription drug history, therapeutic outcomes and billing information), face.com (facial recognition technology). In April 2011 researchers reported that Apple devices stored lists of locations with timestamps without the user’s acknowledgment. According to Apple, this is just to help the user get faster geolocation through wifi and mobile phone tower data rather than slow GPS. However, with access to the unencrypted backup of the device you will be able to observe the travels of the device user.

Google got itself into a lawsuit after collecting and transmitting location data on the Android platform, see here.

Revealing too much about your location in public may give thieves a good opportunity, and a Danish insurance company advises users to remove the Facebook Places application. There is an asymmetry in knowledge: The thieves know when you are away from your house, but the thieves are not willing to reveal that they are in your house. The interesting website Social Clusters by Morten Barklund allows you to make intelligent visualizations of your friend network from Facebook. To enable that you have to reveal your friend network to the Web service. Though eager to try it out, I was too reluctant to reveal my network. Regardless, I am probably already in the database as some people in my friend network on Facebook have signed up, i.e., I am not among the presently 102 registered users but very likely among the presently 28,513 connected persons. Registration may not be necessary to reveal your friend network. If one among your Facebook friends has an open profile, some information about you is revealed even if you have a closed profile. According to a study, a third of Danes on social networks regularly upload photos of people other than themselves. And among these a fourth has an open profile. So if you have more than 12 friends there is a fair chance/risk that an image of you is accessible to non-friends even if you have a closed profile and never uploaded images of yourself.
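The back-of-the-envelope reasoning behind the 12 friends can be made explicit. The sketch assumes that friends behave independently and that every friend who uploads photos of others with an open profile has actually uploaded one of you; both are simplifying assumptions of mine, so the result is closer to an upper bound than to an estimate.

```python
def prob_photo_exposed(n_friends, p_uploads_others=1 / 3, p_open_profile=1 / 4):
    """Chance that at least one friend both uploads photos of others and has an open profile."""
    p_friend = p_uploads_others * p_open_profile  # 1/12 with the figures above
    return 1 - (1 - p_friend) ** n_friends

p_twelve = prob_photo_exposed(12)
print(round(p_twelve, 3))
```

With 12 friends the expected number of such friends is one, and under these assumptions the chance of at least one is roughly two in three.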

Now back to the social website that guessed right with its friend suggester. How did it do it? Here are some suggestions:

  • Facebook could have revealed my friend network to the website. This option is unlikely given that Facebook and the website are competitors.
  • The website could have obtained information from the public information on Facebook. I think this is also unlikely. Facebook would not allow a competitor to crawl its website to acquire the friend network.
  • Before I understood that Facebook applications were actually third parties and did not just keep the data within Facebook, I added a few applications. One among them was the Friend Wheel. I do not know what the Friend Wheel does with my data, but I don’t think that it has got to the social website.
  • A likely path for the data is that the other person logged in via Facebook so the other social website could get hold of the Facebook friend network. As I was in this network and my name is pretty unique, the social website could match up my name with the name in the friend network.

Who knows Who knows? You can now play a Facebook application while doing research



At the 8th Extended Semantic Web Conference researchers from Potsdam showed a Facebook application. It is a quiz game and is called Who Knows?

The special thing about it is that the questions are automatically generated from Wikipedia via DBpedia. As users’ interaction with the game is recorded, the result may be used to improve the ranking of triple data in Semantic Web applications as well as to find errors in Wikipedia/DBpedia.

The background scientific paper is WhoKnows? – Evaluating Linked Data Heuristics with a Quiz that Cleans Up DBpedia. Last author Harald Sack is presently at the top of the high score list.

Another of their Facebook quiz applications is Risq.