POLITICIAN! Occupation as politician is not very frequent among people in the Panama Papers. This may come as a surprise to those who have studied the bubble chart published in a post on my blog. A sizeable portion of blog readers, tweeters and probably also Facebook users seem to have seriously misunderstood it. The crucial problem with the chart is that it is made from data in Wikidata, which only contains a very limited selection of persons from the Panama Papers. Let me tell you some background and detail the problem:
- Open Knowledge Foundation Danmark hosted a two-hour meetup in Cafe Nutid, organized by Niels Erik Kaaber Rasmussen, the day after the release of the Panama Papers. We were around 10 data nerds sitting with our laptops, and with the provided links most, if not all, started downloading the Panama Papers data files with the names and company information. Some tried installing the Neo4J database, which may help in querying the data.
- I originally spent most of my time at the cafe looking through the data by simple means. I used something like “egrep -i denmark” on the officers.csv file. This quick command will likely pull out most of the Danish people in the released Panama Papers. The result of the command is a small, manageable list of no more than 70 listings. Among the names I recognized NO politician, neither Danish nor international.
- The Danish broadcasting company DR has had priority access to the data. It is likely that they have examined the more complete data in detail. It is also likely that if there had been a Danish politician in the Panama Papers, DR would have focused on that and broken the story. NO such story came. Thus I think it is unlikely that there are any Danish politicians in the more complete Panama Papers dataset.
- Among the Danish listings in the officers.csv file from the released Panama Papers we found a couple of recognizable names. Among them was the name Knud Foldschack. Already on Monday, the day of the release, a Danish news site had run a media story about that name. One Knud Foldschack is a lawyer who has involved himself in cases for left-wing causes. Having such a lawyer mentioned in the Panama Papers was a too-good-to-be-true media story – and it was. It turned out that Knud Foldschack had no less than both a father and a brother with the same name, and the news site may now look forward to meeting one of the Foldschacks in court, as he wants compensation for being wrongly smeared. His brother seems to be some sort of businessman. René Bruun Lauritsen is another name within the Danish part of the Panama Papers. A person bearing that name has had unfavourable mention in Danish media. One of the stories was his scheme of selling semen to women in need of a pregnancy. His unauthorized handling of semen with hand delivery got him a bit of a sentence. Another scheme involved outrageous stock trading. Whether Panama-Lauritsen is the same as Semen-Lauritsen I do not know, but one would be disappointed if such an unethical businessman was not in the Panama Papers. A third listing shares a fairly unique name with a Danish artist. To my knowledge Danish media have not run any story on that name. The overall conclusion from the small sample investigated is that politicians are not present, but that the names may be related to business persons and possibly an artist.
- Wikidata is a site in the Wikipedia family of sites. Though not well known, the Wikidata site is one of the most interesting projects related to Wikipedia and, in terms of main namespace pages, far larger than the English Wikipedia. Wikidata may be characterized as the structured cousin of Wikipedia. Rather than editing in free-form natural language as you do in Wikipedia, in Wikidata you only edit in predefined fields. Several thousand types of fields exist. To describe a person you may use fields such as date of birth, occupation, authority identifiers (such as VIAF), homepage and sex/gender.
- So what is in Wikidata? Items corresponding to almost all Wikipedia articles appear in Wikidata – not just the articles in the English Wikipedia, but also those in every other language version of Wikipedia. Apart from these items, which can be linked to Wikipedia articles, Wikidata also has a considerable number of other items. For instance, one Dutch user has created items for a great number of paintings in the National Gallery of Denmark – paintings which for the most part have no Wikipedia article in any language. Although Wikidata records an impressive number of items, it does not record everything. The number of persons in Wikidata is only 3276363 at the time of writing, and it rarely includes persons who have not made their mark in the media. The typical listing in the Panama Papers is a relatively unknown man. He is unlikely to appear in Wikidata. And no one adds such a person just because s/he is listed in the Panama Papers. Obviously Wikidata has an extraordinary bias towards famous persons: politicians, nobility, sports people, artists, performers of any kind, etc.
- Items for persons in Wikidata who also appear in the Panama Papers can indicate a link to the Panama Papers. There is no dedicated way to do this, but the ‘key event’ property has been used for that. It is apparently noted Wikimedian Gerard Meijssen who has made most of these edits. How complete it is with respect to persons in Wikidata I do not know, but Meijssen also added two Danish football players who I believe were only mentioned in Danish media. He could have relied on the English Wikipedia, which had an overview of Panama Papers-listed people.
- When we have data in Wikidata, there are various ways to query the data and present them. One way is to use wiki whizkid Magnus Manske’s Listeria service with a query on any Wikipedia. Manske’s tool automagically builds a table with information. Wikimedia Danmark chairman Ole Palnatoke Andersen had apparently discovered Meijssen’s work on Wikidata, and Palnatoke used Manske’s tool to make a table with all people in Wikidata marked with the ‘key event’ “Panama Papers”. It only generates a fairly small list, as not that many people in Wikidata are actually linked to the Panama Papers. Palnatoke also let Manske’s tool show the occupation for each person.
- Back to the Open Knowledge Foundation meeting in Copenhagen Tuesday evening: I was a bit disappointed at not being able to data mine any useful information from the Panama Papers dataset. So after becoming aware of Palnatoke’s table I grabbed (stole) his query statement and modified it to count the number of persons per occupation. Wikimedia Foundation – the organization that hosts Wikipedia and Wikidata – has set up a so-called SPARQL endpoint and an associated graphical interface. It allows any Web user to make powerful queries across all of Wikidata’s many millions of statements, including the limited number of statements about the Panama Papers. The service is under continuous development and has in the past been somewhat unstable, but it is nevertheless a very interesting service. Frontend developer Jonas Kress has in 2016 implemented several ways to display the query result. Initially it was just a plain table view, but it now features results on a map, if any geocoordinates are included in the query result, and a bubble chart, if there is any numerical data in the query result. Other, later implemented forms of output are timelines, multiview and networks. Making a bubble chart with counts of occupations with the SPARQL service takes nothing more than a couple of lines of commands in the SPARQL language and a push on the “Run” button. So the Panama Papers occupation bubble chart should be seen as a demonstration of the capabilities of Wikidata and its associated services for quick queries and visualizations rather than as a faithful representation of the occupations of people mentioned in the released Panama Papers.
- A sizeable portion of people misunderstood the plot and regarded it as evidence of the dark deeds of politicians. Rather than relying on a good understanding of the technical details of Wikidata, people used their preconceived opinions about politicians to interpret the bubble chart. They were helped along the way by what was, in my opinion, a misleading title (“Panama Papers bubble chart shows politicians are most mentioned in document leak database”) and an incomplete explanation in an article in The Independent. On the other hand, Le Monde had a good critical article.
- I believe my own blog, where I published the plot, was not to blame. It does include a SPARQL command so any knowledgeable person can see and modify the results himself/herself. Perhaps some people were confused by my blog describing me as a researcher and thought that this was a research result on the Panama Papers.
- My blog has in its several years of existence had 20,000 views. The single post with the Panama Papers bubble chart yielded a tenfold increase in the number of views over the course of a few days – my first experience with a viral post. Most referrals were from Facebook. The referral does not indicate which page on Facebook it comes from, so it is impossible to join the discussion and clarify any misunderstanding. A portion of the referrals also came from Twitter and Reddit, where I joined the discussion. I also tried to engage with the social media users who used the WordPress comment feature on my blog. On Reddit I felt I got a good response, while Facebook, I felt, was irresponsible. Facebook boosts misconceptions and does not let me join the discussion to correct them.
- Is there anything I could have done? I could have erased my two tweets and modified my blog post introducing a warning with a stronger explanation.
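The occupation count described in the list above can be sketched in Python. This is a minimal illustration, not the exact query from my blog post: the names QUERY_TEMPLATE, build_query and occupation_counts are made up for the example, the Wikidata item ID for the Panama Papers is left as a parameter you would have to look up, and the bindings are assumed to follow the standard SPARQL JSON results layout.

```python
from collections import Counter

# P793 is the 'key event' property (later renamed 'significant event') and
# P106 is 'occupation'. The event item ID is supplied by the caller.
QUERY_TEMPLATE = """
SELECT ?occupationLabel (COUNT(DISTINCT ?person) AS ?count) WHERE {{
  ?person wdt:P793 wd:{event_qid} ;
          wdt:P106 ?occupation .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
GROUP BY ?occupationLabel
ORDER BY DESC(?count)
"""


def build_query(event_qid):
    """Return SPARQL text counting persons per occupation for one event."""
    return QUERY_TEMPLATE.format(event_qid=event_qid)


def occupation_counts(bindings):
    """Sum the counts per occupation label from SPARQL JSON result bindings.

    'bindings' is the list found under results -> bindings in the JSON
    that a SPARQL endpoint returns.
    """
    counts = Counter()
    for row in bindings:
        label = row["occupationLabel"]["value"]
        counts[label] += int(row["count"]["value"])
    return counts
```

The text from build_query can be pasted into the query service's graphical interface. Whatever the counts turn out to be, they only cover persons who already have a Wikidata item, which is exactly the limitation discussed above.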
Summing up my experience with the release of the Panama Papers and the subsequent viral post, I find that our politicians turn out not to be corrupt and do not deal with shady companies – except for a few cases. Rather, it seems that loads of people have preconceived opinions about their politicians and are willing to spread their ill-founded beliefs to the rest of the world. They have little technical understanding and do not question data provenance. The problems may be amplified by Facebook.
And here is the now infamous plot:
In our Responsible Business in the Blogosphere project we are mining the blogosphere. So far we have mostly considered the microblogosphere represented by Twitter. We have two research articles on that topic: Good Friends, Bad News – Affect and Virality in Twitter and A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.
One of the reasons why we focus on Twitter is that the data we can get is structured. You get a structure in the JSON format back from the Twitter web service that is easy to handle. General blogs are somewhat more difficult to handle: first you need to find the blogs and then you need to extract the relevant data from the webpage. This is not particularly easy, though some interesting tools exist for these tasks.
There are some blog sites that provide structured information relatively easily. The blog site that I use, Posterous, provides an API that lets users and programmers download information. There are actually two versions: the old one provides information in the XML format, the newer one in JSON.
In my initial effort to make something useful I looked at my own blog using the old API. You need to call a URL with something like:
XML is returned. I did not manage to parse the XML in a structured way (using standard Python libraries) but used an ad hoc approach to turn the XML into a JSON-like structure with the numerical fields converted to numbers and the ‘body’ field with the actual text maintained as HTML. Apart from the postings themselves there are substructures for comments and media files that you might want to handle.
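A more structured alternative to the ad hoc approach could use ElementTree from the Python standard library. The sketch below is only that: it assumes a hypothetical layout with one post element per posting and child elements named title, views, date and body, and the names posts_from_xml, NUMERIC_FIELDS and SAMPLE_XML are made up for the example. The real response may well be laid out differently.

```python
import xml.etree.ElementTree as ET

# Fields converted to numbers; which fields are numeric is an assumption.
NUMERIC_FIELDS = {"views"}

# Hypothetical example input, not an actual API response.
SAMPLE_XML = """<rsp>
  <post>
    <title>Hello world</title>
    <views>42</views>
    <date>Mon, 03 Jan 2011 10:00:00 -0800</date>
    <body><![CDATA[<p>Hi</p>]]></body>
    <comments><comment><body>Nice</body></comment></comments>
  </post>
</rsp>"""


def posts_from_xml(xml_text, post_tag="post"):
    """Turn each post element into a flat dictionary (a JSON-like structure)."""
    root = ET.fromstring(xml_text)
    posts = []
    for element in root.iter(post_tag):
        post = {}
        for child in element:
            if len(child):
                # Nested substructures (comments, media) are skipped here.
                continue
            value = child.text or ""
            if child.tag in NUMERIC_FIELDS:
                post[child.tag] = int(value)
            else:
                # Text fields, including the HTML body, are kept as strings.
                post[child.tag] = value
        posts.append(post)
    return posts
```

Calling posts_from_xml(SAMPLE_XML) gives a list with one dictionary per post, with 'views' as an integer and 'body' still as HTML. Comments and media files would need their own handling.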
In this first application I managed to plot the number of views of each blog post as a function of the date. The two articles that got the most views are one about the Milena Penkowa case and one about Natalie Portman, who was in the news due to her recent film Black Swan. Earlier articles that received a substantial number of views – more than normal – were more nerdish accounts of my problems with Ubuntu. My most recent articles have a fairly low number of views. I have several theories why that is.
How easy it is to crawl all Posterous blogs I do not yet know. Compared to Twitter the data you get are less social. In Twitter you have loads of retweets and direct messages between users that you can analyze. In Posterous you do have what corresponds to friends and followers by what is called ‘my subscriptions’. You also have comments.
The Python code that does the plotting is here:
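    import datetime
    import dateutil.parser
    import pytz
    from matplotlib.pyplot import plot, text, title, xlabel, ylabel

    # 'posterous' is the list of post dictionaries built from the API response
    views = [p['views'] for p in posterous]
    now = datetime.datetime.now(pytz.utc)
    dates = [dateutil.parser.parse(p['date']) for p in posterous]
    since = [(now - date).days for date in dates]
    plot(since, views, 'yo-')
    ylabel('Views')
    xlabel('Days since now')
    title('Views on Posterous')
    # Label only the posts that lie above a hand-tuned line in the plot
    labeled = [(d, v, p) for d, v, p in zip(since, views, posterous)
               if v > -6.64 * d + 5000]
    for (d, v, p), a in zip(labeled, ['left', 'left', 'center', 'left']):
        text(d, v, p['title'][:31] + '...', horizontalalignment=a)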
The last for-loop is at least PG-13 rated and should not be attempted at home.
Front-page news in the 25 June 2010 edition of Ingeniøren, the Danish engineering weekly magazine, is a story on “issues” in some foreign mining companies: Newmont, Rio Tinto, Freeport McMoRan, GoldCorp, Vedanta and Anglo Platinum. The accusations put forward are environmental pollution, inadequate worker protection, forced removal of the local population, corruption and even murder. The story goes on: Watchdog DanWatch has examined contracts and found that the Danish companies Grundfos and FLSmidth have the mining companies as customers, e.g., selling pumps. Nothing is perhaps illegal about that. However, the two companies have stated corporate social responsibility (CSR) policies. For the investors FLSmidth writes on their website about suppliers:
In its general conditions of purchase, FLSmidth requires that subsuppliers comply with all local regulations concerning employee rights and safety and health.
And the two companies have also committed to the United Nations Global Compact – a voluntary CSR commitment they can use for bragging. In interviews neither Kim Nøhr Skibssted from Grundfos nor Jørgen Huno Rasmussen from FLSmidth believes that the companies have any responsibility for their customers' actions. To me that would seem a fundamental misunderstanding of what a CSR commitment entails, though the issue of CSR is not as clear for a customer relation as it is for a supplier relation.
In our project Responsible Business in the Blogosphere we seek to examine how CSR issues appear in social media, so how does the present story diffuse into the blogosphere? A search on FLSmidth and on Grundfos on Twitter returns nothing in relation to the story.
Searching the Facebook graph for Grundfos and FLSmidth gets you two items (one from DanWatch) and both linking to Ingeniøren. Searching with Danish blog and news search engine Overskrift.dk I see only Ingeniøren as well as one web page from Copenhagen Business School stating that one from the faculty has been cited in Ingeniøren. Google blog search has nothing. And so far Ingeniøren’s web page for the main story, Grundfos interview and the FLSmidth interview all have zero comments. It appears that very little is written about such a Danish CSR story in the blogosphere – which is in line with my general impression of other Danish CSR cases. The lack of CSR talk in the blogosphere will challenge our project.