
Sentiment colored sequential collaboration network


[Figure: Nielsen2013realtime]

Sentiment colored sequential collaboration network of some of the Wikipedians editing the Wikipedia articles associated with the Lundbeck company. Red indicates negative sentiment, green positive.

The “sequential collaboration network” is inspired by Analyzing the creative editing behavior of Wikipedia editors: through dynamic social network analysis. Brian Keegan has also done a similar kind of network visualization.

Sentiment analysis is based on the AFINN word list.
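For the curious, here is a minimal sketch of the idea, not the actual code behind the figure: consecutive editors of a page are linked, and each editor node is colored by the valence of the text added. The word list below is just a toy subset of AFINN, and the revision history is made up; networkx and matplotlib do the drawing.

    import matplotlib.pyplot as plt
    import networkx as nx

    AFINN_SUBSET = {"good": 3, "great": 3, "bad": -3, "problem": -2}  # toy subset of AFINN

    def score(text):
        """Sum of word valences; crude AFINN-style sentiment."""
        return sum(AFINN_SUBSET.get(word, 0) for word in text.lower().split())

    # Hypothetical revision history: (editor, text added in the revision).
    revisions = [("Alice", "great drug pipeline"),
                 ("Bob", "bad side effects reported"),
                 ("Alice", "problem with the trial"),
                 ("Carol", "good quarterly results")]

    graph = nx.DiGraph()
    for (editor_a, _), (editor_b, text_b) in zip(revisions, revisions[1:]):
        graph.add_edge(editor_a, editor_b)                  # sequential collaboration edge
        graph.nodes[editor_b]["sentiment"] = score(text_b)  # sentiment of the added text

    # Green for editors adding positive text, red for negative, gray otherwise.
    colors = ["green" if graph.nodes[n].get("sentiment", 0) > 0
              else "red" if graph.nodes[n].get("sentiment", 0) < 0
              else "gray" for n in graph]
    nx.draw_networkx(graph, node_color=colors)
    plt.show()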

Jean-Pierre Hombach and Amazon.com: Large-scale Wikipedia copyright infringers?


An entity calling itself “Jean-Pierre Hombach” presents itself with “I’m a German writer Comedian and short filmmaker. I’m studying media at the University of Vic.”

The profile on Twitter states “Jean-Pierre Hombach: I’m a German Hobbie writer Comedian and short filmmaker. I’m studying media at the University of Vic. Jean-Pierre speaks fluent English. Rio de Janeiro · http://goo.gl/bFdsV”. A Google Plus account and a Facebook account are also linked.

The shortlink leads to an Amazon.com page that lists 17 works, all of them published in the first part of 2012. What a prolific writer!

If you go to the Justin Bieber book on Google Books you will find “Copyright (C) Jean-Pierre Hombach. All rights reserved. ISBN 978-1-4710-8069-2”. So apparently this Hombach claims the copyright to the work.

If you go a bit further into the book you will read as the first line “Justin Drew Bieber is a Canadian pop/R&B singer, songwriter and actor.” That sounds awfully wikipedish, and an examination of the book quickly reveals that this is Wikipedia! “Hombach” has simply aggregated a lot of Wikipedia articles. If you go all the way to page 505 you will even see that the Jakarta Wikipedia article has been included in the Justin Bieber book… ehhh…?

I leave it as an exercise to the reader to examine the rest of Mr. “Hombach”’s books. You may, e.g., begin with the Bob Marley book.

Obviously the copyright does not belong to “Hombach”, but to the Wikipedia contributors. The content is licensed under CC BY-SA, and the license requires that this be stated (and that the content be re-licensed under the same license). Otherwise it is not even copyfraud; it is simply copyright infringement.

Amazingly, a book by “Jean-Pierre” reached number 16 on the music biographies bestseller list according to the Los Angeles Times. In that book the contributors are listed in the back, and the CC license may also be included, although that page is not available to me on Google Books. Maybe he has read a bit about the CC license.

Amazon.com will gladly sell that to you for $23.90 without telling you that the author is not Jean-Pierre.

One interesting issue to note is that “Hombach” copied Wikipedia hoaxster Legolas2186’s material on Lady Gaga. Initially this confused me, as the “Hombach” book was stated to be copyrighted in 2010, while the hoaxster Legolas2186 first added the segment to Wikipedia in the summer of 2011.

To me the wrongful attribution, lack of proper attribution and obfuscation (with respect to the copyright year) seem illegal. Wikipedia contributors to the respective articles should be able to sue Hombach and Amazon.com for selling their copyrighted works without the appropriate license.

Update 2013-01-25: A Google search on Jean-Pierre Hombach reveals that the works of Jean-Pierre have been used at least five times as sources in Wikipedia, i.e., we have a citation circle! Once in Belieber and once in Decca Records.

Update 2013-01-25: Apparently Wikipedia has a page for everything: http://en.wikipedia.org/wiki/Wikipedia:Republishers. Thanks to Gadget850 (Ed).

More on automated sentiment analysis of Danish politicians on Wikipedia


Earlier today I put up a sentiment analysis of Danish politician Ellen Trane Nørby based on the text of the Danish Wikipedia.

Unfortunately, I could not resist the temptation to spend a bit of time running the analysis for some other Danish politicians as well: Prime Minister Helle Thorning-Schmidt, former Prime Minister Lars Løkke Rasmussen, former Foreign Minister Lene Espersen and former Minister Ole Sohn.

For Rasmussen’s article we see a neutral, factual biographical article until 2008, though with a slight increase at the end of 2007 when he became Minister of Finance. Then in May 2008 we see a drop in sentiment with the introduction of a paragraph mentioning an “issue” related to his use of county funds for private purposes. Since then the article has been extended and is now generally positive. There are some spikes in the plots; these are typically vandalism that persists for a few minutes until reverted.

For Helle Thorning-Schmidt we see a gradual drop towards the election she won, and after that her article gains considerable positivity. I haven’t checked much in the history, but I believe it is related to the tax issue that she and her husband, Stephen Kinnock, had, and a number of other issues. As I remember, there was discussion on the Danish Wikipedia about whether these “issues” should fill up so large a portion of the article, and on 3 December 2011 a user moved the content to another page.

I believe I am one of the major perpetrators behind both the Lene Espersen and Ole Sohn articles. Both articles have large sections which describe negative issues (I really must work on my positivity, these politicians are not that bad). However, the sentiment analysis shows the Ole Sohn article as more positive. Maybe this is because the “controversy” section describes how he paid “tribute” to East Germany and how his party received “support” from Moscow, i.e., my simple sentiment analysis does not understand the controversial aspect of support from communist Moscow and just thinks that “support” is positive.

Writing politician articles on Wikipedia, I find it somewhat difficult to identify good positive articles that can be used as sources. The sources used for the encyclopedic articles usually come from news articles, and these often have a negative bias with a focus on “issues” (political problems). Writing the Lene Espersen article I found that even the book “Bare kald mig Lene” (“Just call me Lene”), which I have used as a source, has a negativity bias. If I remember correctly, Espersen did not want to participate in the development of the book, presumably because she already had the notion that the writers would focus on the problematic “issues” in her career.

[Figures: Nielsen2013python_llr, Nielsen2013python_hts, Nielsen2013python_leneespersen, Nielsen2013python_olesohn — sentiment over time for Lars Løkke Rasmussen, Helle Thorning-Schmidt, Lene Espersen and Ole Sohn]

(2013-01-10: spell correction)

Sentiment analysis of Wikipedia pages on Danish politicians


[Figure: Nielsen2013python_ellentrane]

We are presently analyzing company articles on Wikipedia with simple sentiment analysis to see whether any interesting patterns emerge, e.g., whether the Wikipedia sentiment correlates with real-world attitudes and events related to the company. Such analyses can uncover, for instance, that there was a small edit war around the Lundbeck articles in the beginning of December 2012. We are also able to see that the Arla Foods article was affected by the Muhammed cartoon crisis and the 2008 Chinese milk scandal.


In Denmark, in the beginning of January 2013, there has been media buzz about Danish politicians and their staff making biased edits on the Danish Wikipedia. The story, carried forth by journalist Lars Fogt, focused initially on Ellen Trane Nørby.


It is relatively easy to turn the methods we employ for companies towards Danish politicians. The sentiment analysis works by matching words against a word list labeled with “valence”. The initial word list worked only for English, but I have translated it to Danish and continuously extend it. So now one only needs to download the relevant Wikipedia history for a page and run the text through the sentiment analysis using the computer code I have already developed.
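My own code is more elaborate, but a bare-bones version of the procedure could look like the sketch below: fetch the revision history via the MediaWiki API and score each revision against a valence word list. The five Danish words here are only an illustrative toy subset, not my full translated list.

    import re
    import requests

    # Toy subset of a Danish valence word list (the real translated AFINN list
    # is much larger); the values are word valences between -5 and +5.
    VALENCES = {"god": 3, "imponerende": 3, "svag": -2, "udsatte": -2, "problem": -2}

    def sentiment(text):
        words = re.findall(r"\w+", text.lower())
        return sum(VALENCES.get(word, 0) for word in words)

    def revision_sentiments(title, limit=50):
        """Fetch recent revisions of a Danish Wikipedia page (newest first)
        and return (timestamp, sentiment) pairs."""
        params = {"action": "query", "prop": "revisions", "titles": title,
                  "rvprop": "timestamp|content", "rvslots": "main",
                  "rvlimit": limit, "format": "json", "formatversion": 2}
        data = requests.get("https://da.wikipedia.org/w/api.php", params=params).json()
        page = data["query"]["pages"][0]
        return [(rev["timestamp"], sentiment(rev["slots"]["main"]["content"]))
                for rev in page.get("revisions", [])]

    if __name__ == "__main__":
        for timestamp, score in revision_sentiments("Ellen Trane Nørby"):
            print(timestamp, score)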


The figure shows the sentiment for Ellen Trane Nørby’s Danish Wikipedia article through time. The largest positive jump in sentiment (the way that I measure it) comes from a user inserting content on 2 March 2011. This revision inserts, e.g., “great international commitment” and “impressive election”. Journalist Lars Fogt identified the user as Ellen Trane Nørby’s staff.


Surely the simple word list approach does not work well all the time. The second largest positive jump in sentiment arises when a user deletes part of the article for POV reasons. That part contained negative words such as svag (weak), trafficking and udsatte (exposed). The simple word list detects the deletion of these words as a positive event. However, the context they appeared in was actually positive, e.g., “… Ellen Trane Nørby is a socially committed politician, who also fights for the weak and exposed in society, …”.
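A tiny worked example of why such a deletion shows up as a positive jump: the jump is simply the difference between the scores of consecutive revisions, so removing negatively labeled words raises the score (toy valences and made-up revision texts, in the spirit of the sketch above).

    # Toy Danish valences; the revision jump is the score difference.
    VALENCES = {"svage": -2, "udsatte": -2}

    def sentiment(text):
        return sum(VALENCES.get(word, 0) for word in text.lower().split())

    before = "hun kæmper for de svage og udsatte i samfundet"  # positive context...
    after = "hun kæmper i samfundet"                           # ...but the words were deleted

    jump = sentiment(after) - sentiment(before)
    print(jump)  # +4: the deletion of the negative words looks like a positive event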


As far as I understand, journalist Lars Fogt used the Danish version of the Wikipedia Scanner provided by Peter Brodersen; see the list generated for Ellen Trane Nørby. Brodersen’s tool does not (yet?) provide an automated sentiment score, but it does a good job of providing an overview of the edit history.

(2013-01-16: typo correction)

NumPy beginner’s guide: Date formatting, stock quotes and Wikipedia sentiment analysis


[Figure: Nielsen2012numpy]

Last year I acted as one of the reviewers on a book from Packt Publishing: NumPy 1.5 Beginner’s Guide (ISBN-13: 978-1-84951-530-6), about the numerical programming library for the Python programming language. I was “blinded” by the publisher, so I did not know that the author was Ivan Idris before the book came out. For my reviewing effort I got a physical copy of the book, an electronic copy of another book and some new knowledge of certain aspects of NumPy.

One of the things that I did not know before I came across it while reviewing the book was the date formatter in the plotting library (matplotlib) and the ability to download stock quotes via a single function (there is an example starting on page 171 in the book; the quotes function lives in matplotlib’s finance module rather than in NumPy itself). There is a ‘candlestick’ plot function that goes well with the return value of the quotes download function.

The plot shows an example of the use of date formatting with stock quotes downloaded from Yahoo!, together with a sentiment analysis of the Wikipedia revisions for the Pfizer company.
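A minimal sketch of the date formatting part is given below. The quotes download and candlestick helpers used to live in matplotlib’s finance module, which has since been removed from matplotlib, so this sketch uses a synthetic price series instead of a Yahoo! download.

    import datetime

    import matplotlib.pyplot as plt
    import numpy as np
    from matplotlib.dates import DateFormatter, DayLocator

    # Synthetic daily "closing prices": a random walk around 20.
    dates = [datetime.date(2012, 1, 1) + datetime.timedelta(days=i) for i in range(60)]
    closes = 20 + np.cumsum(np.random.randn(60))

    fig, ax = plt.subplots()
    ax.plot(dates, closes)
    ax.xaxis.set_major_locator(DayLocator(interval=14))      # a tick every 14 days
    ax.xaxis.set_major_formatter(DateFormatter("%Y-%m-%d"))  # the date formatter
    fig.autofmt_xdate()                                      # rotate the tick labels
    plt.show()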

WikiViz: create the most insightful visualization of Wikipedia’s impact


The WikiViz challenge has now been officially announced. The challenge is to create the most insightful visualization of Wikipedia’s impact:

“The main goal of this competition is to improve our understanding of how Wikipedia is affecting the world beyond the scope of its own community.”

There are more details here.

Any inspiration? One of the organizers is Dario Taraborelli from the Wikimedia Foundation (he is one of the guys behind the not necessarily useful but ridiculously aesthetic Wikipedia deletion discussion visualization). He has made the ReaderMeter Mendeley readership analysis and visualization website. So maybe you could do something similar with the Wikipedia reader statistics at http://stats.grok.se/ or the raw data here? Hmmm… It may be a good idea to take a look at the First Monday article Visualizing the overlap between the 100 most visited pages on Wikipedia for September 2006 to January 2007. See also a Google image search for Wikipedia visualization.

Who knows Who Knows? You can now play a Facebook application while doing research


[Figure: Whoknows]

At the 8th Extended Semantic Web Conference, researchers from Potsdam showed a Facebook application. It is a quiz game called Who Knows?

The special thing about it is that the questions are automatically generated from Wikipedia via DBpedia. As users’ interactions with the game are recorded, the results may be used to improve the ranking of triple data in Semantic Web applications as well as to find errors in Wikipedia/DBpedia.
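As a rough illustration (this is not the WhoKnows? code, and the property and class chosen are just examples), a quiz question can be derived from DBpedia triples with a simple SPARQL query, here via the SPARQLWrapper package:

    import random

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Ask DBpedia for country/capital pairs and turn one triple into a
    # multiple-choice question (a much cruder heuristic than WhoKnows? uses).
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?country ?capital WHERE {
            ?country a dbo:Country ; dbo:capital ?capital .
        } LIMIT 20
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]

    pairs = [(row["country"]["value"].rsplit("/", 1)[-1],
              row["capital"]["value"].rsplit("/", 1)[-1]) for row in rows]
    country, capital = random.choice(pairs)
    distractors = random.sample([c for _, c in pairs if c != capital], 3)

    print("What is the capital of %s?" % country.replace("_", " "))
    for option in random.sample(distractors + [capital], 4):
        print(" -", option.replace("_", " "))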

The background scientific paper is WhoKnows? – Evaluating Linked Data Heuristics with a Quiz that Cleans Up DBpedia. Last author Harald Sack is presently on the top of the high score list.

Another of their Facebook quiz applications is Risq.

Hunting down the undead ghost of classical conductor George Richter


[Figure: Hansrichter1876]

At one point in life I acquired a CD with famous works of Edward Elgar: Pomp and Circumstance, Nimrod, Sospiri, the Cello Concerto and all that. I found the recording fairly good. The cover stated that the conductor was George Richter leading the London Symphony Orchestra. Googling my way around George Richter I cannot say that I found much: several references to CDs but not much about the man. The London Symphony Orchestra has had a Richter as conductor: Hans Richter. Perhaps George was related? But I could not find any information about that.

On 6 December 2009 I then added George Richter to Wikipedia in the hope that someone would seek out more sources. But on 16 May 2011 a fierce Wikipedia deletionist came by the article, threatening to kill poor George due to the lack of references, a cardinal sin for articles about living persons on Wikipedia. Interestingly though, the deletionist had diligently discovered one single reference through Google Books: Jonathan Brown’s (who is he?) book of 2000, “Tristan und Isolde on Record. A comprehensive discography of Wagner’s music drama with a critical introduction to the recordings.”

And now comes the spooky part.

On page xiii Brown describes our George as an “apocryphal conductor”. So what does that mean? That George didn’t get to join the Bible along with Mary, Moses and the rest of the band? No. Further on, on page 215, Brown states that one Wagnerian Richter recording is actually by Heinrich Hollreiser, and that the recording is not, as stated, with the London Symphony Orchestra but rather with the Bamberg Symphony Orchestra. Brown – the Wagnerian discographer – had timed the different recordings of Wagner and found that Hollreiser’s recording has been issued under a number of other names: Heinrich Heller, Hans Burg, Ralph deCross, Otto Friedlich, Karl Ritter, Leon Szalar and our George Richter. It seems likely that the Elgar recording is also not by George Richter but by another as yet unidentified conductor, perhaps Hollreiser?

Apparently George Richter is an invented name. Why?

My Elgar CD brands itself as “Apollo Classics” and the company issuing the CD in 1995 is “Wisepack Ltd.”. Tracking this company, I find that the Business Directory North Central London records Wisepack Ltd.’s address as “Unit 12. Latimer Road. London. W10 6RG” with a “PIN Tel.” 0904 049 8229. Their business is “production of records tapes and CDS”. So we are getting closer. The UK company register Companies House lists two entries for Wisepack: “Wisepack (1992) Limited”, incorporated 24 January 1992 and dissolved 2 April 1996, company number 02680885, with an unknown nature of business. The other entry is “Wisepack Limited”, incorporated in 1988 and dissolved 29 September 1998, with company number 02245831 and at the Latimer Road address. Its nature of business is stated as “Publishing of sound recordings”, “Reproduction of sound recording” and “Wholesale electric household goods”. Peeking with Google Street View we might get a glance at the Unit 12 address as it looks today. As far as I can identify, the Unit 12 address is today occupied by the company “city electrical factors ltd”, who describe themselves as “electrical wholesalers, suppliers of electrical equipment”.

Why would the Wisepack company invent a name and not attribute the recording to the right conductor?

Jonathan Brown may give us a hint. Google Books does not show page 216 of the book, but another page on the Internet may have listed the information from the Brown book. It reads: “The absence of copyright restrictions may explain why this recording has been issued under so many fictitious names”. My take on this is: the recordings of Wagner and Elgar are in the public domain, and several record publishers have taken advantage of this and reissued the recordings. They change the attribution to hide the source of the original recording, thereby inventing the ghost of George Richter. We are dealing with some copyright hanky-panky.

Whether this hypothesis is correct I do not know. In support I would say that the Wisepack logo looks like something done in DrawPerfect, not a logo from a highly esteemed company. Brown has some reservations about the identification of apocryphal conductors “because there remain a number of recordings attributed to conductors about whom very little, if anything is known”. Still, I say we are dealing with a ghost. And there are more ghosts to come: the Elgar cello concerto has a Veronique Desbois on the solo instrument. She is likely a ghost too.

The ghost of George Richter has also conducted the “Royal Danish Symphony Orchestra” in works by Smetana and Rimsky-Korsakov as well as an overture by Gioacchino Rossini. There are two major orchestras in Copenhagen, in English called the Danish National Symphony Orchestra and the Royal Danish Orchestra, so which one has the ghost conducted? The Rimsky-Korsakov piece is issued by the Sine Qua Non label, which resides at One Charles Street, Providence, RI, USA. A version of the Elgar is apparently also issued by the GR8 label under the brand “Spectacular Classics” (wow, what a name!). George Richter continues to issue CDs: as recently as 2005 a Beethoven CD was published, with Richter conducting the London Symphony Orchestra. You can buy a Richter-Smetana CD at Amazon. This CD also has the work of conductor Henry Adolph, another ghost according to an Anton Bruckner site.

Strange things are going on in classical music. One may begin by reading the Wikipedia article about British record producer William Barrington-Coupe, who according to a judge was involved in “blatant and impertinent frauds, carried out in my opinion rather clumsily.” One of his schemes, exposed in 2007, involved unauthorised copies of commercial recordings. These were rereleased under his wife’s name, Joyce Hatto – and highly acclaimed. Barrington-Coupe and Hatto are real people – non-ghosts – though one of them is dead. The conductor in the fraud scheme is holocaust survivor René Köhler, who supposedly died in 2002; he is likely a ghost – an invention of Barrington-Coupe. The death of George Richter has not been announced, so we may continue to hear recordings from this undead ghost, – if he is a ghost. ;-)

(2011-09-21. Minor change: spell correction)

On the completeness of completeness of Wikipedia


[Figure: Royal2008whats]

In a 2008 paper from Social Science Computer Review with the running title Completeness of Information on Wikipedia, the two authors Cindy Royal and Deepina Kapila examine how the lengths of sets of Wikipedia articles compare with other variables: year, column inches of the corresponding article in the Micropaedia of Britannica, country population and company revenue. The full title of the study is What’s on Wikipedia, and what’s not…? Assessing completeness of information.

I have been trying to discuss with myself what to think about this article, and we have not yet reached a conclusion.

Let me first summarize the article:

A 2008 study compared the number of words in sets of Wikipedia articles with the year associated with the articles and found that articles associated with recent years tended to be longer, i.e., recency was to a certain extent a predictor of coverage: the length of the year articles between 1900 and 2008, with the year as the predictor variable, had a Spearman correlation coefficient of 0.79. The results were not homogeneous, as the lengths of the articles for Time’s person of the year had a correlation of zero with the year. Academy Award-winning films and ‘artist with #1 song’ had correlations of 0.47 and 0.30, respectively. The authors of the study also examined other sets of articles in Wikipedia and their correlation with column inches in the Micropaedia of the Encyclopedia Britannica, country population and company revenue. The correlations were 0.26, 0.55 and 0.49, respectively. In their comparison with 100 articles from the Micropaedia they found that 14 of them had no Wikipedia entry, e.g., ‘Russian Association of Proletariat’, ‘League for the Independence of Vietnam’ and ‘urethane’.

I have made an entry on the Brede Wiki for this article.

  • I think it is an interesting methodology to use the length of the article and see how it correlates with other variables such as year, population and revenue.
  • They write that “Urethane” did not have a Wikipedia entry. However, as far as I can determine, the initial version of “Urethane” is from 12 May 2006, Carbamate from 20 May 2004, Ethyl carbamate from 14 October 2003 and Polyurethane from 9 April 2002.
  • The authors write in the discussion “…it was clear that the more common or popular terms had the most detailed coverage”. This claim is not covered by the results.
  • In relation to country population the authors write “…the democratic nature of Wikipedia on its own cannot counteract the effects of the magnitude of people who are available to participate”. But there is no discussion of whether a country such as Nauru should have the same-sized article as the article about India. To me it is most “democratic” if India has a larger article than Nauru, not that they have the same size.
  • For companies (Fortune 1000): “…this points to the strength of financial power in circumventing any type of democratizing feature of an online space”. But it is not at all clear that small companies should have the same-sized articles as big companies. Should the Wikipedia article for the fish shop on Lyngby Hovedgade be as large as the one for Lego? No! Does this issue circumvent democracy? No!
  • The Law of Steve Lawrence states that “everything looks like a straight line in a double logarithmic plot”. Now, the authors only use semilogarithmic plots, but there is still the issue that the determination of “long-tailedness” is somewhat subjective. To me the “L” and the “long-tailedness” in the curve of their figure 1a are not so evident.
  • “There is a clear progression of the length of each article, with a dramatic increase occurring starting in 2001” is stated in the results section. Here the authors fail to mention that the most recent years show a significant drop. In the analysis I performed I see (apart from the year 2008) a decreasing trend from 1967. The plot is shown above and is based on characters rather than words; a rough sketch of such an analysis is given below.
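For the curious, here is a rough sketch of how the year-versus-length correlation can be redone (as referenced in the last bullet). It uses the page length in bytes as reported by the MediaWiki API rather than word counts, so the numbers will not match the paper or my own plot exactly.

    import requests
    from scipy.stats import spearmanr

    API = "https://en.wikipedia.org/w/api.php"
    years = list(range(1900, 2009))

    lengths = {}
    for start in range(0, len(years), 50):               # the API allows 50 titles per query
        titles = "|".join(str(year) for year in years[start:start + 50])
        params = {"action": "query", "prop": "info", "titles": titles,
                  "format": "json", "formatversion": 2}
        for page in requests.get(API, params=params).json()["query"]["pages"]:
            lengths[int(page["title"])] = page.get("length", 0)  # page length in bytes

    rho, pvalue = spearmanr(years, [lengths[year] for year in years])
    print("Spearman correlation between year and article length: %.2f" % rho)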

Pew survey: 42% of adult Americans use Wikipedia


Pew Research Center’s Internet & American Life Project conducted an interesting telephone-based survey about Internet and Wikipedia use in spring 2010. The report with the results was published around the 10-year anniversary of Wikipedia in January 2011. They have a previous report from 2007. They report that 42% of adult Americans used Wikipedia in May 2010, up from 25% in 2007. If we linearly extrapolate then 110% of adult Americans will be using Wikipedia in 13 years.
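The extrapolation is of course tongue in cheek; a back-of-the-envelope version of the arithmetic, assuming only the two data points (25% in 2007 and 42% in 2010) and constant linear growth:

    # Back-of-the-envelope linear extrapolation of the Pew numbers.
    share_2007, share_2010 = 25.0, 42.0
    rate = (share_2010 - share_2007) / 3    # percentage points per year

    print(share_2010 + rate * 13)           # roughly 116%, comfortably past 110%
    print((100 - share_2010) / rate)        # about 10 years to pass the 100% mark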

If you are a young white man with a high education level, a broadband connection and a good (but not the highest) income, it is likely that you use Wikipedia. On the other hand, if you are an old Hispanic woman with no high school education and a low income, sitting with a dial-up connection, then you are less likely to use Wikipedia.

One thing that surprised me was how little difference there was between male and female users of Wikipedia. Among Internet users, 56% of males use Wikipedia while the corresponding figure for females is 50%. These percentages are for readers. I suppose males are more active as writers, judging from my personal experience. That is also what the Wikipedia Survey finds (page 7): only 13% of the Wikipedia contributors that took the survey are female.

There are a few things I don’t understand. They report Internet use among adult Americans to be 79%. In the methodology section they report a sample size of 2’252, of which 1’756 are Internet users. If you divide the two numbers you get 1756/2252 = 0.77975, which is nearer to 78% than 79%. Another strange issue is that there are 1’756 Internet users (according to the methodology), while for the characterization of the demographics of Wikipedia users there are only 852 Internet users. They report that they called 33’594 phone numbers and got a “response rate” of around 20%. 20% of 33’594 gives around 6’700, which is not 2’252. So where is the rest lost? Perhaps somewhere around the “completion rate”, “eligibility rate” and “cooperation rate”? Could the 78/79% issue be related to the weighting used to correct for biases in the telephone interviews…?

The data is available on their homepage.