On the completeness of Wikipedia

In a 2008 paper in Social Science Computer Review, with the running title Completeness of Information on Wikipedia, the two authors Cindy Royal and Deepina Kapila examine how the lengths of sets of Wikipedia articles compare with other variables: year, column inches in the corresponding article in the Micropaedia of Britannica, country population and company revenue. The full title of the study is What’s on Wikipedia, and what’s not…? assessing completeness of information.

I have been trying to discuss with myself what to think about this article, and we have not yet reached a conclusion.

Let me first summarize the article:

A 2008 study compared the number of words in sets of Wikipedia articles with the year associated with the articles and found that articles associated with recent years tended to be longer, i.e., recency was to a certain extent a predictor of coverage: the length of the year articles between 1900 and 2008 had a Spearman correlation coefficient of 0.79 with the year as predictor variable. The results were not homogeneous: the length of the articles for Time's person of the year had a correlation of zero with the year, while Academy Award winning films and ‘artist with #1 song’ had correlations with the year of 0.47 and 0.30, respectively. The authors of the study also examined other sets of Wikipedia articles, correlating their length with column inches in the Micropaedia of the Encyclopedia Britannica, country population and company revenue. The correlations were 0.26, 0.55 and 0.49, respectively. In their comparison with 100 articles from the Micropaedia they found that 14 of them had no Wikipedia entry, e.g., ‘Russian Association of Proletariat’, ‘League for the Independence of Vietnam’ and ‘urethane’.

  • I think it is an interesting methodology to use the length of the article and see how it correlates with other variables such as year, population and revenue.
  • They write that “Urethane” did not have a Wikipedia entry. However, as far as I can determine, the initial version of “Urethane” is from 12 May 2006, Carbamate from 20 May 2004, Ethyl carbamate from 14 October 2003 and Polyurethane from 9 April 2002.
  • The authors write in the discussion “…it was clear that the more common or popular terms had the most detailed coverage”. This is not covered in the results.
  • In relation to country population the authors write “…the democratic nature of Wikipedia on its own cannot counteract the effects of the magnitude of people who are available to participate”. But there is no discussion of whether a country such as Nauru should have the same sized article as India. To me it is most “democratic” if India has a larger article than Nauru, not that they have the same size.
  • For companies (Fortune 1000): “…this points to the strength of financial power in circumventing any type of democratizing feature of an online space”. But it is not at all clear that small companies should have the same sized articles as big companies. Should the Wikipedia article for the fish shop on Lyngby Hovedgade have the same sized article as Lego? No! Does this issue circumvent democracy? No!
  • The Law of Steve Lawrence states that “everything looks like a straight line in a double logarithmic plot”. Now the authors only use semilogarithmic plots, but there is still the issue that the determination of “long-tailedness” is somewhat subjective. To me the “L” and “long-tailedness” in the curve of their figure 1a is not so evident.
  • “There is a clear progression of the length of each article, with a dramatic increase occurring starting in 2001” is stated in the results section. Here the authors fail to mention that the most recent years show a significant drop. In the analysis I performed I see (apart from year 2008) a decreasing trend from 1967. The plot is shown above and is based on character counts rather than word counts.
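
For readers who want to try the same kind of comparison themselves, here is a minimal sketch of a Spearman rank correlation between year and article length. It is not the authors' code and not the script behind my plot; the function names are my own and the numbers are made up purely for illustration. Spearman's coefficient is simply the Pearson correlation computed on the ranks of the two variables, with tied values sharing their average rank.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example data, NOT taken from the paper or from Wikipedia.
years = [2001, 2002, 2003, 2004, 2005]
lengths = [1200, 1500, 1400, 2100, 2600]  # e.g. characters per year article

rho = spearman(years, lengths)
print(round(rho, 2))  # 0.9
```

Because only the ranks matter, the coefficient is the same whether the lengths are counted in words or in characters, as long as the ordering of the articles does not change.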

