Big big data


Big data, one of recent years’ new buzzwords, has now gotten itself a book with said title. Viktor Mayer-Schönberger and Kenneth Cukier’s “Big data: a revolution that will transform how we live, work and think” focuses mostly on what businesses can do with big data, and a technically oriented data scientist will not find much material in it. The book is from 2013 and already seems dated in the light of the Snowden revelations. The authors’ critique of personal big data collection does not mention the dragnet operations of signals intelligence agencies, apart from an eight-line paragraph on William Binney.

The authors claim three features of big data (“three major shifts of mindset”): “more”, messy, and correlation rather than causality. I am not entirely convinced that these features distinguish big data. Interventional A/B testing seems, at least to some degree, to probe causality rather than just correlation. Major Internet companies continuously run such tests on unsuspecting users on a large scale. Thus I would say big data processing is indeed probing causality. Nor do I agree that big data is messier than old-time small data. Anyone working seriously with small data may easily find that handling such data can be a considerable headache and requires some processing and ‘understanding’. Indeed, big data technologies have brought us means for handling messy data in a more structured way (JSON, NoSQL, Semantic Web, Wikidata). The reason small data may feel less messy could be that its clean-up can be done manually in a spreadsheet by a non-programmer, while for big data you need automatic tools and probably a programmer.
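To make the A/B testing point a bit more concrete, here is a minimal sketch of why a randomized test probes causality: users are assigned to a variant at random, so a difference in outcome between the groups can be attributed to the variant rather than to some confounder. Nothing in it comes from the book; the conversion rates, the number of users and the function names are all made up for illustration, and the two-proportion z-test is just one common way to evaluate such a test.

```python
import math
import random
from statistics import NormalDist


def simulate_ab_test(n_users, rate_a, rate_b, seed=0):
    """Randomly assign simulated users to variant A or B and count conversions.

    rate_a and rate_b are hypothetical "true" conversion rates.
    """
    rng = random.Random(seed)
    counts = {"A": [0, 0], "B": [0, 0]}  # variant -> [users, conversions]
    for _ in range(n_users):
        variant = rng.choice("AB")  # the random assignment is the intervention
        rate = rate_a if variant == "A" else rate_b
        counts[variant][0] += 1
        counts[variant][1] += rng.random() < rate  # True counts as 1
    return counts


def two_proportion_z_test(n_a, conv_a, n_b, conv_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical experiment: variant B converts slightly better than variant A.
counts = simulate_ab_test(n_users=100_000, rate_a=0.10, rate_b=0.11)
(n_a, conv_a), (n_b, conv_b) = counts["A"], counts["B"]
diff, z, p_value = two_proportion_z_test(n_a, conv_a, n_b, conv_b)
print(f"difference={diff:.4f}, z={z:.2f}, p={p_value:.4g}")
```

With observational data alone, the same difference in conversion rates could stem from who happened to see which variant; it is the random assignment that licenses the causal reading.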

The authors also claim that we will see the rise of a profession called ‘the algorithmist’, whose job it will be to review algorithms. I do not think this is likely. The closest we will probably get is the Google advisory board on the ‘right to be forgotten’.

The authors also fail to give us a proper critique of big data hype. Their initial example of Google Flu Trends is dated: a publication from March 2014 shows that Google Flu Trends gave a wrong flu prevalence estimate (see ‘The Parable of Google Flu: Traps in Big Data Analysis’). The EEG big data company Zeo, mentioned in the book and hailed back in 2013 as one of the “8 Best Sleep Tracking Apps and Devices”, has run out of money, is ‘out of business’ and you won’t find a response from

While the authors tell us that companies collect vast amounts of data and that “Companies may be powerful”, they assure us on page 156 that the companies “don’t have the state’s powers to coerce”. Well, yes. But states have the ability to coerce a company to hand over any personal data. Indeed, U.S. companies are coerced to hand over overseas data: Judge Loretta A. Preska of the United States District Court told Microsoft as much. And within the U.S. PRISM program the handover is determined in secret FISA courts.


Review also available on LibraryThing.



