Month: April 2015

What is ‘big data’: a definition

What is ‘big data’? How big does data have to be before it counts as ‘big data’? Here is my attempt at a definition:

  • ‘Small data’ is of a size such that it can easily be handled in a general-purpose spreadsheet program. It is possible to navigate the data to get an overview of its rows and columns and to inspect each element directly. An example of a ‘small data’ data set is the Iris flower data set with 150 examples and 5 columns. Another is the Pima Indians Diabetes Data Set.
  • ‘Medium data’ is data that fits within a single machine, but cannot be handled with a general-purpose statistical program. It is difficult to get an overview of the data. A typical ‘medium data’ data set may be several hundred megabytes in size.
  • ‘Big data’ is data where merely reading the data presents a problem. Not all of the data can be handled on one computer at once, so one needs to iterate over the data samples when analysing it. In the 1990s, a positron emission tomography neuroimaging data set with 20 subjects would have been an example of ‘big data’: to do principal component analysis we would iterate over the subjects to build up the covariance matrix incrementally before an eigenvector analysis. Data of that size I would now consider ‘medium data’: it fits into a single matrix that a plethora of algorithms can analyze directly.
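
The iterate-then-decompose pattern in the last bullet can be sketched roughly as follows. This is only an illustration under my own assumptions, not the code we used back then: the synthetic `make_subject` data, the function names, and the use of power iteration in place of a full eigenvector analysis are all made up for the example. The point is that only running sums are kept in memory while the subjects are visited one at a time.

```python
import math
import random


def make_subject(seed, rows=50):
    """Hypothetical stand-in for loading one subject's data (rows x 3)."""
    rng = random.Random(seed)
    data = []
    for _ in range(rows):
        t = rng.gauss(0, 3)  # shared signal along the direction (1, 2, 0)
        data.append([t + rng.gauss(0, 0.1),
                     2 * t + rng.gauss(0, 0.1),
                     rng.gauss(0, 0.1)])
    return data


def accumulate(subject_batches):
    """One pass over the batches, keeping only running sums in memory."""
    n = 0
    total = None  # per-column sums, length d
    cross = None  # d x d sum of outer products x x^T
    for batch in subject_batches:
        for row in batch:
            if total is None:
                d = len(row)
                total = [0.0] * d
                cross = [[0.0] * d for _ in range(d)]
            n += 1
            for i, xi in enumerate(row):
                total[i] += xi
                for j, xj in enumerate(row):
                    cross[i][j] += xi * xj
    return n, total, cross


def covariance(n, total, cross):
    """Sample covariance from the running sums: (S - n m m^T) / (n - 1)."""
    d = len(total)
    mean = [t / n for t in total]
    return [[(cross[i][j] - n * mean[i] * mean[j]) / (n - 1)
             for j in range(d)] for i in range(d)]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration, a simple stand-in for a full eigendecomposition."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# The generator yields one subject at a time, so the full data set
# never has to sit in memory together.
subjects = (make_subject(seed) for seed in range(20))
n, total, cross = accumulate(subjects)
cov = covariance(n, total, cross)
first_component = leading_eigenvector(cov)
```

With data that does fit in memory one would simply stack everything into a single matrix and call a library routine. Note also that the sum-of-products formula used here can lose precision when the means are large relative to the spread, so a real implementation would probably centre the data or use a numerically safer update.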

Different technologies may blur the boundary by making the iteration transparent. Spark SQL allows one to run SQL queries on distributed data sets, so that, from the data analyst's point of view, a ‘big data’ data set looks like a ‘medium data’ data set.

This definition differs from that of Matt Hunt of Bloomberg. His definition is “‘Medium data’ refers to data sets that are too large to fit on a single machine but don’t require enormous clusters of thousands of them: high terabytes, not petabytes.” In their book, Viktor Mayer-Schönberger and Kenneth Cukier identify three distinguishing features of ‘big data’: “more”, messy, and correlation rather than causality. Other definitions can be found in Wikipedia’s Big data article.