What is ‘big data’: a definition

What is ‘big data’? How big does ‘big data’ have to be before it is ‘big data’? Here is my attempt at a definition:

  • ‘Small data’ is of a size such that it can easily be handled in a general-purpose spreadsheet program. It is possible to navigate the data to get an overview of the rows and columns as well as to look at each element directly. An example of a ‘small data’ data set is the Iris flower data set with 150 examples and 5 columns. Another one is the Pima Indians Diabetes Data Set.
  • ‘Medium data’ is data that fits within a single machine, but cannot be handled with a general-purpose statistical program. It is difficult to get an overview of the data. A typical ‘medium data’ data set size may be several hundred megabytes.
  • ‘Big data’ is data where the reading of the data itself presents a problem. The data cannot all be handled on one computer, so one needs to iterate over the data samples when analysing the data. In the 1990s a positron emission tomography neuroimaging data set with 20 subjects would have been an example of ‘big data’: to do principal component analysis we would iterate over the subjects to build up the covariance matrix before an eigenvector analysis. I would now consider data of that size ‘medium data’: it fits into a single matrix, where a plethora of algorithms can analyze it directly.
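
The iterate-then-decompose idea in the last bullet can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code used back then: the function names, the chunk layout (one chunk of samples per subject) and the toy data are all hypothetical, and power iteration stands in for a full eigenvector analysis.

```python
import math


def streaming_covariance(chunks, n_features):
    """Build a covariance matrix while seeing only one chunk at a time."""
    n = 0
    sums = [0.0] * n_features
    outer = [[0.0] * n_features for _ in range(n_features)]
    for chunk in chunks:  # e.g. one chunk per subject, read from disk
        for row in chunk:
            n += 1
            for i in range(n_features):
                sums[i] += row[i]
                for j in range(n_features):
                    outer[i][j] += row[i] * row[j]
    means = [s / n for s in sums]
    # Sample covariance: (sum of x x^T - n * mean mean^T) / (n - 1)
    return [
        [(outer[i][j] - n * means[i] * means[j]) / (n - 1) for j in range(n_features)]
        for i in range(n_features)
    ]


def leading_eigenvector(matrix, iterations=100):
    """Power iteration: approximates the first principal axis."""
    size = len(matrix)
    v = [1.0] * size  # assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: two "subjects", each a chunk of (x, y) samples on y = 2x.
toy_chunks = [[[1.0, 2.0], [2.0, 4.0]], [[3.0, 6.0], [4.0, 8.0]]]
cov = streaming_covariance(iter(toy_chunks), n_features=2)
axis = leading_eigenvector(cov)
```

Only the running sums and the small accumulator matrix stay in memory, so the chunks themselves can be arbitrarily many. For the toy data the resulting axis points along the line y = 2x. The plain sum-of-products update is the simplest form; a numerically more careful update would probably be preferable for real data.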

Different technologies may blur the boundary by making the iteration transparent. Spark SQL allows one to make SQL queries on distributed data sets, making it look – from the data analyst's point of view – as if a ‘big data’ data set were a ‘medium data’ data set.

This definition differs from that of Matt Hunt of Bloomberg. His definition is “‘Medium data’ refers to data sets that are too large to fit on a single machine but don’t require enormous clusters of thousands of them: high terabytes, not petabytes.” In their book, Mayer-Schönberger and Kenneth Cukier identify three distinguishing features of ‘big data’: “more”, messy, and correlation rather than causality. Other definitions can be found in Wikipedia’s Big data article.


Denial of Service crawl on the Brede Wiki?

Just as I was about to download a meta-analytic comma-separated values file from the Brede Wiki, the server hosting the wiki got into deep trouble. Though there was some response, it was really slow. I had to do a hard reset. When I looked in the log files I could see something like “trx0undo.c … Mutex at … created file trx0rseg.c” and “InnoDB: Warning: a long semaphore wait”. I had a similar problem yesterday.

I was afraid that this might be a hard disk issue, but the hard disk utility command “smartctl -a /dev/hda1” reported nothing.

If one googles the error message, a few bugs and questions show up, but apparently nothing that could help me.

When I then looked in the Apache log (/var/log/apache2/access.log) I could see aggressive downloading from a specific foreign university computer, with several requests made per second at around the time the server got into trouble. So it might be that MediaWiki/MySQL has a problem there – not being able to handle that number of requests. I wrote the following email to the university department:

Dear … of Computer Science,

I am recording aggressive downloads on my Web server from 999.999.999.999 which resolves to …, so it must be a computer at your site.

The number of downloads unfortunately makes my server stall – it is a rather old computer that cannot handle much load. It is probably a bot (perhaps constructed by one of your students) that has been set up to crawl my site. I hope you can contact the person who is responsible for the bot and ask him to moderate the download rate. At the moment I am getting several requests per second from the 999.999.999.999 computer.

The person behind the bot has set the agent field wrong. At the moment it displays “firefox 3.0” which I very much doubt.

If it is not possible for you to contact the person I might have to set up a firewall rule preventing the University of … from accessing my Web server.
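
The log inspection that prompted this email can be sketched in a few lines of Python. It is a rough sketch under assumptions: it presumes the access log is in Apache's common or combined format, where the timestamp has one-second resolution, and the function names, the threshold and the sample lines (with documentation-range addresses) are made up for illustration.

```python
import re
from collections import Counter

# Start of a common/combined log line: client address, two fields, [timestamp].
LOG_LINE = re.compile(r"^(\S+) \S+ \S+ \[([^\]]+)\]")


def requests_per_second(lines):
    """Count requests for each (client address, timestamp) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # lines that do not parse are skipped
            counts[match.groups()] += 1
    return counts


def heavy_clients(lines, threshold=3):
    """Map each client to its busiest second, keeping those at or above threshold."""
    peak = {}
    for (client, _second), count in requests_per_second(lines).items():
        peak[client] = max(peak.get(client, 0), count)
    return {client: count for client, count in peak.items() if count >= threshold}


# Hypothetical log lines, not taken from the real server.
sample_log = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /wiki/A HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /wiki/B HTTP/1.1" 200 1830',
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /wiki/C HTTP/1.1" 200 912',
    '198.51.100.7 - - [10/Oct/2000:13:55:36 -0700] "GET /wiki/D HTTP/1.1" 200 512',
]
offenders = heavy_clients(sample_log)
```

With the real file one would pass `open("/var/log/apache2/access.log")` instead of the sample list. In the sample only 192.0.2.1 reaches three requests within a single second, so it is the only client reported.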


I have now also added “Crawl-delay: 3” to the robots.txt file. I do not know how well different crawlers implement that directive.
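
For a sense of how a cooperative crawler could read that directive, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `/private/` path and the bot name `ExampleBot` are assumptions for illustration, not the actual file on the server. Crawl-delay was never part of the original robots exclusion convention, so whether a given crawler honours it is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, assumed for this example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler would wait this many seconds between requests.
delay = parser.crawl_delay("ExampleBot")

# It would also check each path before fetching it.
may_fetch_private = parser.can_fetch("ExampleBot", "/private/page")
may_fetch_index = parser.can_fetch("ExampleBot", "/index.html")
```

Here `delay` is 3, the private page is refused and the index page is allowed. Nothing forces a bot to run such a check, which is why the directive only helps against well-behaved crawlers.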

If the request rate did cause the problem, I am a bit puzzled that MediaWiki/MySQL cannot handle that rate. It is a fairly old computer, but it should fail gracefully. Maybe I need to go over the configuration. I suppose the issue might be related to “$wgDisableCounters”: I believe the page counters require a write during the reading process. It is nice to have the download statistics, but they are not essential.

Does Yandex honor robots.txt?

I have set up a robots.txt with “User-agent: *” and appropriate Disallow lines, but I discovered in my log that the Apache2 server was under heavy load from the bots of the Russian search engine Yandex. Is it me who has set up the robots.txt wrongly? As far as I can see, no other bots get to the places I do not want crawled.

People on the internet suggest “User-agent: Yandex” and a disallow right after, but others claim that Yandex does not look at robots.txt and suggest putting the following in the .htaccess file in the document root (usually /var/www/):

SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
Order Deny,Allow
Deny from env=bad_bot

This seems to work for me, although I also needed something like “AllowOverride all” in the configuration file usually found in the directory /etc/apache2/sites-available/
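
To see which User-Agent strings the pattern in that rule would catch, it can be tried out with Python's `re` module. This is only an approximation under assumptions: Apache uses its own regex engine, although for a pattern this simple the behaviour should agree, and the sample agent strings are illustrative rather than taken from my log.

```python
import re

# Same pattern as in the .htaccess rule; IGNORECASE mirrors SetEnvIfNoCase.
PATTERN = re.compile(r"^Yandex*", re.IGNORECASE)


def is_blocked(user_agent):
    """True if the rule would set bad_bot for this User-Agent string."""
    return PATTERN.search(user_agent) is not None


# Illustrative agent strings (assumed for the example).
old_style = "Yandex/1.01.001 (compatible; Win16; I)"
mozilla_style = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

blocked_old = is_blocked(old_style)
blocked_mozilla = is_blocked(mozilla_style)
blocked_prefix_only = is_blocked("Yande")  # "x*" means zero or more "x"
```

The `^` anchors the match at the start of the string, so an agent that begins with “Mozilla/5.0” and mentions Yandex only later is not caught, while one that begins with “Yandex” is. Dropping the `^` and the `*` would match the name anywhere in the string, if that is what is wanted.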

So this is one of the silly things you can spend your life on.