How to read all of Wikipedia in half an hour

Some may wonder how you can read Wikipedia from the huge dumps made available at http://download.wikipedia.org. The present file for the current English Wikipedia is 5.6 GB, and that is compressed! The tricks I used for the article Scientific citations in Wikipedia, published in First Monday in 2007, were bzcat, pipes and Perl. In Perl it is relatively easy to call the decompression program bzcat and capture its output “as you go”, so it is not necessary to decompress the dump into one very large file first. By redefining the input record separator I iterate over each wiki page and then perform some extraction on it, in my case looking for the ‘cite journal’ template.

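Stripped to its essentials the trick looks like the sketch below: open a pipe from bzcat and set the input record separator to the closing page tag, so that each read returns one complete wiki page. The sketch is mine, not part of the original program, and the dump file name is only a placeholder; the full program follows further down.

#!/usr/bin/perl -w

use English;

# Each read from the pipe now returns everything up to and including "</page>",
# i.e. one wiki page (plus a small stub of trailing XML after the last page),
# without first decompressing the dump to disk.
$INPUT_RECORD_SEPARATOR = "</page>";

open($fileInput, "bzcat enwiki-pages-articles.xml.bz2 |") || die("can't open pipe: $!");
$pagenumber = 0;
while (<$fileInput>) {
    $pagenumber++;
}
close($fileInput);
print "Read $pagenumber records\n";
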
The full original program, slightly edited, is listed below. As far as I remember it took around half an hour to run.

#!/usr/bin/perl -w

use open ':utf8';
use English; 
# Dump prefix may be given on the command line; default to the March 2008 English dump
if (@ARGV) { $archive = $ARGV[0]; } else { $archive = "enwiki-20080312"; }
$filenameInput  = $archive . "-pages-articles.xml.bz2";
$filenameOutput = $archive . "-cite-journal.txt";
$filenameTitles = $archive . "-titles.txt";

# Read one wiki page per iteration by using the closing page tag as record separator
$INPUT_RECORD_SEPARATOR = "</page>";
open($fileInput, "bzcat $filenameInput |") || die("can't open file ($filenameInput): $!"); 
open($fileOutput, "> $filenameOutput") || die("can't open file ($filenameOutput): $!"); 
open($fileTitles, "> $filenameTitles") || die("can't open file ($filenameTitles): $!");  
$pagenumber = 1;
while (<$fileInput>) {
    # Match "Cite journal" template  
    @citejournals = m/({{\s*cite journal.*?}})/sig;

    # Extract the page title and collapse whitespace in it
    @titles = m|<title>(.*?)</title>|;
    $titles[0] =~ s/\s+/ /g;
    # Remove consecutive whitespaces and print to file
    foreach $citejournal (@citejournals) {
          $citejournal =~ s/\s+/ /g;
          print $fileOutput "$pagenumber: $citejournal\n";
          print $fileTitles "$pagenumber: $titles[0]\n";
    }
    $pagenumber++;
}  
close($fileInput);
close($fileOutput);
close($fileTitles);

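To run it, bzcat must be installed and the compressed dump placed in the current directory; the dump prefix is passed as the first argument. Assuming the program is saved as citejournal.pl (the file name is my choice, not from the original), the invocation would be something like:

perl citejournal.pl enwiki-20080312
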
You can consider the code to be under the GPL. If you are a scientist and use it, I would be glad if you would also cite the associated First Monday paper. The program was also used for Clustering of scientific citations in Wikipedia.

Good luck with Wikipedia mining!
