Month: February 2010

How to read all of Wikipedia in half an hour

Some may wonder how you are able to read Wikipedia from the huge dumps that are made available for download. The present file for the current English Wikipedia is 5.6 GB, and that is compressed! The tricks I used for the article Scientific citations in Wikipedia, published in First Monday in 2007, were bzcat, pipes and Perl. In Perl it is relatively easy to call the decompression program bzcat and then capture its output within Perl “as you go”, so it is not necessary to decompress the entire dump into one very large file. By redefining the input record separator I iterate over each wiki page and then perform some extraction on each page, in my case looking for the ‘cite journal’ template.

The original program, slightly edited, is listed below. As far as I remember it took around half an hour to run the program.

#!/usr/bin/perl -w

use open ':utf8';
use English;

if (@ARGV) { $archive = $ARGV[0]; } else { $archive = "enwiki-20080312"; }

$filenameInput  = $archive . "-pages-articles.xml.bz2";
$filenameOutput = $archive . "-cite-journal.txt";
$filenameTitles = $archive . "-titles.txt";

open($fileInput, "bzcat $filenameInput |") || die("can't open file ($filenameInput): $!");
open($fileOutput, "> $filenameOutput") || die("can't open file ($filenameOutput): $!");
open($fileTitles, "> $filenameTitles") || die("can't open file ($filenameTitles): $!");

# Read one wiki page per record instead of one line (assumed separator: "<page>")
$INPUT_RECORD_SEPARATOR = "<page>";

$pagenumber = 1;
while (<$fileInput>) {
    # Match "Cite journal" template
    @citejournals = m/(\{\{\s*cite journal.*?\}\})/sig;
    # Page title (assumes it sits in a <title> element on a single line)
    @titles = m|<title>(.*?)</title>|;
    $titles[0] =~ s/\s+/ /g if @titles;
    # Remove consecutive whitespaces and print to file
    foreach $citejournal (@citejournals) {
        $citejournal =~ s/\s+/ /g;
        print $fileOutput "$pagenumber: $citejournal\n";
        print $fileTitles "$pagenumber: $titles[0]\n";
    }
    $pagenumber++;
}

You can consider the code as under GPL. If you are a scientist and use it, I would be glad if you could also cite the associated First Monday paper. The program was also used for Clustering of scientific citations in Wikipedia.
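For readers who prefer Python, the same idea (decompress as you go, split the stream into one record per wiki page, and pull out the ‘cite journal’ templates) could be sketched roughly as below. This is not the program used for the paper: the names `iter_records` and `cite_journals`, the "<page>" separator and the tiny in-memory sample dump are all assumptions made for illustration, and the standard `bz2` module stands in for the bzcat pipe.

```python
import bz2
import io
import re

CITE_JOURNAL = re.compile(r"(\{\{\s*cite journal.*?\}\})", re.IGNORECASE | re.DOTALL)
TITLE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
WHITESPACE = re.compile(r"\s+")


def iter_records(stream, separator="<page>", chunk_size=1 << 20):
    """Yield text records split on `separator`, reading `stream` chunk by chunk."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # The last piece may be incomplete, so keep it for the next round.
        *records, buffer = buffer.split(separator)
        yield from records
    yield buffer


def cite_journals(stream):
    """Yield (record number, title, template) for each 'cite journal' template."""
    for pagenumber, record in enumerate(iter_records(stream), start=1):
        title = TITLE.search(record)
        if title is None:
            continue  # e.g. the header before the first page
        clean_title = WHITESPACE.sub(" ", title.group(1))
        for template in CITE_JOURNAL.findall(record):
            # Collapse consecutive whitespace, as in the Perl program
            yield pagenumber, clean_title, WHITESPACE.sub(" ", template)


# Hypothetical miniature "dump", compressed in memory so the sketch runs anywhere.
sample = """<mediawiki>
<page><title>Example
 page</title><text>Intro {{cite journal
 | author = A. Author | journal = Nature}} end.</text></page>
<page><title>Empty</title><text>No citations here.</text></page>
</mediawiki>"""
compressed = bz2.compress(sample.encode("utf-8"))

with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as stream:
    results = list(cite_journals(stream))

for pagenumber, title, template in results:
    print("%d: %s: %s" % (pagenumber, title, template))
```

For a real dump you would pass the path of the `-pages-articles.xml.bz2` file to `bz2.open` instead of the in-memory sample. Note that the non-greedy match stops at the first `}}`, so a template nested inside a ‘cite journal’ template would cut the match short.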

Good luck with Wikipedia mining!

Installing Ubuntu on a new Acer Aspire One N450

Installing Linux distributions is not always an easy task. Back in October 2009 I wrote about my problems updating an old computer to Debian Lenny. At home I have a desktop computer that dates back almost to the previous millennium. I needed to upgrade that old and slow computer from Debian oldstable to the present Debian stable. In the course of the upgrade I got into trouble, and the computer is now quite difficult to boot. In 2009 I could bring my portable computer, a Dell D420 laptop, home from work, but since the introduction of the infamous Danish ‘Multimedieskat’ (multimedia tax), taking the computer home now costs around 1,500 Danish crowns each year in extra tax. So I was slightly computer-less and wanted a new computer. I bought an Acer Aspire One netbook with an Intel Atom N450 1.66 GHz processor (3325.31 bogomips), 1 GB RAM, a 160 GB hard disk and a 10.1 inch screen. Several other computer manufacturers had models with almost the same configuration; only a larger 250 GB hard disk and the design seemed to distinguish the models.

In Denmark it is difficult to avoid buying Microsoft Windows when you buy a new computer. Almost all computers are sold with Windows. Presently, Danish overnerd Poul-Henning Kamp is leading a court battle trying to get a refund for an unnecessary Microsoft Windows Vista license that was attached to his Lenovo. The Acer Aspire One that I bought had Microsoft Windows 7 Starter ready for installation.

To see how much effort actually went into installing Linux I timed the process, and here are my notes:

  • 0:00 Unpacking, while trying to find information (i.e. googling) about Linux swap space on the hard disk. Imagining that I should use the standard 2 x the amount of RAM…
  • 0:08 Windows installation.
  • 0:18 Installing Windows 7 Starter. ‘Configure System Settings’. It turns out here that Windows is not preinstalled; instead part of the hard disk is allocated to what in the old days would have been an installation CD.
  • 0:50 Finished installation of Microsoft Windows 7 Starter
  • 1:00 Checked that the external screen worked at 1920×1080. DR Update worked in fullscreen. Slow. Webcam works. The hard disk was sold as 160 GB. Now it reports 122 GB free and 136 GB total. The Microsoft installation has taken up quite a bit.
  • 1:10 Rebooting into Ubuntu from a USB stick with a ‘live CD’ that I ‘burnt’ before Christmas. Didn’t catch BIOS. hmmm… it boots into Windows. Rebooting and pressing DEL or F2 continuously.
  • 1:26 Now booting from Ubuntu 9.10 USB stick. Wireless works. Sound out works. External display works. Not in full resolution. Looking around.
  • 1:42 Ubuntu Install. Keyboard layout set to Danish.
  • 1:45 Partitioning. The Ubuntu installer makes suggestions for partitioning. I see that part of the hard disk is taken up by the Microsoft installation. Deleted everything Microsoft. Ubuntu suggests an ext4 file system and swap (so my googling for information about swap space was not that necessary). I was a bit reluctant here: I erased the entire Microsoft installation, together with the part of the hard disk that held the ‘installation CD’. On my previous computers I installed Windows together with Linux, although I almost didn’t use Windows. Erasing Windows will give me some extra hard disk space, but with ‘no way back’.
  • 2:11 Ubuntu installed and in Ubuntu. External mouse is slow. In 1920×1080. Cannot close the lid and run on the external screen.
  • 2:40 Continue installation. Crash at the install of Emacs. Unstable wireless.
  • 2:56 Mouse now runs better. 212 packages for upgrade. New kernel 2.6.31-17.54.
  • 3:03 Restarting after kernel and grub upgrade.
  • 3:43 Installed cvs, zynaddsubFx. Camera works with the cheese program. Surprisingly the microphone works (with gnome-sound-recorder 2.28.1) even though people on the Internet have reported problems. Installed LaTeX.

So around an hour to install Windows – that I later would erase – and two hours to install (with overhead) a basic version of Ubuntu.

After these installation steps I still have issues to deal with, e.g., setting up secure shell passwords, copying cvs repositories, installing Java, setting the environment variables so Emacs and terminals know where my LaTeX style files are, and setting up the Gnome desktop with multiple workspaces and edge flip.

The installation process went better than my expectations. In the old days setting up the obscure XF86Config could take many days – at least for me.

After a few weeks of experience with the new Acer One running Ubuntu Karmic I have the following observations:

  • It runs around 5 hours on batteries :-) It usually reports under 10 Watts in usage. I wonder if that is due to the new N450 processor?
  • It has no apparent problems in making desktop OpenGL effects on a 1920×1080 screen. :-)
  • It has problems with showing video on external 1920×1080, e.g. YouTube fullscreen. I guess the graphics card is not fast enough.
  • The wireless network is unstable :-( I don’t know what causes this. It usually happens with large downloads, and the log messages “ath9k: DMA failed to stop in 10 ms” are my only lead on this problem.
  • The power button is under the lid, not outside :-( My Dell D420 laptop has the button outside, so when it is attached to an external screen, keyboard and mouse I do not need to open it to switch it on.
  • No ‘network off’ button which might be a problem in an airplane :-(
  • Two of the USB connectors are positioned quite close together :-( Thick USB sticks cannot sit next to each other. There is, however, one connector sitting by itself on the right side of the computer.
  • Switching screens is badly handled. Sometimes I get two black screens. Fn+F5 and ‘skærmindstillinger’ (screen settings) do something, but not necessarily the right thing.
  • The desktop is sometimes unstable, and occasionally Ctrl+Alt+F1 is needed.

And the problems are endless: my upper and lower edge flip on workspaces has now gone. I also tried installing Ubuntu Studio with a realtime kernel, JACK and all that, but that is a longer story.

But all in all I would say the system is reasonably OK.

Together with the netbook I bought a USB computer mouse and a USB qwerty keyboard, the cheapest I could find. The mouse is a small Logitech (M-U0017?) which is OK, though the wire is awfully short, so the computer cannot be far from the table. The keyboard is a compact Logitech ‘Ultra-Flat Keyboard’, and hmmm… I am not that satisfied with it, since the layout of the keys is compact and awkward. There is no space between the keys except between the numerical keys and the rest of the keyboard. For example, the arrow keys have no space around them, so it is difficult to feel where they are. You have to look. ‘Page Up’ and ‘Page Down’ are far away from where they used to be. I have been looking for another keyboard that is both reasonably compact and has a standard layout. It seems that most stores in my area carry the same line of keyboards, either from Microsoft or Logitech, and none of them fully fits my wishes.

Backward matching with regular expressions

# -*- coding: utf-8 -*-

s = r"""I want to do 'backward matching': In my case match a series of
numbers backwards, e.g., with "1 2 3 4" I want to pick out
"2 3 4" rather than "1 2 3" with a regular expression such as
"\d+\s+\d+\s+\d+". Within Python and using re.findall() the
naïve application gets you "1 2 3". One way would be to reverse
the string - in Python with s[::-1] - and write the regular
expression 'backwards'. However, a solution that seems less
complex simply extends the regular expression repeating the
subpattern with a 'match 0 or more times' preamble. In my case
that would read '(?:\d+\s+)*(\d+\s+\d+\s+\d+)'. The code below
behaves properly with the string that this sentence is part of.
"""

import re

# Naive pattern: picks the first three numbers of a series
pattern1 = r"\d+\s+\d+\s+\d+"
print(re.findall(pattern1, s))

# Greedy non-capturing preamble: picks the last three numbers of a series
pattern2 = r"(?:\d+\s+)*(\d+\s+\d+\s+\d+)"
print(re.findall(pattern2, s))
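For comparison, here is a small sketch of the other route mentioned in the text, reversing the string with [::-1], next to the 'preamble' pattern. The test string and the variable names are made up for illustration. In this particular case the pattern reads the same in both directions, so it does not need to be rewritten 'backwards'; a pattern that is not symmetric would.

```python
import re

# Hypothetical test string with two series of numbers
text = 'Pick the last three of "1 2 3 4" and of "10 20 30 40 50".'

# Preamble approach: a greedy non-capturing group swallows the leading numbers,
# so the capturing group ends up holding the last three of each series.
preamble_pattern = r"(?:\d+\s+)*(\d+\s+\d+\s+\d+)"
with_preamble = re.findall(preamble_pattern, text)

# Reversal approach: match on the reversed string, then reverse each match
# and the order of the matches to get back to normal reading direction.
plain_pattern = r"\d+\s+\d+\s+\d+"
reversed_matches = re.findall(plain_pattern, text[::-1])
with_reversal = [match[::-1] for match in reversed_matches][::-1]

print(with_preamble)
print(with_reversal)
```

Both lists should contain the last three numbers of each series. The reversal route needs the extra bookkeeping of un-reversing the results, which is part of why the preamble version seems less complex.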