python

Virtual machine for Python with Vagrant

Posted on Updated on

I may have managed to setup a virtual machine with Vagrant, partially following instructions from the vagrant homepage.

$ sudo aptitude purge vagrant virtualbox virtualbox-dkms virtualbox-qt
$ locate Vagrantfile
$ rm -r ~/.vagrant.d/
$ rm -r ~/virtual
$ rm ~/Vagrantfile 

$ sudo aptitude install vagrant
$ vagrant init hashicorp/precise32
$ vagrant up
There was a problem with the configuration of Vagrant. The error message(s)
are printed below:

vm:
* The box 'hashicorp/precise32' could not be found.

$ vagrant box add precise32 http://files.vagrantup.com/precise32.box
$ ls -l .vagrant.d/boxes/precise32
total 288340
-rw------- 1 fnielsen fnielsen 295237632 Oct  3 15:58 box-disk1.vmdk
-rw------- 1 fnielsen fnielsen     14103 Oct  3 15:58 box.ovf
-rw-r--r-- 1 fnielsen fnielsen       505 Oct  3 15:58 Vagrantfile

$ vagrant up
There was a problem with the configuration of Vagrant. The error message(s)
are printed below:

vm:
* The box 'hashicorp/precise32' could not be found.

$ vagrant box remove precise32
$ vagrant box add precise http://cloud-images.ubuntu.com/vagrant/precise/current/precise-server-cloudimg-i386-vagrant-disk1.box
$ rm Vagrantfile
$ vagrant init precise
$ vagrant up
$ vagrant ssh 
$ uname -a
Linux vagrant-ubuntu-precise-32 3.2.0-69-virtual #103-Ubuntu SMP Tue Sep 2 05:28:41 UTC 2014 i686 i686 i386 GNU/Linux
$ whoami
vagrant


$ sudo aptitude install python-pip
$ sudo pip install numpy
$ sudo aptitude install python-dev
$ sudo pip install numpy
$ python
>>> import numpy
>>> f = open('Hello, virtual world.txt', 'w')
>>> f.write('Hello, virtual world')
>>> f.close()
>>> exit()
$ strings ~/VirtualBox\ VMs/fnielsen_1412345235/box-disk1.vmdk | grep 'Hello, virtual world.txt'
Hello, virtual world.txt
Hello, virtual world.txt

Somewhere inbetween I erased old Virtualbox files in “VirtualBox VMs” directory: “rm -r test_1406195091/” and “rm -r pythoner/”.

Advertisements

Zipf plot for word counts in Brown corpus

Posted on

Image

There are various ways of plotting the distribution of highly skewed (heavy-tailed) data, e.g., with a histogram with logarithmically-spaced bins on a log-log plot, or by generating a Zipf-like plot (rank-frequency plot) like the above. This figure uses token count data from the Brown corpus as made available in the NLTK package.

For fitting the Zipf-curve a simple Scipy-based approach is suggested on Stackoverflow by “Evert”. More complicated power-law fitting is implemented on the Python package powerlaw described in Powerlaw: a Python package for analysis of heavy-tailed distributions that is based on the Clauset-paper.

You can’t fool Python

Posted on

There is this funny thing with Python that allows you to have static variables in functions by putting a mutable object as the default argument.

In Ruby default arguments are evaluated each time the function is called (I am told), so you can make recursive calls with two ruby functions calling each other with the default input arguments:

Ruby complains that the stack level becomes too deep.

In Python the default argument is evaluated once when the function is defined, so the result of calling one of the Python functions will be different than calling one of the Ruby functions.

Small solutions for big data and python shelve concurrency

Posted on Updated on

I am still on the lookout for a good database system: Movable, big, concurrent, fast, flexible and not necessarily requiring root access.

MySQL, good in many aspects, lacks flexibility: An ALTER TABLE can take hours.

MongoDB has a 2GB size limit on 32-bit.

For some reason I thought that SQLite was limited to 2GB on 32-bit (where on earth did I get that idea from?). But SQLite can potential store 140 terabytes. It may be limited by OS/filesystem. So what is that? 32-bit ext3 file size limit is from 16GiB to 2TiB says Wikipedia. Apparently my block sizes are 4KiB (reported with $ sudo /sbin/dumpe2fs /dev/sda7 | grep “Block size”), so if we can trust this online encyclopedia that anyone can edit it may be that I can have 2TiB SQLite databases. SQLite still has the ALTER TABLE problem, but my first attempt used SQLite as a key-value store with the values as JSON. News on Wikipedia also reports that Mr. Hipp is working on document-oriented UnQLite.

I was also considering the Python key-value store ‘shelve’ and its underlying databases (e.g., bsddb). However, somewhere in the documentation you can read that “The shelve module does not support concurrent read/write access”. I was slightly surprised by how wrong it goes when I executed the code below.

CherryPy vs Tornado benchmarking

Posted on Updated on

CherryPy is a Python-based web framework enabling you to make a dynamic web service without much setup and configuration. It comes with its own web server and a “Hello, World” can be constructed in six lines. The default setup might not be that fast, but it may be possible to speed it up, see Running CherryPy behind Apache using Mod_WSGI. I haven’t tried that.

Another Python-based web framework is Tornado. Its “Hello, World” is around 17 lines.

Below I have listed the results with Tornado and CherryPy default “Hello, World” based on ab, – Apache HTTP server benchmarking tool.

It seems that Tornado works well with concurrent connections being considerably faster than CherryPy, and on non-concurrent requests Tornado is around double as fast.

Graph spectra with NetworkX

Posted on Updated on

Nielsen2012python_graphspectru

I was looking for a value of how clustered a network is. I thought that somewhere in graph spectrum was a good place to start and that in the Python package NetworkX there would be some useful methods. However, I couldn’t immediately see any good methods in NetworkX. Then Morten Mørup mentioned something about community detection and modularity and I became diverged, but now I am back again at the graph spectrum.

The second smallest eigenvalue of the Laplacian matrix of the graph seems to represent reasonably well what I was looking for. Apparently that eigenvalue is called the Algebraic connectivity.

NetworkX has a number of graph generators, and for small test cases the algebraic connectivity seems to give an ok value for how clustered the network is, – or rather how non-clustered it is.

NumPy beginner’s guide: Date formatting, stock quotes and Wikipedia sentiment analysis

Posted on Updated on

Nielsen2012numpy

Last year I acted as one of the reviewers on a book from Packt Publishing: The NumPy 1.5 Beginner’s Guide (ISBN 13 : 978-1-84951-530-6) about the numerical programming library in the Python programming language. I was “blinded” by the publisher, so I did not know that the author was Ivan Idris before the book came out. For my reviewing effort I got a physical copy of the book, an electronic copy of another book and some new knowledge of certain aspects of the NumPy.

One of the things that I did not know before I came across it while reviewing the book was the date formatter in the plotting library (matplotlib) and the ability to download stock quotes via a single function in the NumPy library (there is an example starting on page 171 in the book). There is a ‘candlestick’ plot function that goes well with the return value of the quotes download function.

The plot shows an example of the use of date formatting with stock quotes downloaded from Yahoo! via NumPy together with sentiment analysis of Wikipedia revisions of the Pfizer company.