I may have managed to set up a virtual machine with Vagrant, partially following instructions from the Vagrant homepage.
$ sudo aptitude purge vagrant virtualbox virtualbox-dkms virtualbox-qt
$ locate Vagrantfile
$ rm -r ~/.vagrant.d/
$ rm -r ~/virtual
$ rm ~/Vagrantfile
$ sudo aptitude install vagrant
$ vagrant init hashicorp/precise32
$ vagrant up
There was a problem with the configuration of Vagrant. The error message(s) are printed below:
vm:
* The box 'hashicorp/precise32' could not be found.
$ vagrant box add precise32 http://files.vagrantup.com/precise32.box
$ ls -l .vagrant.d/boxes/precise32
total 288340
-rw------- 1 fnielsen fnielsen 295237632 Oct 3 15:58 box-disk1.vmdk
-rw------- 1 fnielsen fnielsen 14103 Oct 3 15:58 box.ovf
-rw-r--r-- 1 fnielsen fnielsen 505 Oct 3 15:58 Vagrantfile
$ vagrant up
There was a problem with the configuration of Vagrant. The error message(s) are printed below:
vm:
* The box 'hashicorp/precise32' could not be found.
$ vagrant box remove precise32
$ vagrant box add precise http://cloud-images.ubuntu.com/vagrant/precise/current/precise-server-cloudimg-i386-vagrant-disk1.box
$ rm Vagrantfile
$ vagrant init precise
$ vagrant up
$ vagrant ssh
$ uname -a
Linux vagrant-ubuntu-precise-32 3.2.0-69-virtual #103-Ubuntu SMP Tue Sep 2 05:28:41 UTC 2014 i686 i686 i386 GNU/Linux
$ whoami
vagrant
$ sudo aptitude install python-pip
$ sudo pip install numpy
$ sudo aptitude install python-dev
$ sudo pip install numpy
$ python
>>> import numpy
>>> f = open('Hello, virtual world.txt', 'w')
>>> f.write('Hello, virtual world')
>>> f.close()
>>> exit()
$ strings ~/VirtualBox\ VMs/fnielsen_1412345235/box-disk1.vmdk | grep 'Hello, virtual world.txt'
Hello, virtual world.txt
Hello, virtual world.txt
Somewhere in between I erased old VirtualBox files in the “VirtualBox VMs” directory: “rm -r test_1406195091/” and “rm -r pythoner/”.
There are various ways of plotting the distribution of highly skewed (heavy-tailed) data, e.g., with a histogram with logarithmically-spaced bins on a log-log plot, or by generating a Zipf-like plot (rank-frequency plot) like the above. This figure uses token count data from the Brown corpus as made available in the NLTK package.
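The rank-frequency idea can be sketched in a few lines of plain Python. This is only an illustration: the function name rank_frequency and the toy sentence are my own assumptions, and the real figure would use the Brown corpus tokens (for instance from nltk.corpus.brown.words()) instead.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    counts = Counter(token.lower() for token in tokens)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))


# Toy data standing in for the corpus tokens (hypothetical example text).
tokens = "the cat and the dog and the bird".split()
pairs = rank_frequency(tokens)
for rank, frequency in pairs:
    print(rank, frequency)
```

Plotting the ranks against the frequencies with logarithmic scaling on both axes then gives the Zipf-like plot.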
For fitting the Zipf curve, a simple SciPy-based approach is suggested on Stack Overflow by “Evert”. More elaborate power-law fitting is implemented in the Python package powerlaw, described in “Powerlaw: a Python package for analysis of heavy-tailed distributions”, which is based on the Clauset paper.
There is this funny thing with Python that allows you to have static variables in functions by putting a mutable object as the default argument.
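A minimal sketch of the trick, with a made-up function name (count_calls) and a dictionary as the mutable default:

```python
def count_calls(_state={"calls": 0}):
    """Count how often the function has been called.

    The dictionary is created once, when the function is defined,
    and then shared between all calls, so it acts like a static variable.
    """
    _state["calls"] += 1
    return _state["calls"]


first = count_calls()
second = count_calls()
print(first, second)
```

The state survives between the calls because both calls see the very same dictionary object.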
In Ruby, default arguments are evaluated each time the function is called (I am told), so you can make recursive calls with two Ruby functions calling each other through their default input arguments:
Ruby complains that the stack level becomes too deep.
In Python the default argument is evaluated once, when the function is defined, so the result of calling one of the Python functions will be different from calling one of the Ruby functions.
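The definition-time evaluation on the Python side can be made visible with a small experiment. The names below (make_default, append_item, evaluations) are hypothetical and only serve the illustration:

```python
evaluations = []


def make_default():
    """Record every time the default expression is evaluated."""
    evaluations.append("evaluated")
    return []


# make_default() runs here, at definition time, and never again.
def append_item(item, bucket=make_default()):
    bucket.append(item)
    return bucket


a = append_item(1)
b = append_item(2)
print(a, b, len(evaluations))
```

Both calls return the same list object, and the default expression has been evaluated only once. The usual way to get a fresh default per call is to use None as the default and create the list inside the function body.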
I am still on the lookout for a good database system: movable, big, concurrent, fast, flexible and not necessarily requiring root access.
MySQL, good in many aspects, lacks flexibility: an ALTER TABLE can take hours. MongoDB has a 2GB size limit on 32-bit. For some reason I thought that SQLite was limited to 2GB on 32-bit (where on earth did I get that idea from?), but SQLite can potentially store 140 terabytes. It may be limited by the OS or the filesystem. So what is that limit? The 32-bit ext3 file size limit ranges from 16GiB to 2TiB depending on the block size, says Wikipedia. Apparently my block size is 4KiB (reported with $ sudo /sbin/dumpe2fs /dev/sda7 | grep "Block size"), so if we can trust this online encyclopedia that anyone can edit, it may be that I can have 2TiB SQLite databases.
SQLite still has the ALTER TABLE problem, but my first attempt used SQLite as a key-value store with the values as JSON. News on Wikipedia also reports that Mr. Hipp is working on the document-oriented UnQLite.
I was also considering the Python key-value store ‘shelve’ and its underlying databases (e.g., bsddb). However, somewhere in the documentation you can read that “The shelve module does not support concurrent read/write access”. I was slightly surprised by how wrong it goes when I executed the code below.
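Separately from the shelve experiment, the idea of using SQLite as a key-value store with JSON values could look roughly like the sketch below. It uses only the standard sqlite3 and json modules; the class name JsonStore, the table name and the example keys are my own assumptions for illustration and not the code from my first attempt.

```python
import json
import sqlite3


class JsonStore:
    """Tiny key-value store: string keys, JSON-serialized values in SQLite."""

    def __init__(self, path=":memory:"):
        self.connection = sqlite3.connect(path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS store "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def put(self, key, value):
        # The 'with' block commits on success and rolls back on error.
        with self.connection:
            self.connection.execute(
                "INSERT OR REPLACE INTO store (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )

    def get(self, key, default=None):
        row = self.connection.execute(
            "SELECT value FROM store WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])


store = JsonStore()  # in-memory database; pass a filename to persist
store.put("paper:1", {"title": "Example", "authors": ["A", "B"]})
print(store.get("paper:1"))
```

Because the values are opaque JSON, new fields can be added to a record without any ALTER TABLE, at the price of not being able to query the fields with plain SQL.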
CherryPy is a Python-based web framework enabling you to make a dynamic web service without much setup and configuration. It comes with its own web server, and a “Hello, World” can be constructed in six lines. The default setup might not be that fast, but it may be possible to speed it up, see Running CherryPy behind Apache using Mod_WSGI. I haven’t tried that.
Another Python-based web framework is Tornado. Its “Hello, World” is around 17 lines. Below I have listed the results for the default Tornado and CherryPy “Hello, World” based on ab, the Apache HTTP server benchmarking tool. It seems that Tornado works well with concurrent connections, being considerably faster than CherryPy, and on non-concurrent requests Tornado is around twice as fast.
I was looking for a measure of how clustered a network is. I thought that somewhere in the graph spectrum was a good place to start and that the Python package NetworkX would have some useful methods. However, I couldn’t immediately see any good methods in NetworkX. Then Morten Mørup mentioned something about community detection and modularity and I got sidetracked, but now I am back again at the graph spectrum.
The second smallest eigenvalue of the Laplacian matrix of the graph seems to represent reasonably well what I was looking for. Apparently that eigenvalue is called the algebraic connectivity.
NetworkX has a number of graph generators, and for small test cases the algebraic connectivity seems to give an OK value for how clustered the network is, or rather how non-clustered it is.
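A minimal sketch of the computation, assuming NetworkX (version 2 or later, for to_numpy_array) and NumPy are installed. The helper name algebraic_connectivity is my own; recent NetworkX versions also ship a function nx.algebraic_connectivity that could be used instead.

```python
import networkx as nx
import numpy as np


def algebraic_connectivity(graph):
    """Second smallest eigenvalue of the graph Laplacian L = D - A."""
    adjacency = nx.to_numpy_array(graph)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    eigenvalues = np.linalg.eigvalsh(laplacian)
    return float(eigenvalues[1])


# Two small test cases from the NetworkX graph generators.
complete = algebraic_connectivity(nx.complete_graph(5))
barbell = algebraic_connectivity(nx.barbell_graph(5, 0))
print(complete, barbell)
```

The complete graph, with no cluster structure at all, gets a high value, while the barbell graph, two dense clusters joined by a single edge, gets a value close to zero.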
Last year I acted as one of the reviewers on a book from Packt Publishing: The NumPy 1.5 Beginner’s Guide (ISBN 13: 978-1-84951-530-6) about the numerical programming library for the Python programming language. I was “blinded” by the publisher, so I did not know that the author was Ivan Idris before the book came out. For my reviewing effort I got a physical copy of the book, an electronic copy of another book and some new knowledge of certain aspects of NumPy.
Two things that I did not know before I came across them while reviewing the book were the date formatter in the plotting library (matplotlib) and the ability to download stock quotes via a single function in matplotlib (there is an example starting on page 171 in the book). There is a ‘candlestick’ plot function that goes well with the return value of the quotes download function. The plot shows an example of the use of date formatting with stock quotes downloaded from Yahoo! via matplotlib, together with sentiment analysis of Wikipedia revisions of the article about the Pfizer company.
In my effort to beat the SentiStrength text sentiment analysis algorithm by Mike Thelwall I came up with a low-hanging-fruit killer approach, or so I thought. Using the standard movie review data set of Bo Pang available in NLTK (used in research papers as a benchmark data set) I would train an NLTK classifier, compare it with my valence-labeled word list AFINN and readjust its weights for the words a little.
What I found, however, was that for a great number of words the sentiment valence in my AFINN word list and the classifier probability trained on the movie reviews were in disagreement. A word such as ‘distrustful’ I have as a quite negative word. However, the classifier reports the probability for ‘positive’ to be 0.87, i.e., quite positive. I examined where the word ‘distrustful’ occurred in the movie review data set:
$ egrep -ir "\bdistrustful\b" ~/nltk_data/corpora/movie_reviews/
The word ‘distrustful’ appears 3 times and in all cases in a ‘positive’ movie review. The word is used to describe elements of the narrative or an outside reference rather than the quality of the movie itself. Another word that I have as negative is ‘criticized’. It is used 10 times in the positive movie reviews (and never in the negative ones). I find one negation (‘the casting cannot be criticized’), but mostly the word appears in contexts where the reviewer criticizes the critique of others, e.g., ‘many people have criticized fincher’s filming […], but i enjoy and relish in the portrayal’.
The top 15 ‘misaligned’ words using my ad hoc metric are listed here:
It seems that reviewers are interested in movies that have a certain amount of ‘melancholy’, ‘anger’, distrustfulness and (further down the list) scandal, apathy, hoax, struggle, hopelessness and hindrance. Whereas smile, amusement, peacefulness and gratefulness are associated with negative reviews. So are movie reviewers unempathetic schadefreudians entertained by the characters’ misfortune? Hmmm…? It reminds me of journalism where they say “a good story is a bad story”.
So much for philosophy, back to reality:
The words (such as ‘hurrah’) that have a classifier probability of 0.25 or 0.75 typically occur only once each in the corpus. In this application of the classifier I should perhaps have used a stronger prior probability, so that ‘hurrah’ with 0.25 would end up around the middle of the scale, with a probability near 0.5. I haven’t checked whether it is possible to readjust the prior in the NLTK naïve Bayes classifier.
The conclusion on my Thelwallizer is not good. A straightforward application of the classifier on the movie reviews gets you features that reflect the summary of the narrative rather than the movie per se, so this simple approach is not particularly helpful for readjusting the weights.
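The effect of the prior can be illustrated with simple additive smoothing for a word counted in two equally large classes. This is a simplified sketch under my own assumptions (function name, two balanced classes, symmetric pseudocount) and not necessarily how NLTK computes its probabilities internally, but a pseudocount of 0.5 does reproduce the 0.25 and 0.75 values for a word seen only once.

```python
def smoothed_positive_probability(n_positive, n_negative, pseudocount=0.5):
    """Estimate P(positive | word) with a symmetric pseudocount as prior."""
    total = n_positive + n_negative
    return (n_positive + pseudocount) / (total + 2 * pseudocount)


# A word seen once, in a negative review only.
weak_prior = smoothed_positive_probability(0, 1, pseudocount=0.5)
strong_prior = smoothed_positive_probability(0, 1, pseudocount=5)
print(weak_prior, round(strong_prior, 3))
```

With the larger pseudocount a single observation moves the estimate only a little away from 0.5, while words with many observations are hardly affected.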
However, there is another way the trained classifier can be used. Examining the most informative features I can ask if they exist in my AFINN list. The first few missing words are: slip, ludicrous, fascination, 3000, hudson, thematic, seamless, hatred, accessible, conveys, addresses, annual, incoherent, stupidity, … I cannot use ‘hudson’ in my word list, but words such as ludicrous, seamless and incoherent are surely missing.
(28 January 2012: Look out in the code below! The way the features are constructed for the classifier is troublesome. In NLTK you should not only mark the words that appear in the text with ‘True’; you should normally also explicitly mark the words that do not appear in the text with ‘False’. Not mentioning words in the feature dictionary might be bad, depending on the application.)
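A feature extractor that follows this advice could look like the sketch below. The function name, the ‘contains(…)’ key format and the tiny vocabulary are assumptions for illustration; the resulting dictionaries are of the kind that NLTK classifiers take as feature sets.

```python
def document_features(document_words, vocabulary):
    """Map every vocabulary word to True or False for one document."""
    present = set(document_words)
    return {"contains(%s)" % word: (word in present) for word in vocabulary}


# Hypothetical vocabulary and document.
vocabulary = ["ludicrous", "seamless", "incoherent"]
features = document_features(["a", "seamless", "plot"], vocabulary)
print(features)
```

Absent words are now recorded explicitly as False instead of being left out of the dictionary.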
Some days ago the world press was abuzz with the study on the Facebook friend graph that found the average distance between active Facebook users to be 4.74, i.e., almost 5, meaning that there are on average 4 Facebook-linked friends separating one Facebook user from another. See also the brief summary on the Brede Wiki.
There are standard algorithms on the shelf to compute the distance for small graphs, but because the Facebook graph is so huge you/they need special algorithms. First author Lars Backstrom, employed at Facebook (who gave a keynote at the 8th Extended Semantic Web Conference about the Facebook friend suggester), had the Facebook data and got hold of an algorithm from Milano researchers that could handle the 721 million active Facebook users and their 69 billion friendship links. In a previous study the Milano researchers examined the “spid”, i.e., the variance-to-mean ratio of the distances. They claim that “spid larger than one are to be considered ‘web-like’, whereas networks with a spid smaller than one are to be considered ‘properly social’” and demonstrated that on a number of social and web networks. The Facebook study found a spid of 0.08.
I am somewhat confused by the notion of six degrees of separation. Firstly, does “degrees of separation” mean the number of persons (nodes) or the number of friendships (edges) between a source person and a target person? Backstrom & Co. “will assume that ‘degree of separation’ is the same as ‘distance minus one’.”, that is, we are counting the persons (nodes) between source person and target person. Another problem is whether the “six” refers to
- the average distance between all pairs,
- the maximum of the average distance for each person,
- the maximum distance between all pairs (the diameter), or
- the average eccentricity; the eccentricity being the maximum distance for each person to any other person.
If you look at the first sentence of the present version of the Wikipedia article, I think it alludes to the first interpretation. Playwright John Guare’s six degrees seem rather to be the third interpretation.
With the co-authorship graph from the Brede Wiki I can compute these different distances. The co-author graph is not fully connected, but the largest connected component has 665 names, which resolve to somewhat fewer than 665 people (I have uncorrected problems with, e.g., “Saffron A. Willis-Owen”/“Saffron A. G. Willis-Owen”). On this graph I find the mean distance to be 5.65, the mean eccentricity to be 9.37 and the diameter 12. Computing the spid I get 0.73, i.e., a “social network” according to the Milano criterion.
I wonder why the average Facebook distance is so low. Jon Kleinberg mentions “weak ties”. Some of my Facebook friends are linked to public figures in Denmark. Could it be that Facebook users tend to connect with famous persons and that these famous people tend to act as hubs? Another phenomenon that I think I have noticed on Facebook is that when people travel abroad and make a cursory acquaintance they tend to friend each other on Facebook, perhaps as a kind of token and reminder. Are such brief encounters actually there, and are they important for the low average distance?
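For a small connected graph the different distance measures can be computed with a plain breadth-first search from every node. The sketch below is only an illustration with standard-library Python: the function names and the toy path graph are my own, and I assume that the spid uses the population variance of the distances over all ordered pairs.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path lengths from source in an unweighted graph."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def distance_summary(graph):
    """Distance statistics for a connected graph given as adjacency lists."""
    all_distances, eccentricities, node_means = [], [], []
    for source in graph:
        found = bfs_distances(graph, source)
        lengths = [d for node, d in found.items() if node != source]
        all_distances.extend(lengths)
        eccentricities.append(max(lengths))
        node_means.append(sum(lengths) / len(lengths))
    mean = sum(all_distances) / len(all_distances)
    variance = sum((d - mean) ** 2 for d in all_distances) / len(all_distances)
    return {
        "mean_distance": mean,
        "max_node_mean": max(node_means),
        "diameter": max(eccentricities),
        "mean_eccentricity": sum(eccentricities) / len(eccentricities),
        "spid": variance / mean,  # variance-to-mean ratio of the distances
    }


# Toy graph: a path a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
summary = distance_summary(graph)
print(summary)
```

Running one search per node is fine for a graph with a few hundred names, but it is exactly what does not scale to a graph the size of Facebook, hence the need for the approximate algorithm from the Milano researchers.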
(2012-01-16: Language correction)
I have previously written about network mining in a co-author graph in connection with the actress Natalie Portman and her NeuroImage article, and I have recently written about co-author mining with the data in the Brede Wiki. Now, with the data from the Brede Wiki and NetworkX, it is quite easy to find the shortest path between authors once the co-author data is represented in a NetworkX object. It is just a one-liner in Python:
path = nx.shortest_path(g, u"Finn Årup Nielsen", "Natalie Herslag")
Once printed with “for author in path: print(author)” you get:
Finn Årup Nielsen
Bruce M. Cohen
Abigail A. Baird
I presently have to misspell Natalie Portman’s surname because I entered it incorrectly in the Brede Wiki for some reason.
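For readers without the Brede Wiki data, the same call can be tried on a toy co-author graph. The author names and edges below are made up for illustration; only nx.Graph, add_edges_from and nx.shortest_path are actual NetworkX functions.

```python
import networkx as nx

# Hypothetical co-authorships: an edge means two authors share a paper.
g = nx.Graph()
g.add_edges_from([
    ("Author A", "Author B"),
    ("Author B", "Author C"),
    ("Author C", "Author D"),
    ("Author A", "Author E"),
])

path = nx.shortest_path(g, "Author A", "Author D")
for author in path:
    print(author)
```

The returned list includes both the source and the target author, so the number of co-authorship links between them is len(path) - 1.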