Month: September 2010

Not to match with regular expression?


Apparently, it is not all that easy to use Perl-style regular expressions to match strings that do not contain a given combination of characters.

One case where this is relevant is Twitter, where you want to find “mentions” in the tweet text. Mentions are indicated with “@user”. If you do not want to count retweets as “mentions” you need to exclude them. Retweets are usually indicated with “RT @user”. In this case you want to find instances of “@user” that are not preceded by “RT” or any of its variants, e.g., “RT:”. The problem occurs in the article Want to be retweeted? Large scale Analytics on factors impacting retweet in Twitter network. See also my previous post Twitter retweet analysis.

My first attempts on the non-matching problem with Python re module are here:

>>> import re
>>> re.findall(r"(?:\bRT:?\s*){0}@(\w+)", "@anders RT @bjarne")
['anders', 'bjarne']
>>> re.findall(r"(?:RT:?\s*){0,0}@(\w+)", "@anders RT @bjarne")
['anders', 'bjarne']

Here the task is to match “anders” and not “bjarne”, and neither attempt succeeds: a group repeated zero times matches the empty string, so it excludes nothing. The perlre manual turns out to be of some help. There is the “zero-width negative look-ahead”, which is written “(?!pattern)”. What you want is, however, a negative look-behind. That one is written “(?<!pattern)”. However, the pattern inside a look-behind must have a fixed width. So you could write the following regular expression, which is not perfect but covers quite a good percentage of tweets:

>>> re.findall(r"(?<!\bRT )@(\w+)", "@anders RT @bjarne")
['anders']

It is not easy to circumvent the fixed-width problem. The following two examples won’t work:

>>> re.findall(r"(?<!\bRT)\s*@(\w+)", "@anders RT @bjarne")
['anders', 'bjarne']
>>> re.findall(r"(?:(?<!\bRT )|(?<!\bRT: ))@(\w+)", "@anders RT @bjarne")
['anders', 'bjarne']
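The second attempt fails because an alternation of negative look-behinds succeeds as soon as one of them is satisfied, and “RT ” and “RT: ” can never both precede the same “@”. Writing the look-behinds back to back instead requires all of them to hold. The following is only a sketch: the list of prefix variants is my assumption and is certainly not exhaustive, and the name `chained_mentions` is made up for illustration.

```python
import re

# One fixed-width negative look-behind per assumed retweet prefix variant.
# Placed back to back, every one of them must hold for the "@" to match.
NOT_RETWEETED_MENTION = re.compile(
    r"(?<!\bRT )(?<!\bRT: )(?<!\bRT)(?<!\bRT:)@(\w+)"
)


def chained_mentions(text):
    """Return mentioned user names not preceded by a listed RT variant."""
    return NOT_RETWEETED_MENTION.findall(text)


print(chained_mentions("@anders RT @bjarne RT: @carl RT@dorte"))
```

Each variant still has to be spelled out with its exact width, so prefixes with a variable amount of whitespace are not covered.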

Inspired by the perlre manual and its suggestion “if (/bar/ && $` !~ /foo$/)” you can do something similar with two regular expressions:

>>> [s[1:] for s in re.findall(r"((?:\bRT:?\s*)?@\w+)", "@anders RT @bjarne") if not re.match(r"^RT", s)]
['anders']

Not necessarily pretty.
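A closer analogue of the perlre suggestion is to find every “@user” and then test the text in front of it with a second regular expression anchored at the end, which sidesteps the fixed-width restriction. This is a minimal sketch, not a tested tweet parser: the function name `find_mentions` and the exact prefix pattern are my own assumptions.

```python
import re

MENTION = re.compile(r"@(\w+)")
# An "RT" marker, optionally followed by ":" and any amount of whitespace,
# at the very end of the text that precedes a mention.
RETWEET_PREFIX = re.compile(r"\bRT:?\s*$")


def find_mentions(text):
    """Return mentioned user names, skipping those introduced by RT."""
    mentions = []
    for match in MENTION.finditer(text):
        preceding = text[:match.start()]  # plays the role of Perl's $`
        if not RETWEET_PREFIX.search(preceding):
            mentions.append(match.group(1))
    return mentions


print(find_mentions("@anders RT @bjarne"))
print(find_mentions("@anders RT: @bjarne RT    @carl @dorte"))
```

Because the second pattern is an ordinary search rather than a look-behind, it can use “\s*” and so handles any amount of whitespace after “RT”.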


Part-of-speech tagging with NLTK in Python and "Jensen and Sons"


One old joke is about Mr. Jensen, who has a company with his two sons. Mr. Jensen orders a new sign to put up in front of his shop displaying “Jensen and Sons”. The sign painter sends Mr. Jensen a draft. However, Mr. Jensen is not satisfied with the position of the letters and he writes back to the painter: “I would like to have more space between Jensen and and and and and Sons”.

Our course in Python programming (Technical University of Denmark course 02820) is presently running. One of the topics of the course is text and web mining with the associated slides available.

The Natural Language Toolkit (NLTK) Python module provides convenient support for some text mining approaches. Part-of-speech tagging (POS-tagging) is part of NLTK and it is fairly straightforward to use. Working on the joke, the Python code goes like this:

>>> import nltk 
>>> s = "I would like to have more space between Jensen and and and and and Sons"
>>> words = nltk.word_tokenize(s)
>>> nltk.pos_tag(words) 
[('I', 'PRP'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'), ('have', 'VB'), ('more', 'JJR'), ('space', 'NN'), ('between', 'IN'), ('Jensen', 'NNP'), ('and', 'CC'), ('and', 'CC'), ('and', 'CC'), ('and', 'CC'), ('and', 'CC'), ('Sons', 'NNPS')]

‘And’ is tagged as a ‘coordinating conjunction’. I am not sure NLTK sees the joke there.
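As a small follow-up sketch, the tagged output can be scanned for runs of identical tags. The list below is copied from the output shown above so that the snippet runs without NLTK installed; the helper name `longest_tag_run` is made up for illustration.

```python
from itertools import groupby

# Tagger output copied from the session above.
tagged = [('I', 'PRP'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'),
          ('have', 'VB'), ('more', 'JJR'), ('space', 'NN'),
          ('between', 'IN'), ('Jensen', 'NNP'), ('and', 'CC'),
          ('and', 'CC'), ('and', 'CC'), ('and', 'CC'), ('and', 'CC'),
          ('Sons', 'NNPS')]


def longest_tag_run(pairs):
    """Return (tag, length) for the longest run of consecutive equal tags."""
    runs = [(tag, len(list(group)))
            for tag, group in groupby(pairs, key=lambda pair: pair[1])]
    return max(runs, key=lambda run: run[1])


print(longest_tag_run(tagged))
```

Five coordinating conjunctions in a row is at least a hint that something unusual is going on in the sentence.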

Visiting Amsterdam 2010


Going to Amsterdam? There are alternatives to Anne and Vincent. I recently went to Amsterdam and found that the old guidebooks (from 2000 and 2009) weren’t quite up to date and that several museums are under reconstruction, so an update for prospective tourists might be warranted.

I didn’t visit the Rijksmuseum, but it seems to be undergoing some kind of repair. The contemporary art museum Stedelijk is also being rebuilt. A new wing is not yet finished, but you can get into the old part for the price of some euros. On the ground floor I first felt cheated: one part of the exhibition just shows mostly empty rooms so we can have “an opportunity to directly experience the luminous, gracefully proportioned gallery spaces themselves”, as they write. Well, at least you can look at the other tourists. Upstairs there are, however, some showoffs. One Barbara Kruger has decorated a room with big black and white words on the walls and the floor. Job Koelewijn has stolen a visual illusion which I believe is called rotating snakes and which, as far as I understand, works by the peripheral drift illusion. The original thing about Koelewijn’s work is that he made it on the floor of the museum with colored sand. One guidebook (if I remember correctly) indicated that a temporary Stedelijk was located around Oosterdok, but no Stedelijk could be found there. Stedelijk was where it usually is. The Nederlands Scheepvaart Museum was closed.

There are several attractions with no entry fee. The new fancy library east of the central station, designed by Jo Coenen, has a great view over Amsterdam, cosy chairs and a posh interior. The NEMO science museum has a staircase roof, also with a good view. Also in the area of Oosterdok is the new concert hall Muziekgebouw designed by 3XN. It had a small (temporary?) exhibition of strange sound-producing art installations. The path under one of the bridges towards the new concert hall also has a sound installation. One small, possibly interesting museum, whose name and location I don’t recall, was closed.

Upstairs Pannenkoekenhuis at Grimburgwal 2 is lonelyplaneted and a very small pancake-serving café. Morlang café restaurant at Keizersgracht 451 could provide lunch. The canals are still there.

(24/28 september 2010: Typo correction)

Brede Database in Berlin


I recently gave a talk in St. Hedwig-Krankenhaus about the Brede Database and the related Brede Wiki and Brede Toolbox.

A PDF of the slides is available here.

This is a test: Python program in Posterous with Markdown


I am using Posterous as the blogging software and sometimes I include computer code in the blog post. Posterous doesn’t like that. I have included code in the ‘’ tag, but Posterous formats that in an awkward way. My previous post on Twitter retweet analysis was formatted wrongly. Posterous claims to support the Markdown language, so I tried to edit and insert the markdown tag in the raw HTML, but then it went completely wrong: My code and results were erased!

Now I will try to include the erased code in this blog, submitted by email and formatted with markdown. According to some markdown documentation, code needs to be indented at least four characters, so that is what I will do:

    from __future__ import division
    import pymongo
    from re import compile, search, IGNORECASE, UNICODE

    connection = pymongo.Connection()
    db = connection.twitter
    tweets = db.tweets
    pattern_url = compile(r"http://", IGNORECASE)
    stringpatterns_retweet = [ r"^RT @", r"^RT\b", r"\bRT\b" ]
    patterns_retweet = [ compile(s, UNICODE) for s in stringpatterns_retweet ]
    total = 0
    withurls = 0
    retweets = [ 0 ] * len(stringpatterns_retweet)
    retweets_withurls = [ 0 ] * len(stringpatterns_retweet)
    for tweet in tweets.find({"delete": {"$exists": False}}):
        total += 1
        if search(pattern_url, tweet.get('text', '')):
            withurls += 1
            urlpresent = True
        else:
            urlpresent = False
        for n in range(len(patterns_retweet)):
            if search(patterns_retweet[n], tweet.get('text', '')):
                retweets[n] += 1
                if urlpresent:
                    retweets_withurls[n] += 1
        if not total % 10000:
            print("""
    Total         %23d    100.0%%
    With URLs     %23d    %5.1f%%""" % (total, withurls, 100*withurls/total))
            for n in range(len(patterns_retweet)):
                print("""Retweet %20s  %7d    %5.1f%% of total
    Retweet with URLs %10s  %7d    %5.1f%% of total
                                             %5.1f%% of retweets
                                             %5.1f%% of tweets with URLs""" % (
                        stringpatterns_retweet[n], retweets[n],
                        100*retweets[n]/total,
                        stringpatterns_retweet[n],
                        retweets_withurls[n], 100*retweets_withurls[n]/total,
                        100*retweets_withurls[n]/retweets[n],
                        100*retweets_withurls[n]/withurls))

Test post


Hallo

  • One
  • Two

 

# Author
1 Frederik
2 Nanna

Twitter retweet analysis


With Professor Lars Kai Hansen I am presently looking into retweeting on Twitter. A 2010 scientific article, Want to be retweeted? Large scale Analytics on factors impacting retweet in Twitter network by Suh, Hong, Pirolli and Ed H. Chi, examined what effect the variables hashtag, “@”, number of followers, number of followees, age of account, number of tweets and number of favorited tweets have on whether a tweet is retweeted.

The article also points to Dan Zarrella’s previous writings. He has a blog as well as the slides The Science of ReTweets. Zarrella reports (on page 11 in the slides) statistics on the fraction of retweets with URLs, and it is well over 50%; Suh & Co. write that it is 56.69%, to be exact.

This fraction does not fit with what Suh & Co. find. They say only 28.4% of retweets have URLs.

To investigate this discrepancy I looked into the tweets I had downloaded. The tweets were downloaded with the streaming method provided by Twitter that I heard of through Bjarne Ørum Wahlgreen. I am furthermore using the MongoDB NoSQL database for storage at the moment (I used SQLite before). It means that you can write the downloading and storing in one Unix line, which is (with a tip from Eliot):

curl http://stream.twitter.com/1/statuses/sample.json -u: | mongoimport -d twitter -c tweets

I have only a bit above 330,000 tweets in my database at the moment, but my results align better with Suh & Co. than with Zarrella. The result depends on how a retweet is matched. With my broadest retweet pattern I get 25.2%.

Furthermore, I find that the fraction of tweets with URLs is 19.1%, which is in line with Zarrella and Suh & Co., who both report around 20%. I find the fraction of retweets to the total to be in the range 9-16%.

The detailed results are here:

Total                       330000  100.0%
With URLs                    62901   19.1%
Retweet ^RT @                30950    9.4% of total
Retweet with URLs ^RT @       7692    2.3% of total
                                     24.9% of retweets
                                     12.2% of tweets with URLs
Retweet ^RT\b                31516    9.6% of total
Retweet with URLs ^RT\b       7879    2.4% of total
                                     25.0% of retweets
                                     12.5% of tweets with URLs
Retweet \bRT\b               52633   15.9% of total
Retweet with URLs \bRT\b     13239    4.0% of total
                                     25.2% of retweets
                                     21.0% of tweets with URLs
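The percentages follow directly from the counts. As a quick check, the sketch below recomputes them from the numbers in the table; the variable and function names are my own and the counts are simply copied from above.

```python
# Counts copied from the table above.
total = 330000
with_urls = 62901
# pattern -> (retweets, retweets that contain a URL)
counts = {
    r"^RT @": (30950, 7692),
    r"^RT\b": (31516, 7879),
    r"\bRT\b": (52633, 13239),
}


def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


print("With URLs: %.1f%% of total" % percent(with_urls, total))
for pattern, (retweets, retweets_with_urls) in counts.items():
    print("%-8s %5.1f%% of total, %5.1f%% of retweets have URLs" % (
        pattern, percent(retweets, total),
        percent(retweets_with_urls, retweets)))
```

Whichever pattern is used, roughly a quarter of the retweets contain a URL.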

Suh & Co. found that hashtags were associated with increased retweeting. On a blog one of the authors writes “Want to be Retweeted? Add Hashtags to Your Tweets!”. I doubt that the causal relationship is that simple. I think it is more likely that a common cause (e.g., that the tweet is informative and well-written) leads the tweet both to get hashtag(s) and to be retweeted.

I built a small Python program to do the processing:

from __future__ import division
import pymongo
from re import compile, search, IGNORECASE, UNICODE

# pymongo.Connection was the client class in 2010; later pymongo versions use MongoClient.
connection = pymongo.Connection()
db = connection.twitter
tweets = db.tweets
pattern_url = compile(r"http://", IGNORECASE)
# Three increasingly broad definitions of a retweet.
stringpatterns_retweet = [ r"^RT @", r"^RT\b", r"\bRT\b" ]
patterns_retweet = [ compile(s, UNICODE) for s in stringpatterns_retweet ]
total = 0
withurls = 0
retweets = [ 0 ] * len(stringpatterns_retweet)
retweets_withurls = [ 0 ] * len(stringpatterns_retweet)
# Skip stored documents that have a "delete" field.
for tweet in tweets.find({"delete": {"$exists": False}}):
    total += 1
    if search(pattern_url, tweet.get('text', '')):
        withurls += 1
        urlpresent = True
    else:
        urlpresent = False
    for n in range(len(patterns_retweet)):
        if search(patterns_retweet[n], tweet.get('text', '')):
            retweets[n] += 1
            if urlpresent:
                retweets_withurls[n] += 1
    # Report the running statistics for every 10000 tweets processed.
    if not total % 10000:
        print("""
Total         %23d    100.0%%
With URLs     %23d    %5.1f%%""" % (total, withurls, 100*withurls/total))
        for n in range(len(patterns_retweet)):
            print("""Retweet %20s  %7d    %5.1f%% of total
Retweet with URLs %10s  %7d    %5.1f%% of total
                                         %5.1f%% of retweets
                                         %5.1f%% of tweets with URLs""" % (
                    stringpatterns_retweet[n], retweets[n],
                    100*retweets[n]/total,
                    stringpatterns_retweet[n],
                    retweets_withurls[n], 100*retweets_withurls[n]/total,
                    100*retweets_withurls[n]/retweets[n],
                    100*retweets_withurls[n]/withurls))
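To see how the three retweet patterns differ, here is a small sketch that runs them on a few example tweets. The example texts are invented for illustration and are not taken from the database; the helper name `matching_patterns` is also my own.

```python
import re

# The three retweet patterns, from narrowest to broadest.
stringpatterns_retweet = [r"^RT @", r"^RT\b", r"\bRT\b"]

# Invented example tweets.
samples = [
    "RT @anders: hello",
    "RT: @anders hello",
    "so true! RT @anders hello",
    "ART @anders",
]


def matching_patterns(text):
    """Return the retweet patterns that match the given tweet text."""
    return [p for p in stringpatterns_retweet if re.search(p, text)]


for text in samples:
    print("%-28s %s" % (text, matching_patterns(text)))
```

Only the broadest pattern catches a retweet marker in the middle of a tweet, which is why it gives the highest retweet fraction.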

(Note 21. september 2010: I had some problems displaying the results and code on this blog homepage. I have now re-edited parts of the text, and I hope my edits didn’t disturb the original message or introduce errors. For the Python code see also a more recent blog post)