With Professor Lars Kai Hansen I am presently looking into retweeting on Twitter. A 2010 scientific article, “Want to be retweeted? Large scale Analytics on factors impacting retweet in Twitter network” by Suh, Hong, Pirolli and Ed H. Chi, examined what effect the variables hashtag, “@”, number of followers, number of followees, age of account, number of tweets and number of favorited tweets have on whether a tweet is retweeted.

The article also points to Dan Zarrella’s previous writings. He has a blog as well as the slides The Science of ReTweets. Zarrella reports (on page 11 of the slides) statistics on the fraction of retweets with URLs, and it is well over 50%. Suh & Co. quote his figure as 56.69%, to be exact.

This fraction does not fit with what Suh & Co. find. They say only 28.4% of retweets have URLs.

To investigate this discrepancy I looked into the tweets I had downloaded. The tweets were downloaded with the streaming method provided by Twitter, which I heard of through Bjarne Ørum Wahlgreen. Furthermore, I am at the moment using the MongoDB NoSQL database for storage (I used SQLite before). This means that downloading and storing can be written as a single Unix command line (with a tip from Eliot):

curl -u: | mongoimport -d twitter -c tweets
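The pipe works because each line of the stream can be treated as one document. Below is a minimal Python sketch of that idea, assuming the stream carries one JSON object per line, possibly with blank lines in between, and that deletion notices arrive as objects with a "delete" key. The sample input and the helper name parse_stream are invented for illustration.

```python
import io
import json


def parse_stream(lines):
    """Yield one dict per non-empty line of a line-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines between documents
            continue
        yield json.loads(line)


# Invented sample input standing in for the output of curl.
sample = io.StringIO(
    '{"id": 1, "text": "hello world"}\n'
    '\n'
    '{"delete": {"status": {"id": 1}}}\n'
)

documents = list(parse_stream(sample))
# Keep only real tweets; deletion notices carry a "delete" key instead of text.
tweets = [d for d in documents if "delete" not in d]
print(len(documents), len(tweets))
```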

I have only a bit above 330’000 tweets in my database at the moment, but my results align better with Suh & Co. than with Zarrella. The result depends on how a retweet is matched. With my broadest match, 25.2% of retweets contain a URL.
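To make the dependence on the matching concrete, here is a small sketch with three regular expressions of increasing breadth: `^RT @`, `^RT\b` and `\bRT\b`, where `\b` is a word boundary. The example tweets and the pattern names are invented for illustration and are not taken from the downloaded data.

```python
import re

# Three retweet definitions, from strictest to broadest.
PATTERNS = {
    "strict": re.compile(r"^RT @"),      # tweet starts with "RT @"
    "leading": re.compile(r"^RT\b"),     # tweet starts with the word "RT"
    "anywhere": re.compile(r"\bRT\b"),   # the word "RT" occurs anywhere
}

# Invented example tweets.
EXAMPLES = [
    "RT @alice: new paper on retweeting",
    "RT: new paper on retweeting",
    "Nice one! RT @alice: new paper on retweeting",
    "ART is long, life is short",
]


def matches(text):
    """Return the names of the patterns that match the tweet text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]


for text in EXAMPLES:
    print("%-46s %s" % (text, matches(text)))
```

The third example is only counted as a retweet by the broadest pattern, and the last one by none of them, since the word boundary keeps “ART” from matching.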

Furthermore, I find that the fraction of tweets with URLs is 19.1%, which agrees with both Zarrella and Suh & Co., who both report around 20%. I find the fraction of retweets to the total to be in the range 9-16%.

The detailed results are here:

Total                     330000 100.0%
With URLs                  62901  19.1%
Retweet ^RT @              30950   9.4% of total
Retweet with URLs ^RT @     7692   2.3% of total
                                  24.9% of retweets
                                  12.2% of tweets with URLs
Retweet ^RT\b              31516   9.6% of total
Retweet with URLs ^RT\b     7879   2.4% of total
                                  25.0% of retweets
                                  12.5% of tweets with URLs
Retweet \bRT\b             52633  15.9% of total
Retweet with URLs \bRT\b   13239   4.0% of total
                                  25.2% of retweets
                                  21.0% of tweets with URLs
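As a check on the arithmetic, the percentages can be recomputed from the counts in the table. This is only a sketch: the counts are copied from the table above, the retweet definitions are written as regular expressions with `\b` as a word boundary, and the helper name pct is made up.

```python
# Counts copied from the results table.
total = 330000
with_urls = 62901
# Retweet definition -> (retweets, retweets with URLs).
counts = {
    r"^RT @": (30950, 7692),
    r"^RT\b": (31516, 7879),
    r"\bRT\b": (52633, 13239),
}


def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


print("With URLs: %.1f%% of total" % pct(with_urls, total))
for pattern, (retweets, retweets_with_urls) in counts.items():
    print("%-7s %.1f%% of total are retweets, %.1f%% of retweets have URLs" % (
        pattern, pct(retweets, total), pct(retweets_with_urls, retweets)))
```

Whichever definition is used, roughly a quarter of the retweets contain a URL.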

Suh & Co. found that hashtags were associated with increased retweeting. On a blog one of the authors writes “Want to be Retweeted? Add Hashtags to Your Tweets!”. I doubt that the causal relationship is that simple. I think it is more likely that a common cause (e.g., that the tweet is informative and well-written) makes the tweet both more likely to get hashtag(s) and more likely to be retweeted.
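A small calculation shows how such a shared underlying factor could produce the association on its own. All probabilities below are made up purely for illustration and are not estimated from any data. In this toy model the retweet probability depends only on tweet quality and never on the hashtag.

```python
# Hypothetical probabilities, chosen only to illustrate the argument.
p_quality = 0.3                         # P(tweet is informative and well-written)
p_hashtag = {True: 0.6, False: 0.1}     # P(hashtag | quality)
p_retweet = {True: 0.3, False: 0.05}    # P(retweet | quality), independent of hashtag


def retweet_rate(hashtag):
    """P(retweet | hashtag present or absent), summing over tweet quality."""
    joint = 0.0      # P(retweet and this hashtag state)
    marginal = 0.0   # P(this hashtag state)
    for quality, p_q in ((True, p_quality), (False, 1 - p_quality)):
        p_h = p_hashtag[quality] if hashtag else 1 - p_hashtag[quality]
        joint += p_q * p_h * p_retweet[quality]
        marginal += p_q * p_h
    return joint / marginal


print("P(retweet | hashtag)    = %.2f" % retweet_rate(True))
print("P(retweet | no hashtag) = %.2f" % retweet_rate(False))
```

With these invented numbers, tweets with hashtags are retweeted at a rate of 0.23 against 0.09 for tweets without, even though adding a hashtag changes nothing in the model.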

I built a small Python program to do the processing:

from __future__ import division
import pymongo
from re import compile, search, IGNORECASE, UNICODE

# pymongo.Connection was removed in PyMongo 3; MongoClient replaces it
connection = pymongo.MongoClient()
db = connection.twitter
tweets = db.tweets
pattern_url = compile(r"http://", IGNORECASE)
# Three retweet definitions, from strictest to broadest (\b is a word boundary)
stringpatterns_retweet = [ r"^RT @", r"^RT\b", r"\bRT\b" ]
patterns_retweet = [ compile(s, UNICODE) for s in stringpatterns_retweet ]
total = 0
withurls = 0
retweets = [ 0 ] * len(stringpatterns_retweet)
retweets_withurls = [ 0 ] * len(stringpatterns_retweet)
# Skip documents that have a "delete" field
for tweet in tweets.find({"delete": {"$exists": False}}):
    total += 1
    text = tweet.get('text', '')
    if search(pattern_url, text):
        withurls += 1
        urlpresent = True
    else:
        urlpresent = False
    for n in range(len(patterns_retweet)):
        if search(patterns_retweet[n], text):
            retweets[n] += 1
            if urlpresent:
                retweets_withurls[n] += 1
    # Print a report for every 10000 tweets processed
    if not total % 10000:
        print("%-24s %7d 100.0%%" % ("Total", total))
        print("%-24s %7d %5.1f%%" % ("With URLs", withurls, 100*withurls/total))
        for n in range(len(patterns_retweet)):
            label = stringpatterns_retweet[n]
            print("%-24s %7d %5.1f%% of total" % ("Retweet " + label, retweets[n], 100*retweets[n]/total))
            print("%-24s %7d %5.1f%% of total" % ("Retweet with URLs " + label, retweets_withurls[n], 100*retweets_withurls[n]/total))
            print("%33s%5.1f%% of retweets" % ("", 100*retweets_withurls[n]/retweets[n]))
            print("%33s%5.1f%% of tweets with URLs" % ("", 100*retweets_withurls[n]/withurls))

(Note, 21 September 2010: I had some problems displaying the results and code on this blog homepage. I have now re-edited parts of the text, and I hope my edits didn’t disturb the original message or introduce errors. For the Python code, see also a more recent blog post.)


