In our Responsible Business in the Blogosphere project we are mining the blogosphere. So far we have mostly considered the microblogosphere as represented by Twitter. We have written two research articles on that topic: Good Friends, Bad News – Affect and Virality in Twitter and A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.
One of the reasons why we focus on Twitter is that the data we can get is structured. The Twitter web service returns a structure in the JSON format that is easy to handle. General blogs are somewhat more difficult to handle: First you need to find the blogs and then you need to extract the relevant data from the webpage. This is not particularly easy, though some interesting tools exist for these tasks.
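To give an idea of why JSON is easy to handle, here is a minimal sketch using Python's standard json module. The sample string is hand-written for illustration and is not real data; it only borrows a few field names (text, retweet_count, user, screen_name) from the Twitter format.

```python
import json

# Hand-written sample for illustration only, not real Twitter data.
SAMPLE_TWEET = '''{
  "text": "Mining the blogosphere",
  "retweet_count": 3,
  "user": {"screen_name": "example_user", "followers_count": 120}
}'''

# One call turns the response into nested dictionaries and lists,
# with numbers already converted to Python integers.
tweet = json.loads(SAMPLE_TWEET)

print(tweet['user']['screen_name'], tweet['retweet_count'])
```

There is no scraping or HTML clean-up involved, which is the main difference from general blogs.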
There are some blogsites that provide structured information relatively easily. The blogsite that I use, Posterous, provides an API that lets users and programmers download information. There are actually two versions: The old one provides information in the XML format, the newer one in JSON.
In my initial effort to make something useful I looked at my own blog using the old API. You need to call a URL with something like:
XML is returned. I did not manage to parse the XML in a structured way (using standard Python libraries) but used an ad hoc approach to turn the XML into a JSON-like structure with the numerical fields converted to numbers and the ‘body’ field with the actual text maintained as HTML. Apart from the postings themselves there are substructures for comments and media files that you might want to handle.
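For comparison, here is a sketch of how a structured parse could look with xml.etree.ElementTree from the standard library. This is not the ad hoc approach I actually used. The sample XML and its element names (rsp, post, views, body, comment) are assumptions made up for the illustration and not the documented Posterous schema, and to_number and element_to_dict are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the element names are assumptions, not the real schema.
SAMPLE_XML = """<rsp stat="ok">
  <post>
    <id>101</id>
    <title>First post</title>
    <date>Mon, 03 Jan 2011 10:15:00 -0800</date>
    <views>42</views>
    <body><![CDATA[<p>Hello <b>world</b></p>]]></body>
    <comment>
      <body>Nice</body>
    </comment>
  </post>
</rsp>"""


def to_number(text):
    """Return text as an int when possible, otherwise unchanged."""
    try:
        return int(text)
    except ValueError:
        return text


def element_to_dict(element):
    """Turn the children of an element into a JSON-like dictionary."""
    record = {}
    for child in element:
        if len(child):
            # Substructure (e.g. a comment): collect under a plural key.
            record.setdefault(child.tag + 's', []).append(element_to_dict(child))
        else:
            # Leaf field: numbers are converted, HTML text is kept as is.
            record[child.tag] = to_number((child.text or '').strip())
    return record


def parse_posts(xml_text):
    """Return one dictionary per <post> element in the response."""
    root = ET.fromstring(xml_text)
    return [element_to_dict(post) for post in root.iter('post')]


posterous = parse_posts(SAMPLE_XML)
print(posterous[0]['views'], posterous[0]['body'])
```

The result has the same shape as described above: a list of dictionaries with numerical fields as numbers, the body kept as HTML, and comments as a sublist.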
In this first application I managed to plot the number of views of each blog post as a function of the date. The two articles that got the most views are one about the Milena Penkowa case and one about Natalie Portman, who was in the news due to her recent film Black Swan. Earlier articles that received a substantial number of views, more than normal, were more nerdish accounts of my problems with Ubuntu. My most recent articles have a fairly low number of views. I have several theories why that is.
How easy it is to crawl all Posterous blogs I do not yet know. Compared to Twitter the data you get are less social. In Twitter you have loads of retweets and direct messages between users that you can analyze. In Posterous you do have what corresponds to friends and followers by what is called ‘my subscriptions’. You also have comments.
The Python code that does the plotting is here:
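import datetime
import dateutil.parser
import pytz
from pylab import plot, xlabel, ylabel, title, text

# 'posterous' is the list of post dictionaries built from the API response
views = [p['views'] for p in posterous]
now = datetime.datetime.now(pytz.utc)
dates = [dateutil.parser.parse(p['date']) for p in posterous]
since = [(now - date).days for date in dates]

plot(since, views, 'yo-')
ylabel('Views')
xlabel('Days since now')
title('Views on Posterous')

# Label only the posts above a hand-tuned line; one alignment per labeled post
labeled = [(d, v, p) for d, v, p in zip(since, views, posterous)
           if v > -6.64 * d + 5000]
for (d, v, p), a in zip(labeled, ['left', 'left', 'center', 'left']):
    text(d, v, p['title'][:31] + '...', horizontalalignment=a)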
The last for-loop is at least PG-13 rated and should not be attempted at home.