network analysis

Brede Wiki co-author graph for Erdős-number < 5

Image Posted on Updated on

Brede Wiki co-author graph for Erdős-number less than 5

Graph of co-authors from data in the Brede Wiki. Only the authors with a Paul Erdős-number less than 5 are shown. Paul Erdős is in the lower left corner; Karl J. Friston at the top with Erdős-number 4.

Interestingly John J. Sidtis has an Erdős-number on 3 from two independent paths.

I am/was a bit suspicious about Mr. “J. Gilles”. I cannot find much information about him. I got this path from the Erdős–Bacon number article on Wikipedia which leads to Jonathan D. Victor’s page. I suppose Professor Victor knows his coauthors, so there shouldn’t be a problem. J. Gilles is according to Victor “Joseph Gilles”. One Internet page states “During the war Gillis made a significant contribution to code breaking at Bletchley Park” (http://www.jameshamiltonphysicist.com/5-inst-and-queens/). Possibly the same person.

Sentiment colored sequential collaboration network

Posted on Updated on

Nielsen2013realtime

Sentiment colored sequential collaboration network of some of the Wikipedians editing the Wikipedia articles associated with the Lundbeck company. Red are negative sentiment, green are positive.

The “sequential collaboration network” is inspired by Analyzing the creative editing behavior of Wikipedia editors: through dynamic social network analysis. Brian Keegan has also done similar kind of network visualization.

Sentiment analysis is based on the AFINN word list.

Graph spectra with NetworkX

Posted on Updated on

Nielsen2012python_graphspectru

I was looking for a value of how clustered a network is. I thought that somewhere in graph spectrum was a good place to start and that in the Python package NetworkX there would be some useful methods. However, I couldn’t immediately see any good methods in NetworkX. Then Morten Mørup mentioned something about community detection and modularity and I became diverged, but now I am back again at the graph spectrum.

The second smallest eigenvalue of the Laplacian matrix of the graph seems to represent reasonably well what I was looking for. Apparently that eigenvalue is called the Algebraic connectivity.

NetworkX has a number of graph generators, and for small test cases the algebraic connectivity seems to give an ok value for how clustered the network is, – or rather how non-clustered it is.

Nine, five, six, twelve or eight degrees of separation in Brede Wiki and four in Facebook

Posted on Updated on

Some days ago the world press was abuzz with the study on the Facebook friend graph, that found the average distance between active Facebook users to be 4.74, i.e., almost 5, meaning that there are on average 4 Facebook linked friends separating one Facebook user from another. See also brief summary on the Brede Wiki.

There are standard algorithms on the shelve to compute the distance for small graphs, but because the Facebook graph is so huge you/they need special algorithms. First author Lars Backstrom employed at Facebook (that gave a keynote at 8th Extended Semantic Web Conference about the Facebook friend suggester) had the Facebook data and got hold on an algorithm from Milano researchers that could handle the 871 million active Facebook users and their 69 milliards friendship links.

In a previous study the Milano researchers examined the “spid”, i.e., the variance-to-mean ratio of the distances. They claim that “spid larger than one are to be considered ‘web-like’, whereas networks with a spid smaller than one are to be considered ‘properly social’ and demonstrated that on a number of social and web networks. The Facebook study found a spid on 0.08.

I am confused somewhat by the notion of six degrees of separation. Firstly, does “degrees of separation” mean the number of persons (nodes) or the number of friendships (edge) between a source person and a target persen? Backstrom a Co. “will assume that ‘degree of separation’ is the same as ‘distance minus one’.”, that is, we are counting the persons (nodes) between source person and target person. Another problem is whether the “six” refers to

  1. the average distance between all pairs,
  2. the maximum of the average distance for each person,
  3. the maximum distance between all pairs (the diameter), or
  4. the average eccentricity; the eccentricity being the maximum distance for each person to any other person.

If you look on the first sentence on the present version of the Wikipedia article I think it alludes to the first interpretation. Playwright John Guare’s six degrees seem rather to be the third interpretation.

Bredewiki-coauthor

With the co-authorship graph from the Brede Wiki I can computate these different distances. The co-authors are not fully connected but the largest connected components has 665 names, which resolve to somewhat below 665 people (I got uncorrected problems with, e.g., “Saffron A. Willis-Owen”/”Saffron A. G. Willis-Owen”). On this graph I find the mean distance to be 5.65, the mean eccentricity to be 9.37 and the diameter 12. Computing the spid I get 0.73, i.e., a “social network” according to the Milano criterion.

I wonder why the average Facebook distance is so low. Jon Kleinberg mentions “weak ties”. Some of my Facebook friends are linked to public figures in Denmark. Could it be that Facebook users tend to connect with famous persons and that these famous people tend to act as hubs? Another phenomenon that I think I noticed on Facebook is that when people travel abroad and have a cursory acquaintanceship they tend to friendship on Facebook, perhaps as a kind of token and reminder. Are such brief encounters actually there and important for the low average distance?

https://gist.github.com/1392607

(2012-01-16: Language correction)

Natalie Portman in NetworkX

Posted on Updated on

I have previosly written about network mining in a co-author graph in connection with the actress Natalie Portman and her NeuroImage article as well as recently written about co-author mining with the data in the Brede Wiki. Now with the data from the Brede Wiki and NetworkX it is quite easy to find the shortest path between authors once the co-author data is represented in a NetworkX object. It is just a oneline Python:

path = nx.shortest_path(g, u"Finn Årup Nielsen", "Natalie Herslag")

Once printed with “for author in path: print(author)” you get:
Finn Årup Nielsen
Nicholas Lange
Bruce M. Cohen
Abigail A. Baird
Natalie Herslag

I presently have to misspell Natalie Portman’s surname because I entered it incorrectly in the Brede Wiki for some reason.


# wget http://neuro.imm.dtu.dk/services/bredewiki/download/bredewiki-templates.sqlite3
import matplotlib.pyplot as plt
import networkx as nx
from pysqlite2 import dbapi2
connection = dbapi2.Connection('bredewiki-templates.sqlite3')
sql = "SELECT DISTINCT tid FROM brede WHERE (template='paper' OR template='conference_paper');"
cursor = connection.cursor()
cursor.execute(sql)
tids = [ row[0] for row in cursor.fetchall() ]
g = nx.Graph()
sql = "SELECT value FROM brede WHERE tid=? AND field='author'";
for tid in tids:
cursor.execute(sql, (tid,))
authors = [ row[0] for row in cursor.fetchall() ]
for n in range(len(authors)):
for m in range(len(authors)):
g.add_edge(authors[n], authors[m])
path = nx.shortest_path(g, u"Finn Årup Nielsen", "Natalie Herslag")
for author in path: print(author)

view raw

gistfile1.py

hosted with ❤ by GitHub

Co-author mining for papers in the Brede Wiki

Posted on Updated on

Coauthors

With the SQLite file generated from the Brede Wiki it is relatively easy to perform some simple co-author mining. First one needs to download the SQLite file from the Brede Wiki download site. Here with the unix program ‘wget’:

wget http://neuro.imm.dtu.dk/services/bredewiki/download/bredewiki-templates.sqlite3

For a start lets find the author listed with most papers in the Brede Wiki. Starting the sqlite3 client program:

sqlite3 bredewiki-templates.sqlite3

After setup (sqlite> .mode column, sqlite> .width 25) finding the most frequent mentioned authors is one line of SQL:

sqlite> SELECT value, COUNT(*) AS c FROM brede WHERE (template='paper' OR template='conference_paper') AND field = 'author' GROUP BY value ORDER BY c DESC LIMIT 10;

Finn Årup Nielsen 26
Gitte Moos Knudsen 23
Lars Kai Hansen 18
Claus Svarer 15
Olaf B. Paulson 15
Vibe Gedsø Frøkjær 13
Russell A. Poldrack 11
David Erritzøe 10
Richard S. J. Frackowiak 10
William F. C. Baaré 9

That is, e.g., ‘Russ Poldrack‘ is listed with presently 11 papers in the Brede Wiki. For performing a visualization of the co-authors one can query the SQLite database from within Python, first getting the pages with ‘paper’ and ‘conference paper’ template, then query for authors in each of these page and adding the co-authors to a NetworkX graph and draw the graph via GraphViz. The image shows the fourth connected component with 33 authors centered around Jeffrey A. Lieberman. The first connected component has 1490 authors. This number is much higher than the number of researchers in the Brede Wiki that each has a page on the own (520), see the Researcher category.

Also PageRank computation on the co-author graph is straightforward once the data is in the NetworkX graph:

import operator
for a, p in sorted(nx.pagerank(g).iteritems(), key=operator.itemgetter(1))[:-11:-1]: print('%.5f %s' % (p, a))


0.00181 Gitte Moos Knudsen
0.00176 Richard S. J. Frackowiak
0.00151 Russell A. Poldrack
0.00142 Finn Årup Nielsen
0.00137 Edward T. Bullmore
0.00134 Klaus-Peter Lesch
0.00134 Karl J. Friston
0.00130 Olaf B. Paulson
0.00128 Thomas E. Nichols
0.00121 Peter T. Fox

https://gist.github.com/1373887

(Note 2011-11-18: There is an error as ‘pid’ should have been ‘tid’, i.e., “SELECT DISTINCT tid FROM…”. Using ‘pid’ instead of ‘tid’ will find all authors on a wikipage so also counting those that are cited within the ‘cite journal’ template)

(2012-01-16: Language correction)

Social media paranoia

Posted on Updated on

In my effort to be updated on and investigate social media I got an account on one of the large Facebook-wannabee websites with social network facilities. I knew of noone on the site and was therefore fairly surprised when its friend-suggester came up with a person that I knew, – and only that person! How was the website able to know I was connected to that person? There is excessively little information on the public Internet to connect me with that person. The person is in another place, is another age and is in another business. If I google on the public web I find no pages that mention me and the person on the same page. The way that I logged into the social website was independent of other social web-sites: I didn’t explicitly tell the website about my other accounts on Twitter, Facebook, MySpace, LinkedIn, Xing, Tumblr or Posterous. Thus it could not get access to my social network through me, so the website must have gotten this relatively private information from somewhere. How?

I will come back to that. First a bit on other issues of privacy.

I recently went to the Extended Semantic Web Conference where Abe Hsuan provided one of the fine keynote talks. He focused on privacy on the Web and the “Data Valdez”. Among the topic he addressed were:

  • The Dog Poop Girl from Seoul who was photographed by an anonymous subway passenger. The girl’s dog had shit on the floor of the Seoul subway train and the girl was so embarrassed that she left it there. As the photo was released an Internet storm arose against the poor girl, her identity and personal details being revealed.
  • In the AOL Data Valdez the company released 20 million Web search queries in 2006 and with a bit of compiling journalist could reveal the identity of individual users, e.g., a 62-year old woman and her search queries on “60 single men” and other personal searches.
  • Netflix Personalization Challenge where researchers could break the anonymization in the video rental company’s data by correlating data with IMDb.
  • Pandora’s Android App that appears to send user’s birth date, gender and GPS information to advertising companies according to an analysis by Veracode.

Hsuan also pointed to a whole series of companies that specializes in correlating information across and beyond the Web: bluecava, Blue Kai, Epsilon/Abacus, TargusInfo, brilig.com, Sense Networks, Ingenix (prescription drug history, therapeutic outcomes and billing information), face.com (facial recognition technology). In April 2011 researchers reported that Apple devices stored lists of locations with timestamps without the user acknowledgment. This is just to help the user to get faster geolocation through wifi and mobil phone tower data rather than slow GPS According to Apple. However, with access to the unencrypted backup of the device you will be able to observe the travels of the device user.

Google got itself into a lawsuit after collecting and transmitting location data on the Android platform, see here.

Revealing too much about your location in public may give thieves a good opportunity and a Danish insurance company advices users to remove the Facebook Places application. There is an asymmetry in knowledge: The thieves know when you are away from your house, but the thieves are not willing to reveal that they are in your house. The interesting website Social Clusters by Morten Barklund allows you to make intelligent visualizations of your friend network from Facebook. To enable that you have to reveal your friend network to the Web service. Though eager to try it out, I was too reluctant to reveal my network. Regardless, I am probably already in the database as some people in my friend network on Facebook have signed up, i.e., I am not among the presently 102 registered users but very likely among the presently 28’513 connected persons. Registration may not be necessary to reveal you friend network. If one among your Facebook friends has an open profile some information about you is revealed even if you have a closed profile. According to a study a third of Danes on social networks regularly upload photos of people other than themselves. And among these a fourth has an open profile. So if you have more than 12 friends there is a fair chance/risk that an image of you is accessible to non-friends even if you have a closed profile and never uploaded images of yourself.

Now back to the social website that guessed right with its friend suggester. How did it do it? Here are some suggestions:

  • Facebook could have revealed my friend network to the website. This option is unlikely given that Facebook and the website is competitors.
  • The website could have obtained information from the public information on Facebook. I think this is also unlikely. Facebook would not allow a competitor to crawl its website to acquire the friend network.
  • Before I understood that Facebook applications were actually thirdparties and not just keep the data within Facebook I added a few applications. One among them was the Friend Wheel. I do not know what the Friend Wheel does with my data, but I don’t think that it has got to the social website.
  • A likely path for the data is that the other person logged in via Facebook so the other social website could get hold on the Facebook friend network. As I was in this network and my name is pretty unique the social website could match up my name with the name in the friend network.

Navigating the Natalie Portman graph: Finding a co-author path to a NeuroImage author

Posted on Updated on

Portman

Hollywood actress Natalie Portman I first remarked in the Mike Nichols 2004 film Closer. According to rumor on the Internet a few years before Closer she co-authored a functional neuroimaging scientific article called Frontal lobe activation during object permanence: data from near-infrared spectroscopy. She was attributed as Natalie Hershlag.

I have written before of data mining a co-author graph for the Erdös number and “Hayashi” number, and I wondered if it would be possible to find a co-author path from Portman to me. And indeed yes.

Abigail A. Baird first-authored Portman’s article, and the article Functional magnetic resonance imaging of facial affect recognition in children and adolescents has Abigail Baird and psychiatry professor Bruce M. Cohen among the authors. Bruce M. Cohen and Nicholas Lange is among the co-authors on Structural brain magnetic resonance imaging of limbic and thalamic volumes in pediatric bipolar disorder and Lange and I are linked through our Plurality and resemblance in fMRI data analysis, — an article that contrasted different fMRI analysis methods.

So the co-author path between Portman and me is: Portman – Baird – Cohen – Lange – me, which bring my “Portman number” to 4.

Navigating a graph is a general problem if you only know the local connections. There has even been written scientific articles about it, e.g., Jon Kleinberg‘s Navigating in a small world. When a human (such as I) navigate a social graph such as the co-author graph of scientific articles one can utilize auxillary information, here the information about where a researcher has worked, what his/her interest are and how prominent the researcher is (how many co-authors s/he has). As Portman worked from Harvard a good guess would be to start looking among my co-authors that are near Harvard. Nicholas Lange is from Harvard and we collaborated in the American funded Human Brain Project. I knew that radiology professor Bruce R. Rosen was/is a central figure in Boston MRI researcher, so I thought that there might be a productive connection from him, — both to Lange and to Portman. Portman’s co-author Baird is professor and has written some neuroimaging papers, so among Portman’s co-authors Baird was probably the one that could lead to a path. While searching among Lange and Baird co-authors I confused Bruce Rosen and Bruce Cohen (their Hamming distance is not great). This error proved fertile.

If I didn’t run into Cohen and really wanted to find a path between Portman and me then I think a more automated and brute force method could have been required. One way would be to query PubMed and put the co-author graph into NetworkX which is a Python package. It has a shortest path algorithm. Joe Celko in his book SQL for Smarties: Advanced SQL programming shows a shortest path algorithm in SQL. That might be an alternative to NetworkX.

(Photo: gdcgraphics, CC-by, taken from Wikimedia Commons)