Image Posted on Updated on
Graph of co-authors from data in the Brede Wiki. Only the authors with a Paul Erdős-number less than 5 are shown. Paul Erdős is in the lower left corner; Karl J. Friston at the top with Erdős-number 4.
Interestingly John J. Sidtis has an Erdős-number on 3 from two independent paths.
I am/was a bit suspicious about Mr. “J. Gilles”. I cannot find much information about him. I got this path from the Erdős–Bacon number article on Wikipedia which leads to Jonathan D. Victor’s page. I suppose Professor Victor knows his coauthors, so there shouldn’t be a problem. J. Gilles is according to Victor “Joseph Gilles”. One Internet page states “During the war Gillis made a significant contribution to code breaking at Bletchley Park” (http://www.jameshamiltonphysicist.com/5-inst-and-queens/). Possibly the same person.
Sentiment colored sequential collaboration network of some of the Wikipedians editing the Wikipedia articles associated with the Lundbeck company. Red are negative sentiment, green are positive.
The “sequential collaboration network” is inspired by Analyzing the creative editing behavior of Wikipedia editors: through dynamic social network analysis. Brian Keegan has also done similar kind of network visualization.
Sentiment analysis is based on the AFINN word list.
I was looking for a value of how clustered a network is. I thought that somewhere in graph spectrum was a good place to start and that in the Python package NetworkX there would be some useful methods. However, I couldn’t immediately see any good methods in NetworkX. Then Morten Mørup mentioned something about community detection and modularity and I became diverged, but now I am back again at the graph spectrum.
The second smallest eigenvalue of the Laplacian matrix of the graph seems to represent reasonably well what I was looking for. Apparently that eigenvalue is called the Algebraic connectivity.
NetworkX has a number of graph generators, and for small test cases the algebraic connectivity seems to give an ok value for how clustered the network is, – or rather how non-clustered it is.
Some days ago the world press was abuzz with the study on the Facebook friend graph, that found the average distance between active Facebook users to be 4.74, i.e., almost 5, meaning that there are on average 4 Facebook linked friends separating one Facebook user from another. See also brief summary on the Brede Wiki.There are standard algorithms on the shelve to compute the distance for small graphs, but because the Facebook graph is so huge you/they need special algorithms. First author Lars Backstrom employed at Facebook (that gave a keynote at 8th Extended Semantic Web Conference about the Facebook friend suggester) had the Facebook data and got hold on an algorithm from Milano researchers that could handle the 871 million active Facebook users and their 69 milliards friendship links. In a previous study the Milano researchers examined the “spid”, i.e., the variance-to-mean ratio of the distances. They claim that “spid larger than one are to be considered ‘web-like’, whereas networks with a spid smaller than one are to be considered ‘properly social’ and demonstrated that on a number of social and web networks. The Facebook study found a spid on 0.08.
I am confused somewhat by the notion of six degrees of separation. Firstly, does “degrees of separation” mean the number of persons (nodes) or the number of friendships (edge) between a source person and a target persen? Backstrom a Co. “will assume that ‘degree of separation’ is the same as ‘distance minus one’.”, that is, we are counting the persons (nodes) between source person and target person. Another problem is whether the “six” refers to
- the average distance between all pairs,
- the maximum of the average distance for each person,
- the maximum distance between all pairs (the diameter), or
- the average eccentricity; the eccentricity being the maximum distance for each person to any other person.
If you look on the first sentence on the present version of the Wikipedia article I think it alludes to the first interpretation. Playwright John Guare’s six degrees seem rather to be the third interpretation.With the co-authorship graph from the Brede Wiki I can computate these different distances. The co-authors are not fully connected but the largest connected components has 665 names, which resolve to somewhat below 665 people (I got uncorrected problems with, e.g., “Saffron A. Willis-Owen”/”Saffron A. G. Willis-Owen”). On this graph I find the mean distance to be 5.65, the mean eccentricity to be 9.37 and the diameter 12. Computing the spid I get 0.73, i.e., a “social network” according to the Milano criterion. I wonder why the average Facebook distance is so low. Jon Kleinberg mentions “weak ties”. Some of my Facebook friends are linked to public figures in Denmark. Could it be that Facebook users tend to connect with famous persons and that these famous people tend to act as hubs? Another phenomenon that I think I noticed on Facebook is that when people travel abroad and have a cursory acquaintanceship they tend to friendship on Facebook, perhaps as a kind of token and reminder. Are such brief encounters actually there and important for the low average distance?
(2012-01-16: Language correction)
I have previosly written about network mining in a co-author graph in connection with the actress Natalie Portman and her NeuroImage article as well as recently written about co-author mining with the data in the Brede Wiki. Now with the data from the Brede Wiki and NetworkX it is quite easy to find the shortest path between authors once the co-author data is represented in a NetworkX object. It is just a oneline Python:
path = nx.shortest_path(g, u"Finn Årup Nielsen", "Natalie Herslag")
Once printed with “for author in path: print(author)” you get:
Finn Årup Nielsen
Bruce M. Cohen
Abigail A. Baird
I presently have to misspell Natalie Portman’s surname because I entered it incorrectly in the Brede Wiki for some reason.
With the SQLite file generated from the Brede Wiki it is relatively easy to perform some simple co-author mining. First one needs to download the SQLite file from the Brede Wiki download site. Here with the unix program ‘wget’:
For a start lets find the author listed with most papers in the Brede Wiki. Starting the sqlite3 client program:
After setup (sqlite> .mode column, sqlite> .width 25) finding the most frequent mentioned authors is one line of SQL:
sqlite> SELECT value, COUNT(*) AS c FROM brede WHERE (template='paper' OR template='conference_paper') AND field = 'author' GROUP BY value ORDER BY c DESC LIMIT 10;
Finn Årup Nielsen 26
Gitte Moos Knudsen 23
Lars Kai Hansen 18
Claus Svarer 15
Olaf B. Paulson 15
Vibe Gedsø Frøkjær 13
Russell A. Poldrack 11
David Erritzøe 10
Richard S. J. Frackowiak 10
William F. C. Baaré 9
That is, e.g., ‘Russ Poldrack‘ is listed with presently 11 papers in the Brede Wiki. For performing a visualization of the co-authors one can query the SQLite database from within Python, first getting the pages with ‘paper’ and ‘conference paper’ template, then query for authors in each of these page and adding the co-authors to a NetworkX graph and draw the graph via GraphViz. The image shows the fourth connected component with 33 authors centered around Jeffrey A. Lieberman. The first connected component has 1490 authors. This number is much higher than the number of researchers in the Brede Wiki that each has a page on the own (520), see the Researcher category.
Also PageRank computation on the co-author graph is straightforward once the data is in the NetworkX graph:
for a, p in sorted(nx.pagerank(g).iteritems(), key=operator.itemgetter(1))[:-11:-1]: print('%.5f %s' % (p, a))
0.00181 Gitte Moos Knudsen
0.00176 Richard S. J. Frackowiak
0.00151 Russell A. Poldrack
0.00142 Finn Årup Nielsen
0.00137 Edward T. Bullmore
0.00134 Klaus-Peter Lesch
0.00134 Karl J. Friston
0.00130 Olaf B. Paulson
0.00128 Thomas E. Nichols
0.00121 Peter T. Fox
(Note 2011-11-18: There is an error as ‘pid’ should have been ‘tid’, i.e., “SELECT DISTINCT tid FROM…”. Using ‘pid’ instead of ‘tid’ will find all authors on a wikipage so also counting those that are cited within the ‘cite journal’ template)
(2012-01-16: Language correction)