Image Posted on Updated on
Graph of co-authors from data in the Brede Wiki. Only the authors with a Paul Erdős-number less than 5 are shown. Paul Erdős is in the lower left corner; Karl J. Friston at the top with Erdős-number 4.
Interestingly John J. Sidtis has an Erdős-number on 3 from two independent paths.
I am/was a bit suspicious about Mr. “J. Gilles”. I cannot find much information about him. I got this path from the Erdős–Bacon number article on Wikipedia which leads to Jonathan D. Victor’s page. I suppose Professor Victor knows his coauthors, so there shouldn’t be a problem. J. Gilles is according to Victor “Joseph Gilles”. One Internet page states “During the war Gillis made a significant contribution to code breaking at Bletchley Park” (http://www.jameshamiltonphysicist.com/5-inst-and-queens/). Possibly the same person.
I was looking for a value of how clustered a network is. I thought that somewhere in graph spectrum was a good place to start and that in the Python package NetworkX there would be some useful methods. However, I couldn’t immediately see any good methods in NetworkX. Then Morten Mørup mentioned something about community detection and modularity and I became diverged, but now I am back again at the graph spectrum.
The second smallest eigenvalue of the Laplacian matrix of the graph seems to represent reasonably well what I was looking for. Apparently that eigenvalue is called the Algebraic connectivity.
NetworkX has a number of graph generators, and for small test cases the algebraic connectivity seems to give an ok value for how clustered the network is, – or rather how non-clustered it is.
Some days ago the world press was abuzz with the study on the Facebook friend graph, that found the average distance between active Facebook users to be 4.74, i.e., almost 5, meaning that there are on average 4 Facebook linked friends separating one Facebook user from another. See also brief summary on the Brede Wiki.There are standard algorithms on the shelve to compute the distance for small graphs, but because the Facebook graph is so huge you/they need special algorithms. First author Lars Backstrom employed at Facebook (that gave a keynote at 8th Extended Semantic Web Conference about the Facebook friend suggester) had the Facebook data and got hold on an algorithm from Milano researchers that could handle the 871 million active Facebook users and their 69 milliards friendship links. In a previous study the Milano researchers examined the “spid”, i.e., the variance-to-mean ratio of the distances. They claim that “spid larger than one are to be considered ‘web-like’, whereas networks with a spid smaller than one are to be considered ‘properly social’ and demonstrated that on a number of social and web networks. The Facebook study found a spid on 0.08.
I am confused somewhat by the notion of six degrees of separation. Firstly, does “degrees of separation” mean the number of persons (nodes) or the number of friendships (edge) between a source person and a target persen? Backstrom a Co. “will assume that ‘degree of separation’ is the same as ‘distance minus one’.”, that is, we are counting the persons (nodes) between source person and target person. Another problem is whether the “six” refers to
- the average distance between all pairs,
- the maximum of the average distance for each person,
- the maximum distance between all pairs (the diameter), or
- the average eccentricity; the eccentricity being the maximum distance for each person to any other person.
If you look on the first sentence on the present version of the Wikipedia article I think it alludes to the first interpretation. Playwright John Guare’s six degrees seem rather to be the third interpretation.With the co-authorship graph from the Brede Wiki I can computate these different distances. The co-authors are not fully connected but the largest connected components has 665 names, which resolve to somewhat below 665 people (I got uncorrected problems with, e.g., “Saffron A. Willis-Owen”/”Saffron A. G. Willis-Owen”). On this graph I find the mean distance to be 5.65, the mean eccentricity to be 9.37 and the diameter 12. Computing the spid I get 0.73, i.e., a “social network” according to the Milano criterion. I wonder why the average Facebook distance is so low. Jon Kleinberg mentions “weak ties”. Some of my Facebook friends are linked to public figures in Denmark. Could it be that Facebook users tend to connect with famous persons and that these famous people tend to act as hubs? Another phenomenon that I think I noticed on Facebook is that when people travel abroad and have a cursory acquaintanceship they tend to friendship on Facebook, perhaps as a kind of token and reminder. Are such brief encounters actually there and important for the low average distance?
(2012-01-16: Language correction)
I have previosly written about network mining in a co-author graph in connection with the actress Natalie Portman and her NeuroImage article as well as recently written about co-author mining with the data in the Brede Wiki. Now with the data from the Brede Wiki and NetworkX it is quite easy to find the shortest path between authors once the co-author data is represented in a NetworkX object. It is just a oneline Python:
path = nx.shortest_path(g, u"Finn Årup Nielsen", "Natalie Herslag")
Once printed with “for author in path: print(author)” you get:
Finn Årup Nielsen
Bruce M. Cohen
Abigail A. Baird
I presently have to misspell Natalie Portman’s surname because I entered it incorrectly in the Brede Wiki for some reason.
With the SQLite file generated from the Brede Wiki it is relatively easy to perform some simple co-author mining. First one needs to download the SQLite file from the Brede Wiki download site. Here with the unix program ‘wget’:
For a start lets find the author listed with most papers in the Brede Wiki. Starting the sqlite3 client program:
After setup (sqlite> .mode column, sqlite> .width 25) finding the most frequent mentioned authors is one line of SQL:
sqlite> SELECT value, COUNT(*) AS c FROM brede WHERE (template='paper' OR template='conference_paper') AND field = 'author' GROUP BY value ORDER BY c DESC LIMIT 10;
Finn Årup Nielsen 26
Gitte Moos Knudsen 23
Lars Kai Hansen 18
Claus Svarer 15
Olaf B. Paulson 15
Vibe Gedsø Frøkjær 13
Russell A. Poldrack 11
David Erritzøe 10
Richard S. J. Frackowiak 10
William F. C. Baaré 9
That is, e.g., ‘Russ Poldrack‘ is listed with presently 11 papers in the Brede Wiki. For performing a visualization of the co-authors one can query the SQLite database from within Python, first getting the pages with ‘paper’ and ‘conference paper’ template, then query for authors in each of these page and adding the co-authors to a NetworkX graph and draw the graph via GraphViz. The image shows the fourth connected component with 33 authors centered around Jeffrey A. Lieberman. The first connected component has 1490 authors. This number is much higher than the number of researchers in the Brede Wiki that each has a page on the own (520), see the Researcher category.
Also PageRank computation on the co-author graph is straightforward once the data is in the NetworkX graph:
for a, p in sorted(nx.pagerank(g).iteritems(), key=operator.itemgetter(1))[:-11:-1]: print('%.5f %s' % (p, a))
0.00181 Gitte Moos Knudsen
0.00176 Richard S. J. Frackowiak
0.00151 Russell A. Poldrack
0.00142 Finn Årup Nielsen
0.00137 Edward T. Bullmore
0.00134 Klaus-Peter Lesch
0.00134 Karl J. Friston
0.00130 Olaf B. Paulson
0.00128 Thomas E. Nichols
0.00121 Peter T. Fox
(Note 2011-11-18: There is an error as ‘pid’ should have been ‘tid’, i.e., “SELECT DISTINCT tid FROM…”. Using ‘pid’ instead of ‘tid’ will find all authors on a wikipage so also counting those that are cited within the ‘cite journal’ template)
(2012-01-16: Language correction)
Hollywood actress Natalie Portman I first remarked in the Mike Nichols 2004 film Closer. According to rumor on the Internet a few years before Closer she co-authored a functional neuroimaging scientific article called Frontal lobe activation during object permanence: data from near-infrared spectroscopy. She was attributed as Natalie Hershlag.
I have written before of data mining a co-author graph for the Erdös number and “Hayashi” number, and I wondered if it would be possible to find a co-author path from Portman to me. And indeed yes.
Abigail A. Baird first-authored Portman’s article, and the article Functional magnetic resonance imaging of facial affect recognition in children and adolescents has Abigail Baird and psychiatry professor Bruce M. Cohen among the authors. Bruce M. Cohen and Nicholas Lange is among the co-authors on Structural brain magnetic resonance imaging of limbic and thalamic volumes in pediatric bipolar disorder and Lange and I are linked through our Plurality and resemblance in fMRI data analysis, — an article that contrasted different fMRI analysis methods.
So the co-author path between Portman and me is: Portman – Baird – Cohen – Lange – me, which bring my “Portman number” to 4.
Navigating a graph is a general problem if you only know the local connections. There has even been written scientific articles about it, e.g., Jon Kleinberg‘s Navigating in a small world. When a human (such as I) navigate a social graph such as the co-author graph of scientific articles one can utilize auxillary information, here the information about where a researcher has worked, what his/her interest are and how prominent the researcher is (how many co-authors s/he has). As Portman worked from Harvard a good guess would be to start looking among my co-authors that are near Harvard. Nicholas Lange is from Harvard and we collaborated in the American funded Human Brain Project. I knew that radiology professor Bruce R. Rosen was/is a central figure in Boston MRI researcher, so I thought that there might be a productive connection from him, — both to Lange and to Portman. Portman’s co-author Baird is professor and has written some neuroimaging papers, so among Portman’s co-authors Baird was probably the one that could lead to a path. While searching among Lange and Baird co-authors I confused Bruce Rosen and Bruce Cohen (their Hamming distance is not great). This error proved fertile.
If I didn’t run into Cohen and really wanted to find a path between Portman and me then I think a more automated and brute force method could have been required. One way would be to query PubMed and put the co-author graph into NetworkX which is a Python package. It has a shortest path algorithm. Joe Celko in his book SQL for Smarties: Advanced SQL programming shows a shortest path algorithm in SQL. That might be an alternative to NetworkX.