
So what can we use Wikicite for?


[Figure: bubble chart of journal statistics for OpenfMRI papers, 2016-09-19]

Wikicite is a term for the combination of bibliographic information and Wikidata. While Wikipedia often records books of some notability, it rarely records bibliographic information of less notability, i.e., individual scientific articles and books for which little third-party information (reviews, literary analyses, etc.) exists. This is not the case with Wikidata. Wikidata is now beginning to record lots of bibliographic information for “lesser works”. What can we use this treasure trove for? Here are a few of my ideas:

  1. Wikidata may be used as a substitute for a reference manager. I record my own bibliographic information in a big BibTeX file and use the bibtex program together with latex when I generate a scientific document with references. It might very well be that the job of the BibTeX file with bibliographic information may be taken over by Wikidata. So far we have, to my knowledge, no proper program for extracting the data in Wikidata and formatting it for inclusion in a document. I have begun a “wibtex” program for this, and only reached 44 lines so far, so it remains to be seen whether this is a viable avenue, i.e., whether the structure of Wikidata is good and convenient enough to record data for formatting references, or whether Wikidata is too flexible or too restricted for this kind of application. A sketch of the basic idea is shown after this list.
  2. Wikidata may be used for “lists of publications” of individual researchers, institutions, research groups and sponsors. Nowadays, I keep a list of publications on a webpage, in a latex document and on Google Scholar. My university has a separate list, and sometimes when I write a research application I need to format the data for inclusion in a Microsoft Word document. A flexible program on top of Wikidata could make dynamic lists of publications.
  3. Wikidata may be used to count citations. During the Wikicite 2016 Berlin meeting I suggested the P2860 property and Tobias quickly created it. P2860 allows us to describe citations between items in Wikidata. Though we managed to use the property a bit for scientific articles during the meeting, it has really been James Hare who has been running with the ball. Based on public citation data he has added hundreds of thousands of citations. At the moment this is of course only a very small part of the total number of citations: there are probably tens of millions of scientific papers, each with tens, if not hundreds, of citations, so with the 499,750 citations that James Hare reported on 11 September 2016, we are still far from covering the field. James Hare tweeted that Web of Science claims to have over 1 milliard (billion) citations. The citation counts may be compared with a whole range of context data in Wikidata: author, affiliated institution, journal, year of publication, gender of author and sponsor (funding agency), so we can get, e.g., the most cited Dane (or one affiliated with a Danish institution), the most cited woman with an image, etc. A sketch of such a citation-counting query is also shown after this list.
  4. Wikidata may be used as a hub for information sources. Individual scientific articles may point to further resources, such as raw or result data. I myself have, for instance, added links to the neuroinformatics databases OpenfMRI, NeuroVault and Neurosynth, and Wikidata now records all papers recorded in OpenfMRI, as far as I can determine. Wikidata is then able to list, say, all OpenfMRI papers or all OpenfMRI authors with Magnus Manske’s Listeria tool.
  5. Wikicite information in Wikidata may be used to support claims in Wikidata itself. As Dario Taraborelli points out, this would allow queries like “all statements citing journal articles by physicists at Oxford University in the 1970s”.
  6. Wikidata may be used for other scientometric analyses than counting, e.g., generation of coauthor graphs and cocitation graphs giving context to an author or paper. The bubble chart above shows statistics for the journals of the papers in OpenfMRI, generated with the standard Wikidata Query Service bubble chart visualization tool.
  7. Wikidata could be used for citations in Wikipedia. This may very well be problematic, as a large Wikipedia article can have hundreds of references, and each reference would need to be fetched from Wikidata, generating lots of traffic. I tried a single citation on the “OpenfMRI” article (it has since been changed). Some form of inclusion of Wikidata identifiers in Wikipedia references could further Wikipedia bibliometrics, e.g., determining the most cited author across all Wikipedias.
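
For item 1, here is a minimal sketch of what a wibtex-like extraction could look like. It is not the actual wibtex code: the item identifier is a placeholder, only three bibliographic fields are fetched, and it uses the sparql-client package as in the PageRank example further down the page.

import sparql  # sparql-client package

item = 'Q21559670'  # placeholder: replace with the paper's Wikidata identifier

# P1476: title, P1433: published in, P577: publication date
statement = """
select ?title ?journalLabel ?date where {
  wd:%s wdt:P1476 ?title ;
        wdt:P1433 ?journal ;
        wdt:P577 ?date .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
""" % item

service = sparql.Service('https://query.wikidata.org/sparql')
rows = service.query(statement).fetchall()
title, journal, date = [term.value for term in rows[0]]

# Format the fetched fields as a BibTeX entry (authors omitted for brevity)
print("@article{%s," % item)
print("  title   = {%s}," % title)
print("  journal = {%s}," % journal)
print("  year    = {%s}" % date[:4])
print("}")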
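
For the citation counting in item 3, the kind of query I have in mind can be sketched as below: it lists the most cited authors overall, and restrictions on nationality, gender, images, etc. could be added with further triples. Again only a sketch, not a polished tool.

import sparql  # sparql-client package

# Count incoming P2860 ("cites") statements per P50 author
statement = """
select ?author ?authorLabel (count(?citer) as ?citations) where {
  ?citer wdt:P2860 ?paper .
  ?paper wdt:P50 ?author .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
group by ?author ?authorLabel
order by desc(?citations)
limit 10
"""

service = sparql.Service('https://query.wikidata.org/sparql')
for author, label, count in service.query(statement).fetchall():
    print('%s %s' % (count.value, label.value))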

Neuroinformatics coauthor network – so far


[Figure: neuroinformatics coauthor network, 2016-06-28]

Screenshot of the neuroinformatics coauthor network – so far. Only the big cluster is shown. The network was generated with Jonas Kress’ default setup querying WDQS.
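
The coauthor graph can also be pulled down programmatically. Below is a small sketch (not Jonas Kress’ setup): it simply treats any two P50 authors sharing a work as coauthors, without the restriction to neuroinformatics, and loads the pairs into NetworkX.

import networkx as nx
import sparql  # sparql-client package

# Two authors are coauthors if they are both P50 authors of the same work
statement = """
select ?author1Label ?author2Label where {
  ?work wdt:P50 ?author1, ?author2 .
  filter (?author1 != ?author2)
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
} limit 2000
"""

service = sparql.Service('https://query.wikidata.org/sparql')
g = nx.Graph()
g.add_edges_from((row[0].value, row[1].value)
    for row in service.query(statement).fetchall())
print('%d nodes, %d edges' % (g.number_of_nodes(), g.number_of_edges()))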

PageRank of scientific papers with citations in Wikidata – so far


A citation property was created just a few hours ago, – and as of writing it has still not been deleted. It means that we can describe citation networks, e.g., among scientific papers.

So far we have added a few citations, – mostly from papers about Zika. And now we can plot the citation network or compute network measures such as PageRank.

Below is a Python program doing just that with the sparql-client, Pandas and NetworkX packages:

import networkx as nx
import sparql  # sparql-client package
from pandas import DataFrame

statement = """
select ?source ?sourceLabel ?target ?targetLabel where {
  ?source wdt:P2860 ?target .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
} 
"""

# Query the Wikidata Query Service SPARQL endpoint
service = sparql.Service('https://query.wikidata.org/sparql')
response = service.query(statement)
df = DataFrame(response.fetchall(),
    columns=response.variables)

# Make sure the labels are unicode strings (this is Python 2 code)
df.sourceLabel = df.sourceLabel.astype(unicode)
df.targetLabel = df.targetLabel.astype(unicode)

# Build a directed citation graph from the source/target label pairs
g = nx.DiGraph()
g.add_edges_from(((row.sourceLabel, row.targetLabel)
    for n, row in df.iterrows()))

# Compute PageRank and sort the papers by decreasing rank
pr = nx.pagerank(g)
sorted_pageranks = sorted((rank, title)
    for title, rank in pr.items())[::-1]

# Print the ten highest-ranked papers with titles truncated to 40 characters
for rank, title in sorted_pageranks[:10]:
    print("{:.4} {}".format(rank, title[:40]))

The result:

0.02647 Genetic and serologic properties of Zika
0.02479 READemption-a tool for the computational
0.02479 Intrauterine West Nile virus: ocular and
0.02479 Internet encyclopaedias go head to head
0.02479 A juvenile early hominin skeleton from D
0.01798 Quantitative real-time PCR detection of 
0.01755 Zika virus. I. Isolations and serologica
0.01755 Genetic characterization of Zika virus s
0.0175 Potential sexual transmission of Zika vi
0.01745 Zika virus in Gabon (Central Africa)--20

Altmetrics for a department


Suppose you want to measure the performance of individual researchers of a university department. Which variables can you get hold of, and how relevant would they be for measuring academic performance?

Here is my take on it:

  1. Google Scholar citation numbers. Google Scholar records the total number of citations, h-index and i10-index, as well as the same numbers for a fixed recent period.
  2. Scopus citation numbers.
  3. Twitter. The number of tweets and the number of followers would be relevant.
    One issue here is that the number of tweets may not be relevant to academic performance, and it is also susceptible to manipulation. Interestingly, there has been a comparison between Twitter numbers and standard citation counts, with a ratio between the two numbers named the Kardashian index.
  4. Wikidata and Wikipedia presence. Whether Wikidata has an item for the researcher, the number of Wikipedia articles on the researcher, the number of bytes they span, and the number of the researcher’s articles recorded in Wikidata. There is an API to get these numbers (see the sketch after this list), and – interestingly – Wikidata can record a range of other identifiers for Google Scholar, Scopus, Twitter, etc., which would make it a convenient open database for keeping track of researcher identifiers across sites of scientometric relevance.
    The number of citations in Wikipedia to the work of a researcher would be interesting to have, but is somewhat more difficult to obtain automatically.
    The Wikipedia and Wikidata numbers are a bit manipulable.
  5. Stackoverflow/Stackexchange points in relevant areas. The question/answering sites under the Stackexchange umbrella include a range of sites that are of academic interest, in my area, e.g., Stackoverflow and Cross Validated.
  6. GitHub repositories and stars.
  7. Publication download counts. For instance, my department has a repository with papers, and the backend keeps track of statistics. The most downloaded papers tend to be introductory material and overviews.
  8. ResearchGate numbers: Publications, reads, citations and impact points.
  9. ResearcherID (Thomson Reuters) numbers: total articles in the publication list, articles with citation data, sum of times cited, average citations per article, h-index.
  10. Microsoft Academic Search numbers.
  11. Count in the dblp computer science bibliography (the Trier database).
  12. Count of listings in ArXiv.
  13. Counts in Semantic Scholar.
  14. ACM digital library counts.
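
For item 4, the sketch below shows the kind of API calls involved: counting the sitelinks of a researcher’s Wikidata item and reading off the byte size of the English Wikipedia article. The item identifier is a placeholder.

import requests

item = 'Q20980928'  # placeholder: the researcher's Wikidata identifier

# Count the Wikipedia (and sister-site) articles via the item's sitelinks
data = requests.get('https://www.wikidata.org/w/api.php',
                    params={'action': 'wbgetentities', 'ids': item,
                            'props': 'sitelinks', 'format': 'json'}).json()
sitelinks = data['entities'][item]['sitelinks']
print('%d articles' % len(sitelinks))

# Byte size of the English Wikipedia article from the latest revision
if 'enwiki' in sitelinks:
    data = requests.get('https://en.wikipedia.org/w/api.php',
                        params={'action': 'query', 'prop': 'revisions',
                                'rvprop': 'size', 'format': 'json',
                                'titles': sitelinks['enwiki']['title']}).json()
    page = list(data['query']['pages'].values())[0]
    print('%d bytes' % page['revisions'][0]['size'])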


The Brain Number Doctor’s Big Idea


Mr. XKCD, Randall Munroe, has just written about Einstein and his theories of relativity using only the ten hundred most used words.

A few weeks ago I submitted a research application where the popular science part of it was written with the ten hundred most used words. It is an interesting exercise to formulate your research without all the jargon and buzzwords.

The English version is here:

Imagine numbers from all studies presented in books and papers put into a computer and carefully put up so everyone from all over the world quickly can see them. So everyone can add new numbers by themselves. So everyone can show the numbers against each other to see if they agree or not between studies. We will study ways to make this possible. We start from brain studies and take out numbers from brain study papers and put them into a computer store. We need to find a way to do this fast and do it exactly. We need to decide on a way to handle the numbers in the computer so taking numbers in and out of the store is easy and the form is easy for others to understand. We need to find a way to make the computer understand whether the studies agree or not. And a way to find what is the cause when they do not agree. And finally we need to find a way to show whether the numbers agree or not to people from the whole world sitting at their computers.

It was constructed with http://splasho.com/upgoer5/

My Danish translation was:

Forstil dig tal fra alle studier fra artikler lagt ind i en computer, og omhyggeligt lagt op så alle fra hele verden hurtigt kan se dem. Så alle selv kan komme med nye tal. Så alle kan vise tallene op mod hinanden og se om de stemmer overens mellem studierne. Vi vil undersøge måder for at gøre dette muligt. Vi begynder med hjernestudier og tager tal fra artikler om hjernen og lægger dem ind i computeren. Vi vil undersøge hvordan man kan gøre det hurtigt og præcist. Det er nødvendigt at bestemme en måde at håndtere tallene i computeren på, så det er nemt både at få tallene ind og ud. Det er nødvendigt at finde en måde så computeren forstår om studierne stemmer overens eller ikke, og hvad årsagen er hvis de ikke gør. Til sidst vil vi finde en måde at vise om tallene stemme overens så alle folk fra hele verden kan se det fra deres computer.

Strategies of Legitimacy Through Social Media: The Networked Strategy


Several years ago we started a research project, Responsible Business in the Blogosphere, together with, among others, members of the Corporate Social Responsibility (CSR) group at the Copenhagen Business School (CBS). The research project looked at social media, companies and their corporate social responsibility. The start of the project coincided with the ascent of Twitter, and a number of our research publications from the project considered data analysis of Twitter messages. Among them were my A new ANEW: Evaluation of a word list for sentiment analysis in microblogs, about the development and evaluation of my sentiment analysis word list AFINN, and Good Friends, Bad News – Affect and Virality in Twitter, with an analysis of information diffusion on Twitter, that is, retweets.

Strategies of Legitimacy Through Social Media: The Networked Strategy is our latest published work in the project. It describes a pharmaceutical company adopting Twitter for communication of CSR-related topics. It is a longitudinal case study with interviews of the people behind the company Twitter account and data mining of tweets. Itziar Castelló, Michael Etter and I authored the paper.

While I participated in neither the interviews nor the interesting analysis of that information, I did a sentiment analysis and topic mining of the tweets that we collected from the company Twitter account and by searching for the company name via the Twitter search API. The results are displayed in Table I and Figure 2 of the paper.
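
The word-list-based sentiment scoring itself is simple. Below is a minimal sketch with the afinn Python package – an illustration with made-up example tweets, not the actual pipeline used for the paper.

from afinn import Afinn

# Score each tweet with the AFINN word list; positive scores indicate
# positive sentiment, negative scores negative sentiment
afinn = Afinn()
tweets = [
    'Great new CSR initiative!',       # hypothetical example tweets
    'Disappointed with the response',
]
for tweet in tweets:
    print('%+.1f %s' % (afinn.score(tweet), tweet))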

A note from the paper that I find interesting comments on the issues faced by the company as it developed its social media approach:

“… the institutional orientation to hierarchical processes requiring approval for all forms of external communication; and the establishment of fixed working hours that ended at 4pm local time coexisting alongside a policy that customer complaints must be resolved within 48 hours, which prevented SED managers from conducting real-time conversations over the Twitter platform.”

Our paper argues for “a new, networked legitimacy strategy” for stakeholder engagement in social media with “nonhierarchical, non-regulated participatory relationships”.

Strategies of Legitimacy Through Social Media: The Networked Strategy is available gratis in September 2015.

Review of Val McDermid’s “Forensics: The anatomy of crime”


Val McDermid, apparently an author of some standing as a writer of untrue crime novels, has written a true crime walkthrough of forensics topics, interweaving real-life cases and comments. The fine selection of topics has no overall progressive narrative, to such an extent that most of the chapters could have been permuted without loss of coherence. If there is a basis for the book, it is a fascination and awe for modern forensics. She is a good writer. Perhaps her crime novels have trained her in writing clear prose. She does not delve into academic technicalities that could perhaps have been interesting.

She has based her book on other books as well as a good number of interviews with a broad range of forensics experts. A few of these come from the University of Dundee: forensic chemist Niamh Nic Daéid and forensic anthropologist Sue Black.

I find McDermid’s view of the fallibility of forensics balanced, drawing forth cases where presumed experts lack self-critique. Bernard Spilsbury and the U.S. ballistics expert Thomas Quirk are criticized. For Roy Meadow, McDermid presents aspects of the tragic Sally Clark case that I do not recall having read before: the appeal was not prompted by Meadow’s evidence but by the pathologist Alan Williams, who had failed to disclose blood test results. I do sometimes find that popular science writing lacks an appropriate level of critique of the material. McDermid is one of the better writers, but I do find one case where she oversteps the confidence we should have in science. Here is what she writes on page 164: “We already know, for instance about the existence of a ‘warrior gene’ – present mainly in men – which is linked with violent and impulsive behaviour under stress”. When I read “We know” I get mad, and when I read ‘warrior gene’ I get extra mad. Behavioral genetics is a mess full of red herrings. A recent meta-analysis of the warrior gene polymorphism MAOA-uVNTR and antisocial behavior (“Candidate Genes for Aggression and Antisocial Behavior: A Meta-analysis of Association Studies of the 5HTTLPR and MAOA-uVNTR”) reaches a 95% confidence interval of 0.98–1.32 while, interestingly, having a very low p-value (0.00000137). The strangeness of the difference between the confidence interval and the p-value is discussed in the paper and presently goes over my head. What seems reasonably certain is the large between-study heterogeneity. Any talk of a warrior gene needs to acknowledge the uncertainty.

There are certainly more elements to forensics than McDermid presents. A Danish newspaper has recently run a story about cell phone tower records used in courtroom cases. A person carrying a powered cell phone reveals his/her location, – but only with a certain exactness. Cell phones may not necessarily select the nearest cell tower. From my own experience I know that my cell phone can select cell towers in other countries from where I am located: my cell phone in Nordsjælland in Denmark can easily select a cell tower in Sweden 15 to 20 kilometers or more away, and my cell phone in Romania switched to a Ukrainian cell tower perhaps 20 kilometers or more away. The U.S. state of Oregon has seen the case of Lisa Marie Roberts, who on her bad lawyer’s advice pleaded guilty in 2004 because of critically important cell tower evidence. In 2013 she was freed.

I was struck by one of the stories presented, which originates from the book of the criminal lawyer Alex McBride. A surveillance camera records a case of apparently straightforward violence, but McBride is able to get his client off by threatening to use another part of the camera recording, which shows a policeman mishandling a person in a case of wrongful arrest. The prosecution dropped the charge in the original case. It does not seem fair to the victim of the original crime that the criminal can go free just because another crime was committed. To me it looks like a kind of corruption and extortion.

(Review also available on LibraryThing)