
Journalist af karsken bælg: En bog om Lise Nørgaards journalistik by John Chr. Jørgensen


Good level and lively language from an academic gentleman who has his background in order as an expert on female journalists, writing about the national treasure, Lise in the golden sandals, and the lady's less frequently highlighted side: from her time as an apprentice up to the star writer's years in Pilestræde, with the extraordinary right to an uncomputerized typewriter. Jørgensen positions her as a bourgeois individualist with plenty of backbone, an ironic distance and a renewer of journalistic genres.

On pages 38-39 we get a taste of the language artist's abilities: her very first editorial, now 81 years and a couple of weeks old, from 4 January 1937 in Roskilde Dagblad. The occasion was foreign policy entanglements around a royal wedding between a Dutch princess and a German prince, and here it says about the Nazi government that:

“Den har følt, at noget maatte der gøres, og da det ikke var muligt at faa en Finger med i Spillet i selve Holland, vedtoges det at fratage tre tyske Prinsesser af ædleste Blod, der skulde være Brudepiger ved Formælingen af den formastelige Prins og Prinsessen i det Land, hvor en Fodboldkamp med Tyskland kunde foregaa under andet Flag end det med Hagekorset, deres pas. Naa, da Ilterheden havde lagt sig og en Kurér fra den indeklemte Prins havde været hos Hitler, maatte man fra Naziside være blevet klar over, at saadan skulde det alligevel ikke gribes an.” (Roughly: “It has felt that something had to be done, and as it was not possible to get a finger in the pie in the Netherlands itself, it was decided to take the passports of three German princesses of the noblest blood, who were to be bridesmaids at the wedding of the presumptuous Prince and the Princess in the country where a football match against Germany could take place under a flag other than the one with the swastika. Well, once the irascibility had settled and a courier from the beleaguered Prince had been to see Hitler, the Nazi side must have realized that this, after all, was not the way to go about it.”) An inserted double subordinate clause with alliterations, and then the “deres pas” and the “Naa”!


“En Frygtelig Kvinde” and gender


“En frygtelig kvinde” is a recently premiered Danish film. On this blog I have previously considered how male and female reviewers view a film differently: in the case of the Klown movie, there seemed to be a slight tendency for female reviewers to be less enthusiastic.

“En frygtelig kvinde” portrays a man and a woman as they fall in love and move in together. Keeping in mind the title, “A terrible woman”, might male film reviewers grade it higher than female reviewers? Below is a small sample – by no means complete – of published film reviews from assorted venues. Danish grades typically go from 1 to 6.

Grade  Gender       Venue            Reviewer
4      Female/Male  Berlingske       Sarah-Iben Almbjerg, Kristian Lindberg
4      Male         BT               Michael Lind
5      Male         Ekko             Claus Christensen
4      Male         Ekstra Bladet    Henrik Queitsch
4      Male         Filmland P1      Per Juul Carlsen
5      Male         Politiken        Kim Skotte
5      Male         Soundvenue       Rasmus Friis
4      Male         Moovy            Elliot Peter Torres
5      Female       Den korte Avis   Lone Nørgaard
2      Male         CinemaZone       Daniel Skov Madsen
1      Male         Filmz            Morten Vejlgaard Just
4      Female       Jyllands-Posten  Katrine Sommer Boysen
?      Female       Information      Lone Nikolajsen (ungraded, but fairly negative – perhaps translatable to a “3”)

There are too few reviews to draw any firm conclusions. A notable issue is the two very negative reviews, both from male reviewers.
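As a back-of-the-envelope summary of the table (counting the shared Berlingske review for both genders and imputing the ungraded Information review as a 3 – both of these are my own assumptions), the mean grade per gender can be computed:

```python
# Grades from the table above; the joint Berlingske review is counted
# for both genders, and the Information review is imputed as a 3.
male_grades = [4, 4, 5, 4, 4, 5, 5, 4, 2, 1]
female_grades = [4, 5, 4, 3]

male_mean = sum(male_grades) / len(male_grades)
female_mean = sum(female_grades) / len(female_grades)

print("Male mean: {:.2f}".format(male_mean))      # Male mean: 3.80
print("Female mean: {:.2f}".format(female_mean))  # Female mean: 4.00
```

So, if anything, the male reviewers in this small sample grade the film slightly lower – but with so few reviews the difference means little.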

A few samples: while Anne-Grethe Bjarup Riis finds it very funny (“skidesjov” and “pissesjov”), the male Filmz reviewer sees it as “a misogynist crappy movie” (“en kvindefjendsk lortefilm”). Two fourth-wave female feminists have opposite views: “a fantastic movie” vs. “really disappointed”.

Even the woman in the movie generates opposite views. The actress, Amanda Collin, is generally praised, but while the character is “faceted” according to POV International, Louise Kjølsen finds it stereotypical. Lone Nikolajsen characterizes the two main characters as “two well-known sex role clichés”.

According to Ekko, director Christian Tafdrup’s previous film sold only 1,603 tickets(!) but was generally praised and received several awards. “En frygtelig kvinde” was produced for just 4 million Danish kroner, and the theater was packed when I viewed it.

Code for love: algorithmic dating


One of the innovative Danish TV channels, DR3, has a history of dating programs, with Gift ved første blik as, I believe, the initial program: a program with – literally – an arranged marriage between two participants matched by what were supposed to be relationship experts. Exported internationally as Married at First Sight, the stability of the marriages has been low, as very few of the couples have stayed together – if one is to trust the information on the English Wikipedia.

Now my colleagues at DTU Compute have been involved in a new program called Koden til kærlighed (“the code for love”). Unlike Gift ved første blik, the participants are not going to get married during the program, but will live together for a month, and – perhaps the most interesting part – the matches are determined by a learning algorithm: if you view the streamed version of the first episode you will have the delight of seeing glimpses of data-mining Python code with Numpy (note the intermixed camel case and underscores :).

The program seems to have been filmed mostly with smartphone cameras. The participants are four couples of white heterosexual millennials. So far we have seen their expectations and initial encounters, so we are far from knowing whether my colleagues have done a good job with the algorithmic matching.

According to the program, the producers and the Technical University of Denmark have collected information from 1,400 persons in “well-functioning” relationships. There must have been couples among the 1,400, so the data scientists can train the algorithm using real couples as positive examples and persons that are not couples as negative examples. The 350 new singles signed up for the program can then be matched with the trained algorithm, and four couples of – I suppose – the top-ranking matches were selected for the program.
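The program gives no details of the actual model, so the following is only a minimal sketch of such a setup, where the synthetic data and the logistic-regression scorer are entirely my own assumptions: couples are positive examples, random pairings negative ones, and new persons can then be scored pairwise with the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_couples = 104, 667

# Synthetic stand-in data: real couples answer similarly (positive
# examples), randomly paired persons do not (negative examples).
a = rng.integers(0, 5, (n_couples, n_questions)).astype(float)
b = np.clip(a + rng.integers(-1, 2, (n_couples, n_questions)), 0, 4)
ra = rng.integers(0, 5, (n_couples, n_questions)).astype(float)
rb = rng.integers(0, 5, (n_couples, n_questions)).astype(float)

# One feature per question: the absolute answer difference.
X = np.vstack([np.abs(a - b), np.abs(ra - rb)])
y = np.concatenate([np.ones(n_couples), np.zeros(n_couples)])

# A tiny logistic regression fitted with plain gradient descent.
w, bias = np.zeros(n_questions), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + bias)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    bias -= 0.1 * np.mean(p - y)

def match_score(person1, person2):
    """Probability-like score that two answer vectors form a good match."""
    z = np.abs(person1 - person2) @ w + bias
    return 1 / (1 + np.exp(-z))
```

The 350 singles could then be scored against each other with match_score and the top-ranking pairs selected.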

Our professor Jan Larsen was involved in the program and explained a bit more about the setup on the radio. The collected input was based on responses to 104 questions from 667 couples (1,334 persons, so apparently not quite the stated 1,400). Important questions may have been related to sleep and education.

It will be interesting to follow the development of the couples. There are 8 episodes in this season. It would have been nice with more technical background: What are the questions? How exactly is a match determined? How is the importance of the questions determined? Have the producers done any “editing” of the relationships? (For instance, why are all participants in the age range 20-25 years?) And when people match, how should their answers to a question relate: are the answers homophilic or heterophilic? During the program there are glimpses of questions that might have been used. Some examples are “Do you have a TV set?”, “Which supermarket do you use?” and “How many relationships have you ended?” It is questionable whether a question such as “Do you have a TV set?” is of any use, and 667 couples against 104 questions is not that much data to train a model on; one should think that less relevant questions could confuse the algorithm more than they would help.

“Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.”


From Peter Brodersen I hear that the budget of the Danish government for next year allocates funds to Dansk Sprognævn for the release of Retskrivningsordbogen – the official Danish spelling dictionary.

It is mentioned briefly in an announcement from the Ministry of Culture: “Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.” (“And from 2018 onwards, 0.5 million DKK is earmarked for Dansk Sprognævn to free Retskrivningsordbogen.”): 500,000 DKK allocated for the release of the dataset.

It is not clear under which conditions it will be released. An announcement from Dansk Sprognævn says “til sprogteknologiske formål” (for natural language processing purposes). I trust it is not just for natural language processing purposes, but for every purpose!?

If it is to be used in free software/databases, then a CC0 or more permissive license would be a good idea. We are still waiting for Wikidata for Wiktionary, the as-yet-vaporware multilingual, collaborative and structured dictionary. This resource will be CC0-based. The “old” Wiktionary has surprisingly not been used that much by natural language processing researchers, perhaps because of the anarchistic structure of Wiktionary. Wikidata for Wiktionary could hopefully help us with structuring lexical data and improve the size and the utility of lexical information. With Retskrivningsordbogen as CC0 it could be imported into Wikidata for Wiktionary and extended with multilingual links and semantic markup.

The problem with Andreas Krause


I first seem to have run into the name “Andreas Krause” in connection with NIPS 2017. Statistics with the Wikidata Query Service show “Andreas Krause” to be one of the most prolific authors at that particular conference.

But who is “Andreas Krause”?

Google Scholar lists five “Andreas Krause”: an ETH Zürich machine learning researcher, a pharmacometrics researcher, a wood researcher, an economics/networks researcher working from Bath, and a Dresden-based battery/nano researcher. All the NIPS Krause works should likely be attributed to the machine learning researcher, and a read of the works reveals the affiliation to be ETH Zürich.

An ORCID search reveals six “Andreas Krause”. Three of them have no or almost no further information beyond the name and the ORCID identifier.

There is an Immanuel Krankenhaus Berlin rheumatologist who does not seem to be in Google Scholar.

There may even be more than these six “Andreas Krause”. For instance, the article Emotional Exhaustion and Job Satisfaction in Airport Security Officers – Work–Family Conflict as Mediator in the Job Demands–Resources Model has the affiliation “School of Applied Psychology, University of Applied Sciences and Arts Northwestern Switzerland, Olten, Switzerland”; thus topic and affiliation do not quite fit any of the previously mentioned “Andreas Krause”.

One interesting ambiguity is Classification of rheumatoid joint inflammation based on laser imaging – obviously a rheumatology work, but one that also has machine learning aspects. The computer scientist/machine learner Volker Tresp is a co-author, and the work is published in an IEEE venue. There is no affiliation for the “Andreas Krause” on the paper. It is likely the work of the rheumatologist, but you could also guess the machine learner.
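As a toy illustration of how one might try to resolve such a case automatically (the keyword profiles below are invented for the example, not taken from any real bibliography), one can score the ambiguous title against each candidate's topical vocabulary:

```python
# Invented keyword profiles for two of the "Andreas Krause" candidates.
profiles = {
    'machine learning researcher': {'classification', 'learning',
                                    'kernel', 'imaging'},
    'rheumatologist': {'rheumatoid', 'joint', 'inflammation', 'imaging'},
}

title = 'Classification of rheumatoid joint inflammation based on laser imaging'
words = set(title.lower().split())

# Count overlapping keywords per candidate.
scores = {author: len(words & keywords)
          for author, keywords in profiles.items()}
print(scores)
```

Both candidates get a non-zero score, which is exactly why the paper is ambiguous; in this toy setup the rheumatologist wins, matching the guess above.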

Yet another ambiguity is Biomarker-guided clinical development of the first-in-class anti-inflammatory FPR2/ALX agonist ACT-389949. The topic overlaps somewhat between pharmacometrics and the domain of the Berlin researcher. The affiliation is “Clinical Pharmacology, Actelion”, but interestingly, Google Scholar does not associate this paper with the pharmacometrics researcher.

In conclusion, author disambiguation may be very difficult.

Scholia can show the six Andreas Krause, but I am not sure that helps us very much.

Text mining in SQL?


Are you able to do text mining in SQL?

One problem with SQL is the incompatible function interfaces. If you read Jonathan Gennick’s book SQL Pocket Guide you note that a lot of effort goes into explaining the differences between SQL engines. This is particularly the case for string functions. A length function may be called LEN or LENGTH, except when it is called LENGTHB, LENGTH2 or LENGTH4. String concatenation may be done by “||” or perhaps CONCAT or “+”. Substring extraction is done with SUBSTRING, except when it is SUBSTR. And so on. In conclusion, writing general text mining SQL that works across SQL engines seems difficult, unless you invent a meta-SQL language and an associated compiler.
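As a small illustration, here is the same one-line expression in SQLite's dialect, with the SQL Server and MySQL spellings shown only as comments (I have executed only the SQLite variant):

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# SQLite: LENGTH and the || concatenation operator.
cursor.execute("SELECT LENGTH('text' || ' ' || 'mining')")
print(cursor.fetchone()[0])  # 11

# The same expression in other dialects (not run here):
#   SQL Server:  SELECT LEN('text' + ' ' + 'mining')
#   MySQL:       SELECT LENGTH(CONCAT('text', ' ', 'mining'))
```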

But apart from the problem of SQL incompatibility, what kind of text mining can be done in SQL? I haven’t run into text mining in SQL before, but I see that here and there you will find some attempts, e.g., Ralph Winters shows tf-idf scaling in his slides Practical Text Mining with SQL using Relational Databases.

Below I will try a very simple word-list-based sentiment analysis in SQL. I will use two Danish datasets: “lcc-sentiment”, my derivation of a Danish text in the Leipzig Corpora Collection, and my sentiment word list “AFINN”, available in the afinn Python package.

Let’s first get some text data into a SQLite database. We use a comma-separated values file from my lcc-sentiment GitHub repository.

import pandas as pd
import sqlite3

url = ""

csv_data = pd.read_csv(url, encoding='utf-8')
# Fix column name error: strip stray whitespace from the first column name
csv_data = csv_data.rename(columns={
    csv_data.columns[0]: csv_data.columns[0].strip()})

with sqlite3.connect('text.db') as connection:
    csv_data.to_sql('text', connection)

We will also put the AFINN word list into a table.

url = ""
afinn_data = pd.read_csv(url, encoding='utf-8', sep='\t',
                         names=['word', 'sentiment'])

with sqlite3.connect('text.db') as connection:
    afinn_data.to_sql('afinn', connection)

Here we have used the Pandas Python package, which very neatly downloads and reads comma- or tab-separated values files and adds the data to a SQL table in very few lines of code.

With the sqlite3 command-line program we can view the first 3 rows in the “text” table constructed from the lcc-sentiment data:

sqlite> SELECT * FROM text LIMIT 3;
0|1|0.0|09:05 DR2 Morgen - med Camilla Thorning og Morten Schnell-Lauritzen Nyheder med overblik, baggrund og udsyn.
1|2|2.0|09-10 sæson Spa Francorchamps S2000 Vinter Cup 10. januar 2011, af Ian Andersen Kvalifikation: Det gik nogenlunde, men bilen føltes god.
2|3|0.0|½ time og pensl dem derefter med et sammenpisket æg eller kaffe.

For word tokenization we would like to avoid special characters around word tokens. We can clean the text this way:

SELECT REPLACE(REPLACE(REPLACE(LOWER(text), '.', ' '), ',', ' '), ':', ' ') FROM text;

Now the text is lowercased and mostly separated by spaces. It would have been nice with some form of regular expression substitution here – we definitely lack a cleaning of some other special characters. As far as I understand, regular expressions for a replace function are not readily available in SQLite, but on Stackoverflow there is a pointer from Vishal Tyagi to an extension implementing the functionality.
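If the queries are issued from Python anyway, one workaround is the sqlite3 module's create_function, which makes a Python function available inside SQL – here a regular-expression cleaner that I call clean (the function name and the character set are my own choices):

```python
import re
import sqlite3

def clean(text):
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r'[^a-z0-9æøå]+', ' ', text.lower())

connection = sqlite3.connect(':memory:')
# Register clean() as a one-argument SQL function named "clean".
connection.create_function('clean', 1, clean)

row = connection.execute(
    "SELECT clean('Nyheder med overblik, baggrund og udsyn.')").fetchone()
print(row[0])
```

This sidesteps the nested REPLACE chains at the cost of tying the SQL to a Python host process.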

Splitting the text is somewhat more complicated and beyond my SQL capabilities. Other developers have run into the problem and suggested solutions on Stackoverflow. They tend to use what Wikipedia calls recursive common table expressions. There is a further complication because we not only need to split a single string, but to “iterate” over all rows with texts.

One Stackoverflow user who had the problem was sigalor, and s/he managed to construct a solution. The code below copies and edits sigalor’s code: it changes column and table names, splits on spaces rather than on commas, and includes the text cleaning with REPLACE (remember that Stackoverflow code is CC BY-SA 3.0). sigalor initially constructs another table with the maximum text number.

From the returned data I construct a “docterm” table.

-- create a table which buffers the maximum text ID, because
-- SELECT MAX can take a very long time on huge databases
CREATE TABLE max_text_id (id INTEGER);
INSERT INTO max_text_id VALUES ((SELECT MAX(number) FROM text));

CREATE TABLE docterm AS
WITH RECURSIVE split(text_id, word, str, offsep) AS (
    VALUES (0, '', '', 0)
    UNION ALL
    SELECT
        CASE WHEN offsep == 0 OR str IS NULL   -- done with one text: advance
            THEN text_id + 1
            ELSE text_id
        END,
        CASE WHEN offsep == 0 OR str IS NULL   -- next word in the text
            THEN ''
            ELSE substr(str, 0,
                CASE WHEN instr(str, ' ')
                    THEN instr(str, ' ')
                    ELSE length(str) + 1
                END)
        END,
        CASE WHEN offsep == 0 OR str IS NULL   -- remaining (cleaned) string
            THEN (SELECT REPLACE(REPLACE(REPLACE(REPLACE(
                      lower(text), '.', ' '), ',', ' '), ':', ' '), '!', ' ')
                  FROM text WHERE number = text_id + 1)
            ELSE ltrim(substr(str, instr(str, ' ')), ' ')
        END,
        CASE WHEN offsep == 0 OR str IS NULL   -- offset of next separator
            THEN 1
            ELSE instr(str, ' ')
        END
    FROM split
    WHERE text_id <= (SELECT id FROM max_text_id)
) SELECT text_id, word FROM split WHERE word != '';

I can’t say it ain’t pretty, but we now have a docterm table.

I haven’t checked, but I seriously suspect that there are compatibility issues, e.g., with the INSTR function. In Microsoft SQL Server I suppose you would need CHARINDEX instead, with a reordering of the input arguments to that function. SUBSTR/SUBSTRING is another problem.

We can take a look at the generated docterm table with

sqlite> SELECT * FROM docterm LIMIT 10 OFFSET 10; 



We can now make a sentiment computation

SELECT
  text_id, SUM(afinn.sentiment) AS sentiment, text.text
FROM docterm, afinn, text
WHERE docterm.word = afinn.word AND docterm.text_id = text.number
GROUP BY text_id
ORDER BY sentiment

Note that this is vanilla SQL.

I suspect there could be performance issues, as there is no index on the word columns.
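A possible remedy, sketched here on an in-memory database with empty stand-in tables (for the real data one would connect to 'text.db' instead), is to create indexes on the word columns used in the join:

```python
import sqlite3

# Stand-in schema so the sketch is self-contained; the real docterm
# and afinn tables already exist in 'text.db'.
connection = sqlite3.connect(':memory:')
connection.executescript("""
    CREATE TABLE docterm (text_id INTEGER, word TEXT);
    CREATE TABLE afinn (word TEXT, sentiment INTEGER);
""")

# Indexes on the word columns let SQLite look up join partners
# instead of scanning the full tables for every row.
connection.execute('CREATE INDEX idx_docterm_word ON docterm(word)')
connection.execute('CREATE INDEX idx_afinn_word ON afinn(word)')
```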

The SQL results in

2892|-11|Du maa have lært, at du ikke formaar at bære livets trængsler og sorger, men at uden Kristus kan du kun synke sammen under byrden i bekymringer og klager og mismod.
2665|-10|Det vil sige med familier, hvor der fx var vold i familien, misbrug, seksuelle overgreb, arbejdsløshed eller mangel på en ordentlig bolig.
360|-9|Angst: To tredjedele helbredes for panikangst inden for otte år, en tredjedel helbredes for social angst og kun en lav andel for generaliseret angst.
5309|-9|I sværere tilfælde mærkes pludselig jagende smerter i musklen (”delvis muskelbristning”, ”fibersprængning”) og i værste fald mærkes et voldsom smæld, hvorefter det er umuligt at bruge musklen (”total muskelbristning”).
7760|-9|Organisationen er kommet i besiddelse af tre videoer, der viser egyptiske fanger, som tilsyneladende er blevet tortureret og derefter dræbt.
8031|-9|Problemer med stress og dårligt psykisk arbejdsmiljø er som oftest et problem for flere faggrupper.
8987|-9|Tabuer blev brudt med opfordringen til afslutningen af Mubaraks styre, og med det eksplicitte krav om at sætte politiets generaler på anklagebænken for tortur og ulovlige arrestationer.
9299|-9|Udbedring af skader som følge af forkert rygning er dyr og besværlig, da rygningen er svært tilgængelig – og ofte kræver stillads før reparationer kan sættes i gang.
9477|-9|Ved at udføre deres angreb på israelske jøder som en del af det større mål at dræbe jøderne, som angivet i Hamas Pagten krænker mange af de palæstinensiske terrorister også Konventionen om at Undgå og Straffe Folkedrab.
1205|-8|Den 12. juli indledte den 21. panserdivision et modangreb mod Trig 33 og Point 24, som blev slået tilbage efter 2½ times kamp, med mere end 600 tyske døde og sårede efterladt strøet ud over området foran de australske stillinger.

This list shows the texts scored to have the most negative sentiment. In the above ten examples there seem to be no opinions; most are rather descriptive. They mostly describe various “bad” situations: diseases, families with violence, etc.

If we change to ORDER BY sentiment DESC then we get the texts estimated to be the most positive.

3957|17|God webshop - gode priser og super hurtig forsendelse :) Super god kunde betjening, vi vil klart bestille her fra igen næste gang og anbefaler stedet til andre.
8200|14|ROSE RBS 19 glasses set - KAUFTIPP MountainBIKE 5/2014 ProCycling 04/2014: ProCycling har testet ROSE XEON X-LITE 7000 Konklusion: ROSE cyklen har det hele - den er stabil, super let og giver en hurtig styring.
8610|13|Silkeborg mad ud af huset til en god anledning Til enhver festlig komsammen i Silkeborg området hører festlige gæster, god stemning og god mad sig til.
5356|12|Ja, man ved aldrig hvordan de der glimmermist sprayer, men jeg kan også godt lide tilfældigheden i det. laver nærmest aldrig et LO uden glimmermist :) Og så ELSKER jeg bare Echo Park - det er det FEDESTE mærke Et super dejligt lo.
411|11|Århus-borgmester Nikolaj Vammen er begejstret for VIA University College, TEKOs indtog i Århus: ”Det er en fantastisk god nyhed, at TEKO nu lancerer sine kreative uddannelser inden for mode- og livsstilsbranchen i Århus.
3066|11|Elsker Zoo og er så heldig at jeg bor lige ved siden af, så ungerne nyder rigtig godt at især børneZoo :o) Jeg elsker alle jeres søde kommentarer, og forsøger så vidt muligt at svare på hver enkelt men det tager måske lidt tid.
5483|11|Jeg har bestilt ny cykel, som jeg først får i uge 19. Derfor Håber at I får en fantastisk, sjov og lærerig dag Regn er bare naturen, der sveder tilbage på os Administratoren har deaktiveret offentlig skrive adgang.
8180|11|Rigtig fint alternativ til en helt almindelig pailletjakke med de meget fine mønstre :-) Er sikker på, at du nok skal style den godt – og jeg ser frem til at få et kig med ;-) 21. august 2013 at 14:27 Tak, det er skønt at høre!
9918|11|Vi vil hermed sige tak for en rigtig god aften, med en god musik og god DJ.
84|10|34 kommentarer Everything Se indlæg isabellathordsen I sidste uge købte jeg disse fantastiske solbriller og denne smukke smukke russiske ring.

The first one is a very positive (apparent) review of a webshop (“Good webshop – good prices and super quick delivery …”), the second praise of a bike. Most – if not all – of the 10 texts seem to be positive.

Text mining may not be particularly convenient in SQL, but it might be a “possible” option if there are problems with interfacing to languages that are more suitable for text mining.


Female GitHubbers


In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There are not many of them – currently just 27.

The Python code below gets the SPARQL results into a Python Pandas DataFrame, queries the GitHub API for the follower count of each user and adds the information to a dataframe column. Then we can rank the female GitHub users according to follower count and format the results in an HTML table.



import re
import requests
import pandas as pd

query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
response = requests.get("",
                        params={'query': query, 'format': 'json'})
researchers = pd.io.json.json_normalize(
    response.json()['results']['bindings'])

URL = ""
followers = []
for github in researchers['github.value']:
    if not re.match('^[a-zA-Z0-9]+$', github):
        # Skip malformed user names rather than querying the API with them
        followers.append(0)
        continue
    url = URL + github
    response = requests.get(url)
    try:
        user_followers = response.json()['followers']
    except KeyError:
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, user_followers))

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(by='followers',
                              ascending=False)[columns].to_html(index=False))


The top one is Jennifer Bryan, a Vancouver statistician, whom I do not know much about, but she seems to be involved in RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat has apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wiki world: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

I was surprised to see that Isis Agora Lovecruft is not there, but there is no Wikidata item representing her. She would have been number three.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners”. Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers github.value researcherLabel.value researcher.value
1675 jennybc Jennifer Bryan
1299 jesstess Jessica McKellar
475 triketora Tracy Chou
347 olgabot Olga B. Botvinnik
124 vsoch Vanessa V. Sochat
84 brainwane Sumana Harihareswara
75 lydiapintscher Lydia Pintscher
56 agbeltran Alejandra González-Beltrán
22 frimelle Lucie-Aimée Kaffee
21 isabelleaugenstein Isabelle Augenstein
20 cnap Courtney Napoles
15 tudorache Tania Tudorache
13 vedina Nina Jeliazkova
11 mkutmon Martina Summer-Kutmon
7 caoyler Catalina Wilmers
7 esterpantaleo Ester Pantaleo
6 NuriaQueralt Núria Queralt Rosinach
2 rongwangnu Rong Wang
2 lschiff Lisa Schiff
1 SigridK Sigrid Klerke
1 amrapalijz Amrapali Zaveri
1 mesbahs Sepideh Mesbah
1 ChristineChichester Christine Chichester
1 BinaryStars Shima Dastgheib
1 mollymking Molly M. King
0 jannahastings Janna Hastings
0 nmjakobsen Nina Munkholt Jakobsen