Month: December 2017

“Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.”


From Peter Brodersen I hear that the Danish government's budget for next year allocates funds to Dansk Sprognævn for the release of Retskrivningsordbogen, the official Danish spelling dictionary.

It is mentioned briefly in an announcement from the Ministry of Culture: “Og så er der fra 2018 og frem øremærket 0,5 mio. kr. til Dansk Sprognævn til at frikøbe Retskrivningsordbogen.” (“And from 2018 onwards, 0.5 million DKK is earmarked for Dansk Sprognævn to free Retskrivningsordbogen.”): 500,000 DKK allocated for the release of the dataset.

It is not clear under which conditions it will be released. An announcement from Dansk Sprognævn says “til sprogteknologiske formål” (for natural language processing purposes). I trust it is not just for natural language processing purposes, but for every purpose!?

If it is to be used in free software/databases, then a CC0 or better license is a good idea. We are still waiting for Wikidata for Wiktionary, the as-yet vaporware multilingual, collaborative and structured dictionary. This resource is CC0-based. The “old” Wiktionary has surprisingly not been used that much by natural language processing researchers, perhaps because of the anarchic structure of Wiktionary. Wikidata for Wiktionary could hopefully help us with structuring lexical data and improve the size and the utility of lexical information. With Retskrivningsordbogen as CC0, it could be imported into Wikidata for Wiktionary and extended with multilingual links and semantic markup.

The problem with Andreas Krause


I first seem to have run into the name “Andreas Krause” in connection with NIPS 2017. Statistics computed with the Wikidata Query Service show “Andreas Krause” to be one of the most prolific authors for that particular conference.

But who is “Andreas Krause”?

Google Scholar lists five “Andreas Krause”s: an ETH Zürich machine learning researcher, a pharmacometrics researcher, a wood researcher, an economish networkish researcher working from Bath, and a Dresden-based battery/nano researcher. All the NIPS works by Krause should likely be attributed to the machine learning researcher, and a read of the works reveals the affiliation to be ETH Zürich.

An ORCID search reveals six “Andreas Krause”s. Three of them have no or almost no further information beyond the name and the ORCID identifier.

There is an Immanuel Krankenhaus Berlin rheumatologist who does not seem to be in Google Scholar.

There may even be more than these six “Andreas Krause”s. For instance, the article Emotional Exhaustion and Job Satisfaction in Airport Security Officers – Work–Family Conflict as Mediator in the Job Demands–Resources Model is affiliated with “School of Applied Psychology, University of Applied Sciences and Arts Northwestern Switzerland, Olten, Switzerland”, thus topic and affiliation do not quite fit any of the previously mentioned “Andreas Krause”s.

One interesting ambiguity is for Classification of rheumatoid joint inflammation based on laser imaging – which obviously is a rheumatology work but also has some machine learning aspects. The computer scientist/machine learner Volker Tresp is a co-author and the work is published in an IEEE venue. There is no affiliation for the “Andreas Krause” on the paper. It is likely the work of the rheumatologist, but one could also guess the machine learner.

Yet another ambiguity is Biomarker-guided clinical development of the first-in-class anti-inflammatory FPR2/ALX agonist ACT-389949. The topic overlaps somewhat between pharmacometrics and the domain of the Berlin researcher. The affiliation is to “Clinical Pharmacology, Actelion”, but interestingly, Google Scholar does not associate this paper with the pharmacometrics researcher.

In conclusion, author disambiguation may be very difficult.

Scholia can show the six Andreas Krauses. But I am not sure that helps us very much.

Text mining in SQL?


Are you able to do text mining in SQL?

One problem with SQL is the incompatible functional interfaces. If you read Jonathan Gennick’s book SQL Pocket Guide you note that a lot of effort goes into explaining the differences between SQL engines. This is particularly the case for string functions. A length function may be called LEN or LENGTH, except when it is called LENGTHB, LENGTH2 or LENGTH4. String concatenation may be done by “||” or perhaps CONCAT or “+”. Substring extraction is with SUBSTRING except when it is SUBSTR. And so on. In conclusion, writing general text mining SQL that works across SQL engines seems difficult, unless you invent a meta-SQL language and an associated compiler.
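As a small illustration of one dialect’s choices, SQLite (which I will use below) happens to use LENGTH and the standard “||” operator; in Microsoft SQL Server the same two expressions would, as far as I recall, be written with LEN and “+”. A quick check from Python:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
# SQLite uses LENGTH (not LEN) and the standard '||' concatenation operator;
# other engines spell these operations differently.
length, concatenated = connection.execute(
    "SELECT LENGTH('tekst'), 'tekst' || 'mining'").fetchone()
print(length, concatenated)  # 5 tekstmining
```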

But apart from the problem of SQL incompatibility, what kind of text mining can be done in SQL? I haven’t run into text mining in SQL before, but I see that here and there you will find some attempts, e.g., Ralph Winters shows tf-idf scaling in his slides Practical Text Mining with SQL using Relational Databases.

Below I will try a very simple word list-based sentiment analysis in SQL. I will use two Danish datasets: “lcc-sentiment” that is my derivation of a Danish text in the Leipzig Corpora Collection and my sentiment word list “AFINN” available in the afinn Python package.

Let’s first get some text data into a SQLite database. We use a comma-separated values file from my lcc-sentiment GitHub repository.

import pandas as pd
import sqlite3

url = ""

csv_data = pd.read_csv(url, encoding='utf-8')
# Fix column name error: strip stray whitespace from the first column name
csv_data = csv_data.rename(columns={
    csv_data.columns[0]: csv_data.columns[0].strip()})

with sqlite3.connect('text.db') as connection:
    csv_data.to_sql('text', connection)

We will also put an AFINN word list into a table

url = ""
afinn_data = pd.read_csv(url, encoding='utf-8', sep='\t',
                         names=['word', 'sentiment'])

with sqlite3.connect('text.db') as connection:
    afinn_data.to_sql('afinn', connection)

Here we have used the pandas Python package, which very neatly downloads and reads comma- or tab-separated values files and adds the data to a SQL table in very few lines of code.

With the sqlite3 program we can view the first 3 rows in the “text” table constructed from the lcc-sentiment data

sqlite> SELECT * FROM text LIMIT 3;
0|1|0.0|09:05 DR2 Morgen - med Camilla Thorning og Morten Schnell-Lauritzen Nyheder med overblik, baggrund og udsyn.
1|2|2.0|09-10 sæson Spa Francorchamps S2000 Vinter Cup 10. januar 2011, af Ian Andersen Kvalifikation: Det gik nogenlunde, men bilen føltes god.
2|3|0.0|½ time og pensl dem derefter med et sammenpisket æg eller kaffe.

For word tokenization we would like to avoid special characters around word tokens. We can clean the text in this way

SELECT REPLACE(REPLACE(REPLACE(LOWER(text), '.', ' '), ',', ' '), ':', ' ') FROM text;

Now the text is lowercased and mostly separated by spaces. It would have been nice with some form of regular expression substitution here; we definitely lack a cleaning of some other special characters. As far as I understand, a regular expression replace function is not readily available in SQLite, but on Stackoverflow there is a pointer from Vishal Tyagi to an extension implementing the functionality.
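Another workaround, when the database is accessed from Python anyway, is that the sqlite3 module lets you register a user-defined SQL function, so a regular-expression replace can be plugged in without any extension. A minimal sketch (the function name regex_replace is my own choice):

```python
import re
import sqlite3

connection = sqlite3.connect(':memory:')
# Register a three-argument SQL function backed by Python's re.sub
connection.create_function(
    'regex_replace', 3,
    lambda pattern, repl, string: re.sub(pattern, repl, string))
cleaned = connection.execute(
    "SELECT regex_replace('[^a-zæøå ]', ' ', LOWER('Hej, verden!'))"
).fetchone()[0]
print(cleaned)  # 'hej  verden '
```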

Splitting the text is somewhat more complicated and beyond my SQL capabilities. Other developers have run into the problem and suggested solutions on Stackoverflow. They tend to use what Wikipedia calls recursive common table expressions. There is a further complication because we do not only need to split a single string, but “iterate” over all rows with texts.

One Stackoverflow user who had the problem was sigalor, and s/he managed to construct a solution. The code below copies and edits sigalor’s code: it changes column and table names, makes splits on spaces rather than on commas, and includes the text cleaning with REPLACE (remember Stackoverflow code is CC BY-SA 3.0). sigalor initially constructs another table with the maximum text ID.

From the returned data I construct a “docterm” table.

-- create temporary table which buffers the maximum text ID, because
-- SELECT MAX can take a very long time on huge databases
CREATE TEMPORARY TABLE max_text_id (num INTEGER);
INSERT INTO max_text_id VALUES((SELECT MAX(number) FROM text));

CREATE TABLE docterm AS
WITH RECURSIVE split(text_id, word, str, offsep) AS (
    VALUES ( 0, '', '', 0 )
    UNION ALL
    SELECT
        CASE WHEN offsep==0 OR str IS NULL
            THEN text_id+1
            ELSE text_id
        END,
        CASE WHEN offsep==0 OR str IS NULL
            THEN ''
            ELSE substr(str, 0,
              CASE WHEN instr(str, ' ')
                  THEN instr(str, ' ')
                  ELSE length(str)+1
              END)
        END,
        CASE WHEN offsep==0 OR str IS NULL
            THEN (SELECT REPLACE(REPLACE(REPLACE(REPLACE(
                      lower(text), '.', ' '), ',', ' '), ':', ' '), '!', ' ')
                  FROM text WHERE number=text_id+1)
            ELSE ltrim(substr(str, instr(str, ' ')), ' ')
        END,
        CASE WHEN offsep==0 OR str IS NULL
            THEN 1
            ELSE instr(str, ' ')
        END
    FROM split
    WHERE text_id<=(SELECT * FROM max_text_id)
)
SELECT text_id, word FROM split WHERE word != '';

I can’t say it ain’t pretty, but we now got a docterm table.

I haven’t checked, but I seriously suspect that there are compatibility issues, e.g., with the INSTR function. In Microsoft SQL Server I suppose you would need CHARINDEX instead, with a reordering of the input arguments to that function. SUBSTR/SUBSTRING is another problem.
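For INSTR the difference is, as far as I know, just the argument order. A quick check of SQLite’s variant (the SQL Server spelling in the comment is from memory):

```python
import sqlite3

connection = sqlite3.connect(':memory:')
# SQLite: INSTR(string, substring) gives the 1-based position, 0 if absent.
# SQL Server would instead be CHARINDEX(substring, string).
position = connection.execute(
    "SELECT INSTR('docterm', 'term')").fetchone()[0]
print(position)  # 4
```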

We can take a look at the generated docterm table with

sqlite> SELECT * FROM docterm LIMIT 10 OFFSET 10; 



We can now make a sentiment computation

SELECT
  text_id, SUM(afinn.sentiment) AS sentiment, text.text
FROM docterm, afinn, text
WHERE docterm.word = afinn.word AND docterm.text_id = text.number
GROUP BY text_id
ORDER BY sentiment;

Note that this is vanilla SQL.
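To see that the join behaves as intended, here is a self-contained toy version with invented words and sentiment scores (the table layouts mirror the ones built above):

```python
import sqlite3

connection = sqlite3.connect(':memory:')
# Toy data: text 1 contains a positive word twice, text 2 only neutral words
connection.executescript("""
    CREATE TABLE text (number INTEGER, text TEXT);
    CREATE TABLE docterm (text_id INTEGER, word TEXT);
    CREATE TABLE afinn (word TEXT, sentiment INTEGER);
    INSERT INTO text VALUES (1, 'god god dag'), (2, 'ingen ord her');
    INSERT INTO docterm VALUES (1, 'god'), (1, 'god'), (1, 'dag'),
                               (2, 'ingen'), (2, 'ord'), (2, 'her');
    INSERT INTO afinn VALUES ('god', 3), ('elendig', -3);
""")
rows = connection.execute("""
    SELECT text_id, SUM(afinn.sentiment) AS sentiment, text.text
    FROM docterm, afinn, text
    WHERE docterm.word = afinn.word AND docterm.text_id = text.number
    GROUP BY text_id
    ORDER BY sentiment
""").fetchall()
print(rows)  # [(1, 6, 'god god dag')] - text 2 has no sentiment-bearing words
```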

I suspect there could be performance issues as there is no index on the word columns.
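An ordinary index on one of the word columns should let SQLite avoid a full scan in the inner loop of the join; a sketch with toy tables (the index name is my own invention):

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.executescript("""
    CREATE TABLE docterm (text_id INTEGER, word TEXT);
    CREATE TABLE afinn (word TEXT, sentiment INTEGER);
    -- Index the column used for the join lookup
    CREATE INDEX afinn_word_index ON afinn(word);
""")
# Ask the query planner how it would execute the sentiment join
plan = ' '.join(
    row[-1] for row in connection.execute("""
        EXPLAIN QUERY PLAN
        SELECT SUM(afinn.sentiment) FROM docterm, afinn
        WHERE docterm.word = afinn.word GROUP BY docterm.text_id
    """))
print(plan)  # the plan should mention a search using afinn_word_index
```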

The SQL results in

2892|-11|Du maa have lært, at du ikke formaar at bære livets trængsler og sorger, men at uden Kristus kan du kun synke sammen under byrden i bekymringer og klager og mismod.
2665|-10|Det vil sige med familier, hvor der fx var vold i familien, misbrug, seksuelle overgreb, arbejdsløshed eller mangel på en ordentlig bolig.
360|-9|Angst: To tredjedele helbredes for panikangst inden for otte år, en tredjedel helbredes for social angst og kun en lav andel for generaliseret angst.
5309|-9|I sværere tilfælde mærkes pludselig jagende smerter i musklen (”delvis muskelbristning”, ”fibersprængning”) og i værste fald mærkes et voldsom smæld, hvorefter det er umuligt at bruge musklen (”total muskelbristning”).
7760|-9|Organisationen er kommet i besiddelse af tre videoer, der viser egyptiske fanger, som tilsyneladende er blevet tortureret og derefter dræbt.
8031|-9|Problemer med stress og dårligt psykisk arbejdsmiljø er som oftest et problem for flere faggrupper.
8987|-9|Tabuer blev brudt med opfordringen til afslutningen af Mubaraks styre, og med det eksplicitte krav om at sætte politiets generaler på anklagebænken for tortur og ulovlige arrestationer.
9299|-9|Udbedring af skader som følge af forkert rygning er dyr og besværlig, da rygningen er svært tilgængelig – og ofte kræver stillads før reparationer kan sættes i gang.
9477|-9|Ved at udføre deres angreb på israelske jøder som en del af det større mål at dræbe jøderne, som angivet i Hamas Pagten krænker mange af de palæstinensiske terrorister også Konventionen om at Undgå og Straffe Folkedrab.
1205|-8|Den 12. juli indledte den 21. panserdivision et modangreb mod Trig 33 og Point 24, som blev slået tilbage efter 2½ times kamp, med mere end 600 tyske døde og sårede efterladt strøet ud over området foran de australske stillinger.

This list shows the texts scored to have the most negative sentiment. In the above ten examples there seem to be no opinions; most are rather descriptive. They mostly describe various “bad” situations: diseases, families with violence, etc.

If we change to ORDER BY sentiment DESC then we get the texts estimated to be the most positive.

3957|17|God webshop - gode priser og super hurtig forsendelse :) Super god kunde betjening, vi vil klart bestille her fra igen næste gang og anbefaler stedet til andre.
8200|14|ROSE RBS 19 glasses set - KAUFTIPP MountainBIKE 5/2014 ProCycling 04/2014: ProCycling har testet ROSE XEON X-LITE 7000 Konklusion: ROSE cyklen har det hele - den er stabil, super let og giver en hurtig styring.
8610|13|Silkeborg mad ud af huset til en god anledning Til enhver festlig komsammen i Silkeborg området hører festlige gæster, god stemning og god mad sig til.
5356|12|Ja, man ved aldrig hvordan de der glimmermist sprayer, men jeg kan også godt lide tilfældigheden i det. laver nærmest aldrig et LO uden glimmermist :) Og så ELSKER jeg bare Echo Park - det er det FEDESTE mærke Et super dejligt lo.
411|11|Århus-borgmester Nikolaj Vammen er begejstret for VIA University College, TEKOs indtog i Århus: ”Det er en fantastisk god nyhed, at TEKO nu lancerer sine kreative uddannelser inden for mode- og livsstilsbranchen i Århus.
3066|11|Elsker Zoo og er så heldig at jeg bor lige ved siden af, så ungerne nyder rigtig godt at især børneZoo :o) Jeg elsker alle jeres søde kommentarer, og forsøger så vidt muligt at svare på hver enkelt men det tager måske lidt tid.
5483|11|Jeg har bestilt ny cykel, som jeg først får i uge 19. Derfor Håber at I får en fantastisk, sjov og lærerig dag Regn er bare naturen, der sveder tilbage på os Administratoren har deaktiveret offentlig skrive adgang.
8180|11|Rigtig fint alternativ til en helt almindelig pailletjakke med de meget fine mønstre :-) Er sikker på, at du nok skal style den godt – og jeg ser frem til at få et kig med ;-) 21. august 2013 at 14:27 Tak, det er skønt at høre!
9918|11|Vi vil hermed sige tak for en rigtig god aften, med en god musik og god DJ.
84|10|34 kommentarer Everything Se indlæg isabellathordsen I sidste uge købte jeg disse fantastiske solbriller og denne smukke smukke russiske ring.

The first one is a very positive (apparent) review of a webshop (“Good webshop – good prices and super quick delivery …”), the second praise of a bike. Most – if not all – of the 10 texts seem to be positive.

Text mining may not be particularly convenient in SQL, but it might be a “possible” option if there are problems with interfacing to languages that are more suitable for text mining.