
Text mining in SQL?

Posted on

Are you able to do text mining in SQL?

One problem with SQL is the incompatible functional interfaces. If you read Jonathan Gennick’s book SQL Pocket Guide you will note that a lot of effort goes into explaining the differences between SQL engines. This is particularly the case for string functions. A length function may be called LEN or LENGTH, except when it is called LENGTHB, LENGTH2 or LENGTH4. String concatenation may be done with “||”, or perhaps CONCAT or “+”. Substring extraction is done with SUBSTRING, except when it is SUBSTR. And so on. In conclusion, writing general text mining SQL that works across SQL engines seems difficult, unless you invent a meta-SQL language and an associated compiler.

But apart from the problem of SQL incompatibility, what kind of text mining can be done in SQL? I haven’t run into text mining in SQL before, but here and there you will find some attempts, e.g., Ralph Winters shows tf-idf scaling in his slides Practical Text Mining with SQL using Relational Databases.

Below I will try a very simple word-list-based sentiment analysis in SQL. I will use two Danish datasets: “lcc-sentiment”, which is my derivation of a Danish text in the Leipzig Corpora Collection, and my sentiment word list “AFINN”, available in the afinn Python package.

Let’s first get some text data into a SQLite database. We use a comma-separated values file from my lcc-sentiment GitHub repository:

import pandas as pd
import sqlite3

url = "https://raw.githubusercontent.com/fnielsen/lcc-sentiment/master/dan_mixed_2014_10K-sentences.csv"

csv_data = pd.read_csv(url, encoding='utf-8')
# Fix column name error (strip stray whitespace in the first column name)
csv_data = csv_data.rename(columns={
    csv_data.columns[0]: csv_data.columns[0].strip()})

with sqlite3.connect('text.db') as connection:
    csv_data.to_sql('text', connection)

We will also put an AFINN word list into a table:

url = "https://raw.githubusercontent.com/fnielsen/afinn/master/afinn/data/AFINN-da-32.txt"
afinn_data = pd.read_csv(url, encoding='utf-8', sep='\t',
                         names=['word', 'sentiment'])

with sqlite3.connect('text.db') as connection:
    afinn_data.to_sql('afinn', connection)

Here we have used the Pandas Python package, which very neatly downloads and reads comma- or tab-separated values files and adds the data to a SQL table in very few lines of code.

With the sqlite3 command-line program we can view the first 3 rows in the “text” table constructed from the lcc-sentiment data

sqlite> SELECT * FROM text LIMIT 3;
0|1|0.0|09:05 DR2 Morgen - med Camilla Thorning og Morten Schnell-Lauritzen Nyheder med overblik, baggrund og udsyn.
1|2|2.0|09-10 sæson Spa Francorchamps S2000 Vinter Cup 10. januar 2011, af Ian Andersen Kvalifikation: Det gik nogenlunde, men bilen føltes god.
2|3|0.0|½ time og pensl dem derefter med et sammenpisket æg eller kaffe.

For word tokenization we would like to avoid special characters around word tokens. We can clean the text in this way

SELECT REPLACE(REPLACE(REPLACE(LOWER(text), '.', ' '), ',', ' '), ':', ' ') FROM text;

Now the text is lowercased and mostly separated by spaces. Some form of regular expression substitution would have been nice here; we definitely lack cleaning of some other special characters. As far as I understand, regular expression replacement is not readily available in SQLite, but on Stack Overflow there is a pointer from Vishal Tyagi to an extension implementing the functionality.
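
When driving SQLite from Python, an alternative is to register a user-defined SQL function on the connection with sqlite3’s create_function. A minimal sketch, where the function name regexp_replace and its argument order are just my choice:

import re
import sqlite3

def regexp_replace(pattern, replacement, string):
    # Python-side regular expression substitution made callable from SQL
    return re.sub(pattern, replacement, string)

with sqlite3.connect('text.db') as connection:
    # The function is only available on this particular connection
    connection.create_function('regexp_replace', 3, regexp_replace)
    for (cleaned,) in connection.execute(
            "SELECT regexp_replace('[^a-zæøå0-9 ]', ' ', LOWER(text)) FROM text LIMIT 3"):
        print(cleaned)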

Splitting the text is somewhat more complicated and beyond my SQL capabilities. Other developers have run into the problem and suggested solutions on Stack Overflow. They tend to use what Wikipedia calls recursive common table expressions. There is a further complication because we do not only need to split a single string, but also to “iterate” over all rows of text.

One Stack Overflow user who had the problem was sigalor, and s/he managed to construct a solution. The code below copies and edits sigalor’s code: it changes column and table names, splits on spaces rather than on commas, and includes the text cleaning with REPLACE (remember that Stack Overflow code is CC BY-SA 3.0). sigalor initially constructs another table holding the maximum text ID.

From the returned data I construct a “docterm” table.

-- create a temporary table which buffers the maximum text ID, because SELECT MAX can take a very long time on huge databases
DROP TABLE IF EXISTS max_text_id;
CREATE TEMP TABLE max_text_id(num INTEGER);
INSERT INTO max_text_id VALUES((SELECT MAX(number) FROM text));

DROP TABLE IF EXISTS docterm;
CREATE TABLE docterm AS 
WITH RECURSIVE split(text_id, word, str, offsep) AS
(
    VALUES ( 0, '', '', 0 )
    UNION ALL
    SELECT
        CASE WHEN offsep==0 OR str IS NULL
            THEN text_id+1 
            ELSE text_id
        END,
        CASE WHEN offsep==0 OR str IS NULL
            THEN ''
            ELSE substr(str, 0,
              CASE WHEN instr(str, ' ')
                  THEN instr(str, ' ')
                  ELSE length(str)+1
              END)
        END,
        CASE WHEN offsep==0 OR str IS NULL
            THEN (SELECT
                    REPLACE(
                    REPLACE(
                    REPLACE(
                    REPLACE(lower(text), '.', ' '), ',', ' '), ':', ' '), '!', ' ')
                  FROM text WHERE number=text_id+1)
            ELSE ltrim(substr(str, instr(str, ' ')), ' ')
        END,
        CASE WHEN offsep==0 OR str IS NULL
            THEN 1                    
            ELSE instr(str, ' ')      
        END
        FROM split
        WHERE text_id<=(SELECT * FROM max_text_id)
) SELECT text_id, word FROM split WHERE word != ''; 

I can’t say it’s pretty, but we now have a docterm table.

I haven’t checked, but I seriously suspect that there are compatibility issues, e.g., with the INSTR function. In Microsoft SQL Server I suppose you would need CHARINDEX instead, along with a reordering of the input arguments to that function. SUBSTR/SUBSTRING is another problem.

We can take a look at the generated docterm table with

sqlite> SELECT * FROM docterm LIMIT 10 OFFSET 10; 

yielding

1|schnell-lauritzen
1|nyheder
1|med
1|overblik
1|baggrund
1|og
1|udsyn
2|09-10
2|sæson
2|spa

We can now make a sentiment computation

SELECT
  text_id, SUM(afinn.sentiment) AS sentiment, text.text
FROM docterm, afinn, text
WHERE docterm.word = afinn.word AND docterm.text_id = text.number
GROUP BY text_id
ORDER BY sentiment
LIMIT 10;

Note that this is more or less vanilla SQL, although a strict engine would require text.text to appear in the GROUP BY clause.
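
Since the tables were created from Python, the same query can also be run from there, e.g., with Pandas. A small sketch, reusing the text.db file from above:

import pandas as pd
import sqlite3

query = """
SELECT text_id, SUM(afinn.sentiment) AS sentiment, text.text
FROM docterm, afinn, text
WHERE docterm.word = afinn.word AND docterm.text_id = text.number
GROUP BY text_id
ORDER BY sentiment
LIMIT 10;
"""
with sqlite3.connect('text.db') as connection:
    scored = pd.read_sql_query(query, connection)
print(scored)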

I suspect there could be performance issues as there is no index on the word columns.
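
A couple of indexes on those columns should presumably help. A minimal sketch, again on the text.db file from above (the index names are arbitrary):

import sqlite3

with sqlite3.connect('text.db') as connection:
    # Index the word columns used in the join between docterm and afinn
    connection.execute('CREATE INDEX IF NOT EXISTS docterm_word_idx ON docterm(word)')
    connection.execute('CREATE INDEX IF NOT EXISTS afinn_word_idx ON afinn(word)')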

The SQL query results in

2892|-11|Du maa have lært, at du ikke formaar at bære livets trængsler og sorger, men at uden Kristus kan du kun synke sammen under byrden i bekymringer og klager og mismod.
2665|-10|Det vil sige med familier, hvor der fx var vold i familien, misbrug, seksuelle overgreb, arbejdsløshed eller mangel på en ordentlig bolig.
360|-9|Angst: To tredjedele helbredes for panikangst inden for otte år, en tredjedel helbredes for social angst og kun en lav andel for generaliseret angst.
5309|-9|I sværere tilfælde mærkes pludselig jagende smerter i musklen (”delvis muskelbristning”, ”fibersprængning”) og i værste fald mærkes et voldsom smæld, hvorefter det er umuligt at bruge musklen (”total muskelbristning”).
7760|-9|Organisationen er kommet i besiddelse af tre videoer, der viser egyptiske fanger, som tilsyneladende er blevet tortureret og derefter dræbt.
8031|-9|Problemer med stress og dårligt psykisk arbejdsmiljø er som oftest et problem for flere faggrupper.
8987|-9|Tabuer blev brudt med opfordringen til afslutningen af Mubaraks styre, og med det eksplicitte krav om at sætte politiets generaler på anklagebænken for tortur og ulovlige arrestationer.
9299|-9|Udbedring af skader som følge af forkert rygning er dyr og besværlig, da rygningen er svært tilgængelig – og ofte kræver stillads før reparationer kan sættes i gang.
9477|-9|Ved at udføre deres angreb på israelske jøder som en del af det større mål at dræbe jøderne, som angivet i Hamas Pagten krænker mange af de palæstinensiske terrorister også Konventionen om at Undgå og Straffe Folkedrab.
1205|-8|Den 12. juli indledte den 21. panserdivision et modangreb mod Trig 33 og Point 24, som blev slået tilbage efter 2½ times kamp, med mere end 600 tyske døde og sårede efterladt strøet ud over området foran de australske stillinger.

This list shows the texts scored as having the most negative sentiment. In the above ten examples, there seem to be no opinions and most are rather descriptive. They mostly describe various “bad” situations: diseases, families with violence, etc.

If we change to ORDER BY sentiment DESC, we instead get the texts estimated to be the most positive:

3957|17|God webshop - gode priser og super hurtig forsendelse :) Super god kunde betjening, vi vil klart bestille her fra igen næste gang og anbefaler stedet til andre.
8200|14|ROSE RBS 19 glasses set - KAUFTIPP MountainBIKE 5/2014 ProCycling 04/2014: ProCycling har testet ROSE XEON X-LITE 7000 Konklusion: ROSE cyklen har det hele - den er stabil, super let og giver en hurtig styring.
8610|13|Silkeborg mad ud af huset til en god anledning Til enhver festlig komsammen i Silkeborg området hører festlige gæster, god stemning og god mad sig til.
5356|12|Ja, man ved aldrig hvordan de der glimmermist sprayer, men jeg kan også godt lide tilfældigheden i det. laver nærmest aldrig et LO uden glimmermist :) Og så ELSKER jeg bare Echo Park - det er det FEDESTE mærke Et super dejligt lo.
411|11|Århus-borgmester Nikolaj Vammen er begejstret for VIA University College, TEKOs indtog i Århus: ”Det er en fantastisk god nyhed, at TEKO nu lancerer sine kreative uddannelser inden for mode- og livsstilsbranchen i Århus.
3066|11|Elsker Zoo og er så heldig at jeg bor lige ved siden af, så ungerne nyder rigtig godt at især børneZoo :o) Jeg elsker alle jeres søde kommentarer, og forsøger så vidt muligt at svare på hver enkelt men det tager måske lidt tid.
5483|11|Jeg har bestilt ny cykel, som jeg først får i uge 19. Derfor Håber at I får en fantastisk, sjov og lærerig dag Regn er bare naturen, der sveder tilbage på os Administratoren har deaktiveret offentlig skrive adgang.
8180|11|Rigtig fint alternativ til en helt almindelig pailletjakke med de meget fine mønstre :-) Er sikker på, at du nok skal style den godt – og jeg ser frem til at få et kig med ;-) 21. august 2013 at 14:27 Tak, det er skønt at høre!
9918|11|Vi vil hermed sige tak for en rigtig god aften, med en god musik og god DJ.
84|10|34 kommentarer Everything Se indlæg isabellathordsen I sidste uge købte jeg disse fantastiske solbriller og denne smukke smukke russiske ring.

The first one is a very positive (apparent) review of a webshop (“Good web shop – good prices and super quick delivery …”), the second is praise of a bike. Most – if not all – of the 10 texts seem to be positive.

Text mining may not be particularly convenient in SQL, but it might be a “possible” option if there are problems with interfacing to languages that are more suitable for text mining.

 


Female GitHubbers

Posted on Updated on

In Wikidata, we can record the GitHub user name with the P2037 property. As we typically also have the gender of the person, we can make a SPARQL query that yields all female GitHub users recorded in Wikidata. There are not many of them: currently just 27.

The Python code below gets the SPARQL results into a Python Pandas DataFrame, queries the GitHub API for the follower count of each user, and adds the information to a dataframe column. Then we can rank the female GitHub users according to follower count and format the result as an HTML table.

Code

 

import re
from time import sleep  # needed for the pause between GitHub API calls

import requests
import pandas as pd

query = """
SELECT ?researcher ?researcherLabel ?github ?github_url WHERE {
  ?researcher wdt:P21 wd:Q6581072 .
  ?researcher wdt:P2037 ?github .
  BIND(URI(CONCAT("https://github.com/", ?github)) AS ?github_url)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
# Flatten the SPARQL JSON result (pd.io.json.json_normalize in older Pandas)
researchers = pd.json_normalize(response.json()['results']['bindings'])

URL = "https://api.github.com/users/"
followers = []
for github in researchers['github.value']:
    # Skip (count as zero) user names with characters outside a simple alphanumeric pattern
    if not re.match('^[a-zA-Z0-9]+$', github):
        followers.append(0)
        continue
    url = URL + github
    try:
        response = requests.get(url,
                                headers={'Accept':'application/vnd.github.v3+json'})
        user_followers = response.json()['followers']
    except: 
        user_followers = 0
    followers.append(user_followers)
    print("{} {}".format(github, followers))
    sleep(5)

researchers['followers'] = followers

columns = ['followers', 'github.value', 'researcherLabel.value',
           'researcher.value']
print(researchers.sort_values(by='followers', ascending=False)[columns].to_html(index=False))

Results

The top one is Jennifer Bryan, a Vancouver statistician whom I do not know much about, but she seems to be involved in RStudio.

Number two, Jessica McKellar, is a well-known figure in the Python community. Numbers four and five, Olga Botvinnik and Vanessa Sochat, are a bioinformatician and a neuroinformatician, respectively (or were: Sochat has apparently left the Poldrack lab in 2016 according to her CV). Further down the list we have people from the wikiworld: Sumana Harihareswara, Lydia Pintscher and Lucie-Aimée Kaffee.

I was surprised to see that Isis Agora Lovecruft is not there, but there is no Wikidata item representing her. She would have been number three.

Jennifer Bryan and Vanessa Sochat are almost “all-greeners”. Sochat has just a single non-green day.

I suppose the Wikidata GitHub information is far from complete, so this analysis is quite limited.

followers github.value researcherLabel.value researcher.value
1675 jennybc Jennifer Bryan http://www.wikidata.org/entity/Q40579104
1299 jesstess Jessica McKellar http://www.wikidata.org/entity/Q19667922
475 triketora Tracy Chou http://www.wikidata.org/entity/Q24238925
347 olgabot Olga B. Botvinnik http://www.wikidata.org/entity/Q44163048
124 vsoch Vanessa V. Sochat http://www.wikidata.org/entity/Q30133235
84 brainwane Sumana Harihareswara http://www.wikidata.org/entity/Q18912181
75 lydiapintscher Lydia Pintscher http://www.wikidata.org/entity/Q18016466
56 agbeltran Alejandra González-Beltrán http://www.wikidata.org/entity/Q27824575
22 frimelle Lucie-Aimée Kaffee http://www.wikidata.org/entity/Q37860261
21 isabelleaugenstein Isabelle Augenstein http://www.wikidata.org/entity/Q30338957
20 cnap Courtney Napoles http://www.wikidata.org/entity/Q42797251
15 tudorache Tania Tudorache http://www.wikidata.org/entity/Q29053249
13 vedina Nina Jeliazkova http://www.wikidata.org/entity/Q27061849
11 mkutmon Martina Summer-Kutmon http://www.wikidata.org/entity/Q27987764
7 caoyler Catalina Wilmers http://www.wikidata.org/entity/Q38915853
7 esterpantaleo Ester Pantaleo http://www.wikidata.org/entity/Q28949490
6 NuriaQueralt Núria Queralt Rosinach http://www.wikidata.org/entity/Q29644228
2 rongwangnu Rong Wang http://www.wikidata.org/entity/Q35178434
2 lschiff Lisa Schiff http://www.wikidata.org/entity/Q38916007
1 SigridK Sigrid Klerke http://www.wikidata.org/entity/Q28152723
1 amrapalijz Amrapali Zaveri http://www.wikidata.org/entity/Q34315853
1 mesbahs Sepideh Mesbah http://www.wikidata.org/entity/Q30098458
1 ChristineChichester Christine Chichester http://www.wikidata.org/entity/Q19845665
1 BinaryStars Shima Dastgheib http://www.wikidata.org/entity/Q42091042
1 mollymking Molly M. King http://www.wikidata.org/entity/Q40705344
0 jannahastings Janna Hastings http://www.wikidata.org/entity/Q27902110
0 nmjakobsen Nina Munkholt Jakobsen http://www.wikidata.org/entity/Q38674430

HACK4DK 2017

Posted on Updated on

HACK4DK is an annual event in Copenhagen, bringing together cultural nerds and computer nerds to build interesting things with cultural data. I have been participating since the very beginning and took part in this year’s HACK4DK, which took place at ENIGMA, a museum-to-be in Østerbro, Copenhagen.

The winning project, among around 19 projects this year, was Tin Toy, a neat augmented reality application using images from the toy collection of Holstebro Museum. I believe they used the AR.js JavaScript library. There is a YouTube video that attempts to capture the attractiveness of the project:

The result of my struggles with the A-Frame JavaScript library is available on this page: https://fnielsen.github.io/hack4dk2017/. Under the name “Virtual Gallery of Denmark” it was supposed to be a virtual reality environment presenting Danish art. The end result became a somewhat less dynamic but meditative environment with textured panels flying around in a virtual space, accompanied by sound from old re-recorded phonographs in the Ruben Collection made available by the Royal Library in Aarhus.

[Screenshot: the Virtual Gallery of Denmark environment]

I did not rely on the data provided at the event, but used data from the cultural institutions that had already been uploaded to Wikimedia Commons and whose metadata was described on Wikidata. Both the images of the paintings (which were from Skagens Museum) and the sound were available at Wikimedia Commons and well annotated on Wikidata.

The images were fetched with SPARQL queries to the Wikidata Query Service and API calls to the Wikimedia Commons API, and as such it is fairly easy to change the virtual environment to use other files, which I did afterwards: the Giersing-Bach-Ishizaka-Nielsen virtual environment uses images on Wikimedia Commons where Wikidata records the artist as Harald Giersing. Here the sound is from Kimiko Ishizaka’s Open Goldberg Variations project.
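
The exact queries are not reproduced here, but the Wikidata part of the idea could be sketched as below, where wd:Q000000 is a placeholder for the Wikidata identifier of the relevant artist (P170 is “creator” and P18 is “image”):

import requests

query = """
SELECT ?painting ?image WHERE {
  ?painting wdt:P170 wd:Q000000 .   # creator: placeholder Q-identifier for the artist
  ?painting wdt:P18 ?image .        # image file on Wikimedia Commons
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={'query': query, 'format': 'json'})
image_urls = [binding['image']['value']
              for binding in response.json()['results']['bindings']]
print(image_urls[:5])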

[Screenshot: the Giersing virtual environment]

While A-Frame models are supposed to run straight from the web browser on smartphones, my models seem to have hefty hardware requirements, as the images have quite high resolutions. It takes over 10 seconds on my computer to download all the image and sound files associated with the models. Nevertheless, with a strong computer, a big screen and good headphones, it is quite interesting to view and hear as the paintings and sounds fly by.

How to quickly generate word analogy datasets with Wikidata

Posted on Updated on

One popular task in computational linguistics/natural language processing is the word analogy task: Copenhagen is to Denmark as Berlin is to …?

With queries to the Wikidata Query Service (WDQS) it is reasonably easy to generate word analogy datasets in whatever (Wikidata-supported) language you like. For instance, for capitals and countries, a WDQS SPARQL query that returns results in Danish could go like this:

select
  ?country1Label ?capital1Label
  ?country2Label ?capital2Label
where { 
  ?country1 wdt:P36 ?capital1 .
  ?country1 wdt:P463 wd:Q1065 .
  ?country1 wdt:P1082 ?population1 .
  filter (?population1 > 5000000)
  ?country2 wdt:P36 ?capital2 .
  ?country2 wdt:P463 wd:Q1065 .
  ?country2 wdt:P1082 ?population2 .
  filter (?population2 > 5000000)
  filter (?country1 != ?country2)
  service wikibase:label
    { bd:serviceParam wikibase:language "da". }  
} 
limit 1000

Follow this link to get to the query and press “Run” to get the results. It is possible to download the table in CSV format (see under “Download”). One issue to note is that you get multiple entries for countries with multiple capital cities, e.g., Sydafrika (South Africa) is listed with Pretoria, Kapstaden (Cape Town) and Bloemfontein.
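
From the downloaded CSV, analogy questions of the form “capital1 country1 capital2 country2” can then be written out. A small sketch with Pandas, assuming the results have been saved as query.csv (the file name is just my choice):

import pandas as pd

data = pd.read_csv('query.csv')
for _, row in data.iterrows():
    # One analogy question per row: capital1 is to country1 as capital2 is to country2
    print(row['capital1Label'], row['country1Label'],
          row['capital2Label'], row['country2Label'])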

Mixed indexing with integer index in Pandas DataFrame

Posted on Updated on

Indexing in Python’s Pandas can at times be tricky, in particular mixed indexing (.ix) with an integer index.
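
A small constructed example of the kind of surprise (note that in recent versions of Pandas .ix has been deprecated and removed; .loc and .iloc are the label-based and position-based accessors):

import pandas as pd

# A DataFrame with an integer index that does not start at zero
df = pd.DataFrame({'value': [10.0, 20.0, 30.0]}, index=[1, 2, 3])

print(df.loc[1])   # label-based lookup: the row labelled 1 (value 10.0)
print(df.iloc[1])  # position-based lookup: the second row (value 20.0)

# With an integer index the old .ix accessor was strictly label-based,
# so df.ix[0] would raise a KeyError instead of returning the first row.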

I ran into the issue when I wanted to index with an integer in a DataFrame representing EEG data in one of its methods.

Hull level coloring of a cortical surface representation

Posted on Updated on

 

Hull level coloring of a cortical surface representation constructed by Heather Drury and David Van Essen.

I have just rediscovered my old surface coloring function from the 2003 version of the Brede Toolbox. It can color a surface according to hull level. Here it is with a modified cortical surface representation provided by Heather Drury and David Van Essen.

Matlab code with the Brede Toolbox:

S = brede_sur_drury;
color = brede_sur_color(S, 'style', 'rgb');
figure
brede_ta3_frame, brede_ta3_sur(S, 'color', color);

and then followed by

print -dpng hulllevelcoloring.png

Zipf plot for word counts in Brown corpus

Posted on

[Figure: Zipf (rank-frequency) plot of word counts in the Brown corpus]

There are various ways of plotting the distribution of highly skewed (heavy-tailed) data, e.g., with a histogram with logarithmically-spaced bins on a log-log plot, or by generating a Zipf-like plot (rank-frequency plot) like the above. This figure uses token count data from the Brown corpus as made available in the NLTK package.
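
A similar plot can be generated along the following lines. This is a sketch using NLTK and Matplotlib, and it assumes that the Brown corpus has already been downloaded with nltk.download('brown'):

from collections import Counter

import matplotlib.pyplot as plt
from nltk.corpus import brown

# Token counts for the Brown corpus arranged as a rank-frequency curve
counts = Counter(word.lower() for word in brown.words())
frequencies = sorted(counts.values(), reverse=True)
ranks = range(1, len(frequencies) + 1)

plt.loglog(ranks, frequencies, '.')
plt.xlabel('Rank')
plt.ylabel('Count')
plt.title('Zipf plot for word counts in the Brown corpus')
plt.show()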

For fitting the Zipf curve, a simple SciPy-based approach is suggested on Stack Overflow by “Evert”. More complicated power-law fitting is implemented in the Python package powerlaw, described in “Powerlaw: a Python package for analysis of heavy-tailed distributions”, which is based on the Clauset paper.
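
With the same counts, a fit with the powerlaw package could, as far as I understand its interface, look something like this (a sketch, not the analysis behind the figure):

from collections import Counter

import powerlaw
from nltk.corpus import brown

# Word counts for the Brown corpus, as in the plot above
frequencies = list(Counter(word.lower() for word in brown.words()).values())

# Fit a discrete power law and report the estimated exponent and lower cutoff
fit = powerlaw.Fit(frequencies, discrete=True)
print(fit.power_law.alpha, fit.power_law.xmin)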