
COVID-19 video conferencing


The COVID-19 lockdowns have brought video conferencing to the forefront of attention for many desktop workers, including me. There is a wide selection of video conferencing systems, and the choice is not obvious.

Skype was probably the first video conferencing system I tried seriously, and that was many years ago. We now use it regularly for association board meetings, though usually with audio only and in combination with Etherpad. I am not aware of webinar-like use of Skype.

Skype for Business I have encountered but not used myself. Danish authorities often live in a Microsoft world where Skype for Business seems to be one of the few video conferencing options. I have not been able to get it to work on my Ubuntu computer. There may be an Android version that could be installed and used in case of emergency.

My attempt at using Adobe Connect was unsuccessful. The licenses may have been oversubscribed, as I experienced major connection problems on the first day of the lockdown when trying to use it for course work. This particular system uses Flash, which I actually thought was dead. Installation was possible on an Ubuntu system but, in my opinion, not recommendable. I have not tried it for anything serious.

For YouTube live streaming, there is a 24-hour waiting time. I have enabled it, but it could not be used on the first day for the course webinar, and I have not tried it since.

A colleague suggested Twitch. Live streaming there seems to require downloading an app: Open Broadcaster Software. The use of Twitch in machine learning, programming and university education seems limited, but I may not have looked carefully.

Zoom was quickly rumoured to be a good choice. My university promptly set up a special login with a dedicated domain. Zoom would like you to download a dedicated program. As far as I understand, chat is not maintained between sessions.

In one respect, Zoom actually worked better than physical meetings. Getting the signal from my laptop to the big screen in a meeting room is not necessarily a trivial task; last time, it embarrassingly took several minutes to get it through. In the online Zoom meeting, screen sharing seemed straightforward, and the students could switch screen sharing with seemingly no effort. However, this was only for an 8-person meeting.

Zoom is not a panacea, though. Recently, I could not join a meeting because my version of Zoom was too old. Joining via the browser resulted in very poor audio quality. Upgrading the program was quick, but some audio problems persisted.

A PhD student at our section affiliated with the local Microsoft office suggested (obviously) Microsoft Teams. Like Zoom, Microsoft would like you to download the dedicated Microsoft Teams program. Chat is maintained between sessions, which is an advantage over Zoom. For video conferencing in a course, subchannels can be created where individual teaching assistance can be provided.

Another PhD student at our section was affiliated with Telenor and suggested (obviously) that company's Appear.in. It has since changed its name to Whereby, and another company seems to have taken over the service. Whereby is Norwegian, neat and nimble. It requires less setup than other services: you just point your browser to a specific URL. The bad news is that Whereby is restricted to 4 users. For an emergency where I could not get Skype for Business working, I upgraded to a paid account with a maximum of 12 users and distributed my dedicated URL, which the other video conference participants could copy and paste into their browsers with no further problems, at least from my side. There may have been browser issues for others; one may need to try Google Chrome instead of Firefox. The owner of the room controls who gets in; other users need to knock on “the conference door”.

For the #vBIB20 conference, the organizers used GoToWebinar, which I was also to use for an invited talk. I tested it the day before, and it seemed to work for me. On the day of the talk, I found out that the GoToWebinar link I had was not for panelists and that a special link was required. Searching the depths of my email, I found the link, but unfortunately it did not work. A system test for GoToWebinar was also available, but it deceptively reported that my system was OK. The system did not seem to support Linux for panelists. The problem seems also to have hit Wikicite colleague Jakob Voß: “Wenn dir erst 3 Stunden vor deiner #vBIB20 Präsentation auffällt, dass GoToMeeting für Referent*innen kein Linux unterstützt” (“When you only notice 3 hours before your #vBIB20 presentation that GoToMeeting does not support Linux for presenters”), he wrote on Twitter.

For part of the #vBIB20 conference, the WikiLunch, the organizers used a Jitsi instance running on a domain controlled by Wikimedia Deutschland. Wikimedia Deutschland has consistently shown itself to be the technically strongest Wikimedia chapter. I had not seriously tried Jitsi before, but the video session seemed, from my side, to work more or less well, both in-browser video and screen sharing. It was unclear to me how many had joined the Jitsi meeting.

For recording screencasts, I have downloaded the open source OBS Studio. I record my screencasts and afterwards upload the movie file to our dedicated university video server.

From the Wikipedia world, particularly Andrew Lih, I have heard of Streamlabs. As I understand it, it can stream to a range of (“any”?) platforms, making you less reliant on each individual service.

Discord I have been introduced to, but not used for video conferencing.

Google has some services. G Suite mimics some of the Microsoft programs. I have not yet used Google video conferencing, apart from Google Hangouts several years ago.

A problem with these video conferencing systems is that the dedicated programs fill up the hard disk and sometimes seem to compete for the camera.

Some problems with Danish and Wikidata lexemes


Ordia-professortiltrædelsesforelæsning

Is it at all possible to describe natural languages in a structured way? There are many special cases and oddities of the Danish language that one continuously discovers when entering Danish lexemes into Wikidata.

  1. What do we do with øl (beer)? It can be both a neuter (neutrum) and a common gender (utrum) word, and the semantics of the two versions differ. In Wikidata, both can appear under the same lexeme, but it is less clear how one then keeps track of which form is associated with which sense. Qualifiers might be brought in to help; there is, however, to my knowledge no property that can currently be used for this. (See the query sketch after this list.)
  2. Is hænge (hang) one or two lexemes? There is a transitive and an intransitive version with a slight semantic difference. Den Danske Ordbog (DDS) has only one entry for hænge and then spends some words explaining the form-sense complexity. Wikidata currently has L45348 and L45349.
  3. Is the “an” in “ankomme”, “ankomst”, “anslå”, “anvende”, etc. a dedicated prefix, or should it be regarded as the adverb “an” attached to a verb or a noun? The problem with regarding “an” as a prefix is that many of the other units attached before komst are adverbs: “bort”, “frem”, “ind”, “op”, “sammen”, “til”, and these units do not look prefix-like to me.
  4. It is sometimes not clear what a part of a compound should be linked to. For instance, tilskadekomst (injury) can be decomposed into “til”, “skade” and “komst”. The “til” could be regarded as a preposition or an adverb. For indeklima (indoor climate), inde could be the adverb inde or the adverb ind plus an -e- interfix.
  5. Should briller (glasses) and bukser (trousers) be plurale tantum? In common language, briller and bukser are plurale tantum, but among professional salespeople you find the singular versions brille and buks. How would one indicate that? Note that compounds with these words may use the singular versions, e.g., bukseopslag and brilleetui.
  6. For some nouns, the singular forms may be so prevalent and the plural forms so rare that it is a question whether the word is a singulare tantum or a countable noun. For example, tillid (trust) may be found in rare instances as tillider, but does that make it a countable noun?
  7. What kind of word is komst? Is it a suffix? Then what about the word genkomst? It has the prefix “gen-” and the suffix komst, so where is the root? Maybe it is better to regard it as part of a tyttebærord (cranberry word), where a word once recognized as an independent word has “lost its independence”. Komst has an entry in the old Danish dictionary, but not in newer Danish dictionaries.
  8. Following Grammatik over det Danske Sprog (GDS), some nouns have been added as “nexual nouns” or “innexual nouns”. The precise delineation of these classes is not clear, e.g., where would agent words such as woman, carpenter and cat be placed? They are not nexual, as far as I can see, but does that make them innexual? There is a range of Danish words where I am unsure: landskab (landscape), musikrum (music room), etc. So far, I have left out any annotation of such words.
  9. Where do beslutte and similar words derive from? According to Den Danske Ordbog (DDS), it derives from Middle Low German “besluten”, but it could also be regarded as derived from the “be-” prefix and the verb “slutte”. It is only partially possible to represent both derivation paths in Wikidata.
  10. Wikidata has the “lexical category” field. For an affix, it is not clear what the category should be: affix, suffix/prefix, or perhaps something else?
  11. A particular class of words at the intersection of nouns and verbs is what has been termed centaurs. They might be inflections of verbs, or they might be derivations from verbs to nouns. Examples are råben (shouting as a noun), løben (running) and skrigen (screaming). They do not seem to have any inflections themselves, so should they be regarded as just an inflection of a verb and entered as a form under the corresponding verbal lexeme, e.g., råbe? On the other hand, DDS has råben as an independent entry, and I have also added råben as an independent lexeme in Wikidata. There, this enables a more straightforward link to the synonym råberi.
  12. Which lexical category should we link the parts of compounds to? Some compounds may be analyzed as arising from a noun or a verb (or possibly other lexical categories), e.g., springvand has the parts spring and vand. It is not clear, at least to me, whether spring should be linked to the noun spring or to the root form of the verb springe.
  13. Should the s-genitive forms of Danish nouns be recorded under forms? The naïve approach is to add the s-genitive forms, doubling the number of Danish noun forms. Modern linguists seem to think (if I understand them correctly) that the appended “s” is enclitic and the s-genitive thus not a form, much like in English, where the ’s genitive is not recorded as a form. For English, the apostrophe separates the s from the rest of the word, so there it is natural not to include the genitive form.
  14. Hjem, hjemme and hjemad are three words and possibly one, two or three lexemes. If they are three lexemes, then how can we link them?
  15. When is a noun derived from a verb and not the other way around? It is particularly a question for (possible) root derivations, where the noun is shorter than the verb. For the noun bijob and the verb bijobbe, it seems that the noun forms the basis for the derivation to the verb.
  16. Genforhandling (renegotiation) can (in my opinion) be derived along at least two paths: gen+forhandling (re+negotiation) and genforhandl+ing (renegotiate+ion). The “derived from” property can contain both, but the “combines” property is not suitable for this case.
  17. Professortiltrædelsesforelæsning (professorial inaugural lecture) is another word where I am uncertain how best to decompose it: professor+tiltrædelsesforelæsning or professortiltrædelse+s+forelæsning?
  18. What sort of word is politiker (politician)? A nomen agentis is (usually) derived from a verb, and with the analysis of the word into politik+er, politiker is not a nomen agentis. But Den Store Danske notes that a nomen agentis can also be derived from other “nomens”, e.g., skuespiller (actor) from skuespil (acting or play). So is it OK to regard politiker as a nomen agentis?
  19. Some words for items that might appear as a collective are a singular concept as a Wikidata lexeme but a collective as a Wikidata item, e.g., dansker (Dane) corresponds to danskere (Danes) among Wikidata’s Q-items. Connecting the entities via P5137 is a bit of a stretch. The same issue may be said to exist for animals and species.
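To make the form/sense disconnect in the first item concrete, here is a minimal sketch that lists all form/sense pairs for a lexeme with the lemma øl. It assumes the standard Wikidata RDF vocabulary (Danish as Q9035, sense glosses exposed as skos:definition); since nothing in the data model ties a particular form to a particular sense, every form is simply paired with every sense:

import requests

# List forms and senses of the Danish lexeme "øl" side by side.
# The cross product illustrates that no property links a form to a sense.
query = """
SELECT ?form ?representation ?sense ?gloss WHERE {
  ?lexeme dct:language wd:Q9035 ;          # Danish
          wikibase:lemma "øl"@da ;
          ontolex:lexicalForm ?form ;
          ontolex:sense ?sense .
  ?form ontolex:representation ?representation .
  ?sense skos:definition ?gloss .
  FILTER(LANG(?gloss) = "da")
}
"""
response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["representation"]["value"], "<->", row["gloss"]["value"])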

Read also my paper Danish in Wikidata lexemes (Scholia) and perhaps also Validating Danish Wikidata lexemes (Scholia).

 

Photo in graph: Reading (7052753377).jpg, Moyan Breen, CC-BY 2.0.

What does “toxicity” mean?


A range of people now use “toxic” and “toxicity” in the context of messages on social media. I have had a problem with these words because I lacked a clear definition of the concept behind them. What is a toxic social media post? Negative sentiment, rudeness, harassment, cyberbullying, trolling, rumour spreading, false news, heated arguments and possibly more may be mixed together.

The English Wiktionary currently lists “Severely negative or harmful” as a gloss for a figurative sense of the word “toxic”.

For social media, a 2015 post in relation to League of Legends, “Doing Something About the ‘Impossible Problem’ of Abuse in Online Games“,  mentions “toxicity” along with “online harassment”. They “classified online citizens from negative to positive”, apparently based on language from “trash talk” to “non-extreme but still generally offensive language”. What precisely “trash talk” is in the context of social media is not clear to me. The English Wikipedia describes “Trash-talk” in the context of sports. A related term, “Smack talk”, is defined for Internet behaviour.

There are now a few scholarly papers using the term.

For instance, the September 2019 paper “Detecting Toxicity Triggers in Online Discussions”, which writes “detecting these toxicity triggers is vital because online toxicity has a contagious nature”, cites our paper “Good Friends, Bad News – Affect and Virality in Twitter”. I think this citation has some issues. First of all, we do not use the word “toxicity” in our paper. Earlier in their paper, the authors seem to equate toxicity with rudeness and harassment, but our paper did not specifically look at that. Our paper focused particularly on “newsness” and sentiment score. A simplified conclusion would be that negative news is more viral. News articles are rarely rude or harassing.

Julian Risch and Ralf Krestel in “Toxic Comment Detection in Online Discussions” write: “A toxic comment is defined as a rude, disrespectful, or unreasonable comment that is likely to make other users leave a discussion”. This phrase seems to originate from the 2018 Kaggle competition “Toxic Comment Classification Challenge”: “…negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion)”. The aspects to be classified in the competition were “threats, obscenity, insults, and identity-based hate”.

Risch and Krestel are the first I have run into with a good discussion of the aspects of what they call toxicity. They seem to be inspired by the work on Wikipedia, citing Ellery Wulczyn et al.’s “Ex Machina: Personal Attacks Seen at Scale”. Wulczyn’s work goes back to 2016 with the Detox research project. This research project may have been spawned by an entry in the Wikimedia community’s 2015 wishlist. “Algorithms and insults: Scaling up our understanding of harassment on Wikipedia” is a blog post about the research project.

The Wulczyn paper describes the construction of a corpus of comments from the article and user talk pages of the English Wikipedia. The labelling described in the paper focuses on “personal attack or harassment”. The authors define a “toxicity level” quantity as the number of personal attacks by a user (in the particular year examined). Why “personal attack level” is not used instead of “toxicity” is not clear.

It is interesting that the Kaggle competition defines “toxicity” via the likelihood of making “other users leave a discussion”. I would usually think that heated discussions tend to attract people to the discussion, at least on “discussion social media” such as Reddit, Twitter and Facebook, though I suppose this is an open question. I do not recall seeing any study modelling the relationship between user retention and personal attacks or obscene language.

The paper “Convolutional Neural Networks for Toxic Comment Classification” from 2018 cites a Maeve Duggan Pew report, “Online Harassment”, in the context “Text arising from online interactive communication hides many hazards such as fake news, online harassment and toxicity”. If you look up the Pew report, the words “fake news” and “toxic” hardly appear (the latter only in a quoted user comment on “toxic masculinity”).

Google’s Perspective API can analyze a text and return a “toxicity” score.
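As a sketch of how that works: the API takes a JSON payload with the text and the requested attributes and returns a summary score between 0 and 1. This assumes you have registered for an API key (the key below is a hypothetical placeholder); payload and response shapes follow the public documentation as I read it:

import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder for a real key
url = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=" + API_KEY)
payload = {
    "comment": {"text": "You are a complete idiot!"},
    "languages": ["en"],
    "requestedAttributes": {"TOXICITY": {}},
}
response = requests.post(url, json=payload)
# The summary score is a probability-like value between 0 and 1.
score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(score)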

The current English Wikipedia article on “toxicity” only describes the chemical sense of the word. The “Toxic” disambiguation page has three relevant links: “toxic leader”, “toxic masculinity” and “toxic workplace”.

It still seems to me that “toxicity” and “toxic” are too fuzzy words to be used in serious contexts without a proper definition. It is also not clear to me whether, e.g., the expression of strong negative sentiment, which could potentially be classified as “toxic”, necessarily affects the productivity and health of a community negatively. The 2015 harassment survey from the Wikimedia Foundation examined “Effects of experiencing harassment on participation levels” (Figure 47), and at least here the effect on Wikimedia project participation levels seems to be seriously negative. The word toxic was apparently not used in the survey itself, though among the example ideas for improvements from the respondents is listed: “Scoring the toxicity of users and watching toxic users’ actions in a community tool like the anti-vandal software.”

NeurIPS in Wikidata


scholia-neurips-2019-co-authors.png
Co-authors in the NeurIPS 2019 conference based on data in Wikidata. Screenshot based on Scholia at https://tools.wmflabs.org/scholia/event/Q61582928.

The machine learning and neuroinformatics conference NeurIPS 2019 (NIPS 2019) takes place in the middle of December 2019. The conference series has always had a high standing and has grown considerably in recent years.

All papers from the conference are available online at papers.nips.cc. There is, to my knowledge, little structured metadata associated with the papers, though the website is consistently formatted in HTML, so the metadata can be scraped relatively easily. There are no consistent identifiers that I know of for the papers on the site: no DOIs, no ORCID iDs, nor anything else. A few papers may be indexed here and there on third-party sites.

In the Scholia Python package, I have made a module for scraping the papers.nips.cc website and converting the metadata to the QuickStatements format for Magnus Manske’s web application, which submits the data to Wikidata. The entry of the basic metadata about the NeurIPS papers is more or less complete, though a check is needed to see whether everything has been entered. One issue the Python code attempts to counter is the case where a scraped paper is already in Wikidata; given that there are no identifiers, the matching is somewhat heuristic.
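For illustration, here is a minimal, hypothetical sketch (not the actual Scholia code) of what such a conversion to QuickStatements v1 commands can look like; Q13442814 is “scholarly article” and Q68600639 the NeurIPS 2019 proceedings item:

def paper_to_quickstatements(title, authors, year=2019,
                             proceedings="Q68600639"):
    """Convert scraped paper metadata to QuickStatements v1 commands."""
    lines = [
        "CREATE",
        'LAST\tLen\t"{}"'.format(title),             # English label
        "LAST\tP31\tQ13442814",                      # instance of: scholarly article
        'LAST\tP1476\ten:"{}"'.format(title),        # title
        "LAST\tP577\t+{}-00-00T00:00:00Z/9".format(year),  # publication date, year precision
        "LAST\tP1433\t{}".format(proceedings),       # published in
    ]
    # P2093: author name string, with P1545 (series ordinal) qualifier
    for ordinal, author in enumerate(authors, start=1):
        lines.append('LAST\tP2093\t"{}"\tP1545\t"{}"'.format(author, ordinal))
    return "\n".join(lines)

print(paper_to_quickstatements("An Example NeurIPS Paper",
                               ["Alice Author", "Bob Author"]))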

Authors have separate webpages on papers.nips.cc listing their papers published at the conference. These pages are quite well curated, though I have discovered authors with several associated webpages: the Bayesian Carl Edward Rasmussen is under http://papers.nips.cc/author/carl-edward-rasmussen-1254, https://papers.nips.cc/author/carl-e-rasmussen-2143 and http://papers.nips.cc/author/carl-edward-rasmussen-6617. Joshua B. Tenenbaum is also split.

Authors are not resolved by the Scholia code; they are just represented as strings. The Author Disambiguator tool, which Arthur P. Smith has built from a tool by Magnus Manske, can semi-automatically resolve authors, i.e., associate the author of a paper with a specific Wikidata item representing a human. The Scholia website has special pages (“missing”) that make contextual links to the Author Disambiguator. For the NeurIPS 2019 proceedings, the links can be seen at https://tools.wmflabs.org/scholia/venue/Q68600639/missing. There are currently over 1,400 authors that need to be resolved. Some of these are not easy: multiple authors may share the same name, e.g., some European names such as Andreas Krause, and I have difficulty knowing how unique East Asian names are. So far, only 50 authors from the NeurIPS conference have been resolved.

There is no citation information when the data is first entered with the Scholia and QuickStatements tools, and there are currently no means to enter that information automatically. The NeurIPS proceedings are, as far as I know, not available through CrossRef.

Since there is little editorial control of the format of the references, they come in various shapes and may need “interpretation”. For instance, “Semi-Supervised Learning in Gigantic Image Collections” claims a citation to “[3] Y. Bengio, O. Delalleau, N. L. Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. In NIPS, pages 2197–2219, 2004.” But that is unlikely to be a NIPS paper, and the reference should probably go to Neural Computation.

The Wikidata ontology for annotating what papers are about is not necessarily good. Some concepts in the cognitive sciences, including psychology, machine learning and neuroscience, end up split or merged. Reinforcement learning is an example where the English Wikipedia article focuses on the machine learning aspect, while Wikidata also tags neuroscience-oriented articles with the concept. For many papers, I find it difficult to link to the ontology, as the topic of the paper is so specialized that it is hard to identify an appropriate Wikidata item.

With the data in Wikidata, it is possible to examine many aspects of the data with the Wikidata Query Service and Scholia. For instance:

  1. Who has the most papers at NeurIPS 2019? A panel on a Scholia page readily shows this to be Sergey Levine, Francis Bach, Pieter Abbeel and Yoshua Bengio.
  2. The heuristically computed topic scores on the event page for NeurIPS 2019 show that reinforcement learning, generative adversarial networks, deep learning, machine learning and meta-learning are central topics this year. (Here one needs to keep in mind that the annotation in Wikidata is incomplete.)
  3. Which Danish researchers have been listed as authors on the most NeurIPS papers through time? This can be asked with a query to the Wikidata Query Service: https://w.wiki/DTp. It depends on what is meant by “Danish”; here it is based on employment/affiliation and gives Carl Edward Rasmussen, Ole Winther, Ricardo Henao, Yevgeny Seldin, Lars Kai Hansen and Anders Krogh. A simpler query in the same spirit is sketched below.
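The query behind the link is not reproduced here, but a simpler query of the same kind, counting papers per resolved author in the NeurIPS 2019 proceedings (Q68600639), could look like this sketch:

import requests

query = """
SELECT ?author ?authorLabel (COUNT(?paper) AS ?papers) WHERE {
  ?paper wdt:P1433 wd:Q68600639 ;  # published in: NeurIPS 2019 proceedings
         wdt:P50 ?author .         # author (resolved to a Wikidata item)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?author ?authorLabel
ORDER BY DESC(?papers)
LIMIT 10
"""
response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["papers"]["value"], row["authorLabel"]["value"])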

Grammatik over det Danske Sprog


Grammatik over det Danske Sprog (GDS) is a three-volume book series attempting to describe the grammar of the Danish language. The first edition was published in 2011, while the second is from 2019. Here are a few notes about the work:

  1. It is unclear to me whether there is any difference between the first and second editions. I found no changes in the pages I looked at, though I suppose minor changes might have occurred here and there.
  2. If one had thought that it would be straightforward to develop a computer-based grammar checker from the work, one would be wrong. The book does not seem to have been written with computational linguistics in mind. But I should think that the many examples in the book could be used to generate data for training and evaluating computer-based systems.
  3. Interestingly, certain location adverbs are regarded as adverbs with inflection (page 216). Words such as ned, nedad and nede I would regard as independent lexemes, while GDS regards them as inflected based on telicity and dynamic/static distinctions. In Wikidata, I have created three lexemes rather than one lexeme with three forms. To me, nedad is a compound of the two words “ned” and “ad”.
  4. Word stems are regarded as a feature of the form rather than of the lexeme (page 165), so that the stem of the adjective form smukke is not smuk, but smukk!
  5. A word such as begynde is regarded as a cranberry word, with gynde as the cranberry morpheme (page 255). Den Store Danske instead points to the Middle German beginnen. If we take GDS’s approach, then begynde should be said to be composed of the be- prefix and the gynde cranberry morpheme.
  6. From GDS and other works on Danish grammar, I have not yet come to a clear understanding of why certain linguistic elements are regarded as prefixes and when they are regarded as words in compounding. For instance, an- in ankomme is regarded as a prefix in GDS (page 256), but an is also an independent word and can go together with komme (“komme an”).
  7. The concept of “nexual” and “innexual” nouns (Danish: “neksuale/inneksuale substantiver”) is described, but it is not clear to me how words for agents (painter, human, animal) or words such as home or landscape should be annotated with respect to this concept.
  8. Lexemes for cardinal and ordinal numbers are called “kardinaltal” and “ordinaltal”. In Wikidata, I invented the words kardinaltalord and ordinaltalord to distinguish between cardinal numbers and the words that represent cardinal numbers.
  9. There are hardly any inline references. In many cases, I am uncertain whether the claims presented are widely established knowledge among linguists or the authors’ sole expert opinion, which may or may not be contested.

Roberta’s +5-fine workshop on natural language processing


Interacting Minds Centre (Scholia) at Aarhus University (Scholia) held a finely organized workshop, NLP workshop @IMC, Fall 2019, in November 2019, where I gave a talk titled Detecting the odd-one-out among Danish words and phrases with word embeddings.

Fritz Günther (Scholia) keynoted from his publication Vector-Space Models of Semantic Representation From a Cognitive Perspective: A Discussion of Common Misconceptions (Scholia). A question is whether the distributional semantic models (DSMs)/vector-space models we identify from computing on large corpora make sense with respect to cognitive theories: “Although DSMs might be valuable for engineering word meanings, this does not automatically qualify them as plausible psychological models”.

Two Årups displayed their work on SENTIDA, “a new tool for sentiment analysis in Danish”. In its current form, it is an R package. It has been described in the article SENTIDA: A New Tool for Sentiment Analysis in Danish (Scholia). According to their evaluation, SENTIDA beats my AFINN tool for Danish sentiment analysis.

word-intrusion
From our paper Combining embedding methods for a word intrusion task.

My own talk on Detecting the odd-one-out among Danish words and phrases with word embeddings was based on the distributional semantics representation evaluation work with Lars Kai Hansen (Scholia): our 2017 paper Open semantic analysis: The case of word level semantics in Danish (Scholia) and our newer 2019 paper Combining embedding methods for a word intrusion task (Scholia). The idea is to look at a Danish textual odd-one-out task/word intrusion task and see what models trained on various corpora can do. Our current state of the art is a combination of embedding models, with fastText as the primary one and Wembedder used for proper nouns.
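The core of the task is simple to sketch with a single pretrained model. The snippet below (a simplification, not our combined method) uses the Danish fastText vectors, the cc.da.300.bin file distributed by fastText, with gensim:

from gensim.models.fasttext import load_facebook_vectors

# Load pretrained Danish fastText vectors (download cc.da.300.bin first).
wv = load_facebook_vectors("cc.da.300.bin")

# Odd-one-out: three animals and a piece of furniture.
# doesnt_match returns the word furthest from the mean vector.
print(wv.doesnt_match(["hund", "kat", "hest", "stol"]))  # expected: "stol"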

Two Aarhus students, Jan Kostkan and Malte Lau Petersen (Scholia), are downloading European parliament text data and analyzing it. A text corpus from Folketinget, the Danish Parliament, may become available with tens of millions of sentences.

Ulf Berthelsen, with whom I share the Teaching platform for developing and automatically tracking early stage literacy skill (Scholia) research project, spoke on late-stage literacy skills.

Natalie Schluter (Scholia) spoke on glass ceiling effects in the natural language processing field. She has an associated paper The glass ceiling in NLP (Scholia) from EMNLP 2018.

Matthew Wilkens (Scholia) spoke on “Geography and gender in 20,000 British novels”, a large-scale analysis of how geography is used in British novels. This aligned well with some work I did a few years ago on geographically mapping narrative locations of Danish literature with the Littar website and the paper Literature, Geolocation and Wikidata (Scholia) from the Wiki Workshop 2016.

Nielsen2019Detecting
Screenshot from Littar: Narrative locations of Danish literature.

There were a number of other contributions in the workshop.

The second day of the workshop featured hands-on text analysis, with, among others, Rebekah Baglini and Matthew Wilkens getting participants to work on prepared Google Colab notebooks.

HACK4DK 2019: Lydmaleri


Lydmaleri
Screenshot from Lydmaleri, my HACK4DK 2019 contribution.

At HACK4DK, an annually recurring event in autumn Copenhagen, museums, libraries, archives and whatever else there is bring their open data so that enthusiasts in the form of programmers, data scientists, designers and the like can build things, typically a computer program with a visualization.

I believe I have participated since 2013. In any case, my blog has the image remixes I made: Gammelstrand remixed and Jailhouse remixed, later Kulturvet remixed and Fishy fishmongers of Fischer. Newer HACK4DK contributions can be found at https://fnielsen.github.io/. Last year, the result was an analysis of Danish films with data from the Danish Film Institute via the data that mainly Steen Thomassen has transferred to Wikidata.

HACK4DK 2019 took place at SMK over just one and a half days, Friday and Saturday, 15-16 November 2019. The result was quite a good series of projects and visualizations. While in previous years it has been hit and miss whether people were able to get something useful out of deep learning, this time several projects landed quite well with this technique.

One project combined StyleGANs, DeOldify and GPT-2 text generation on old photos from Kolding, a classic dataset in HACK4DK contexts. A glimpse of the result is shown in one of Andreas Refsgaard’s tweets. Here, samples are apparently drawn in a latent subspace and colorized via DeOldify. Perhaps most interesting were the fake biographies that could be created from the text accompanying the images and GPT-2, with a little help in the form of text beginnings. Runway ML was used.

Several participants used, as far as I understood, pretrained JavaScript versions of style-transfer networks, such as arbitrary-image-stylization-tfjs.

Albin Larsson constructed an interactive visualization of SMK paintings, as far as I understand based on Christopher Pietsch’s VIKUS Viewer.

I used paintings and painting data from Danish collections, as they are represented in Wikidata and Wikimedia Commons, for Lydmaleri (“sound painting”), where the web user is presented with a painting with regions bound to relevant sounds. For example, the painting Et selskab af danske kunstnere i Rom (SMK, Wikipedia, Wikimedia Commons, Wikidata) shows a dog in the right corner. When the user clicks on the dog in Lydmaleri, a woof is heard. It is also possible for the user to click through to other paintings. To get hold of the data, I use a SPARQL query sent to the Wikidata Query Service. The result is processed in the web page’s JavaScript, which plays the sound when a click falls within a sound rectangle and switches the painting and sounds when the user browses on.

Lydmaleri uses no machine learning. Instead, the object recognition in the painting is based on information explicitly entered in Wikidata. For Et selskab af danske kunstnere i Rom, it is thus specified that a dog is depicted, and with a so-called qualifier, percent coordinates can indicate where in the painting the dog is located. For entering the coordinates, Lucas Werkmeister’s wd-image-positions web application can be used. This is a quite time-consuming process, though one that can be done collaboratively on the Internet.
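A sketch of the kind of query involved, assuming the “depicts” property (P180) with the “relative position within image” qualifier (P2677), whose value is an IIIF-style “pct:x,y,w,h” string; the painting identifier below is a hypothetical placeholder:

import requests

PAINTING = "Q12345"  # hypothetical placeholder for the painting's Q-identifier

query = """
SELECT ?depicted ?depictedLabel ?region WHERE {
  wd:%s p:P180 [ ps:P180 ?depicted ; pq:P2677 ?region ] .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en" . }
}
""" % PAINTING

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in response.json()["results"]["bindings"]:
    # Each row gives a depicted entity and its region within the image.
    print(row["depictedLabel"]["value"], row["region"]["value"])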

As a front-end developer wannabe, my CSS and JavaScript skills sometimes fall short. Image loading could be faster, and the positioning of the images and the click areas could also be improved. There are challenges with certain systems: Apple’s Safari browser apparently will not play the sounds, presumably because the OGG audio format is not supported. My Ubuntu Firefox and Ubuntu Google Chrome have no such problems. Android systems can have problems displaying some of the page’s components the way I had intended.

SEMANTiCS 2019


Nielsen2019Validating_poster

SEMANTiCS (Scholia) is a conference in the artificial intelligence/Semantic Web domain, a combination of an industry and a research-oriented conference. Among the Danish SEMANTiCS 2019 participants, there was (I believe) a 5-to-2 preponderance of non-academics over academics. The conference has so far been held at different locations in Europe. Next year, a spin-off conference is to be held in Austin, and in 2020 there will also be a conference in Amsterdam.

SEMANTiCS 2019 (Scholia) was held in Karlsruhe, the birthplace of much of semantic wiki technology. If my notes are correct, there were 88 submissions to the R&I track; 20 long papers and 8 short papers were accepted, for an acceptance rate of 27%. With only two days for the conference, there is considerably less research than at ESWC (Scholia). The invoice amount for my registration was €637.

The first day was allocated to workshops, and I attended the 1st International Conference on the European Industry on Language Technology (ELSE-IF 2019). It was the smallest conference I have ever attended: we could fit around a table, and I guess around 10 people participated. Luc Meertens presented work that has been prepared for the European Commission regarding European language technology; a large report is supposed to be published soon. The European language technology industry is relatively small, and I sense a fear that American (and perhaps Chinese?) big tech may take part of the cake. As far as I remember, the SDL company may be the largest European one. There are several European language technology research projects. I gave a short improvised demonstration of Wikidata lexemes and Ordia in the workshop.

At SEMANTiCS 2019, Michel Dumontier keynoted on the FAIR (findable, accessible, interoperable, reusable) data principles. Valentina Presutti spoke on the issue of common knowledge in the Semantic Web, pointing to ConceptNet, Tom Mitchell’s NELL, Atomic, the human know-how dataset, FrameNet and Framester.

Under the title Language-related Linked (Open) Data for Knowledge Solutions, Artificial Intelligence and more, Felix Sasaki from Cornelsen and Christian Lieske from SAP presented information on linguistic linked open data and Wikidata. They have recently written an online article, Wikidata gets wordier, that mentions and shows screenshots of Ordia.

To my knowledge, the articles were not published before the event, neither for the main conference nor for the workshops. However, our (Katherine Thornton (Scholia), Jose Emilio Labra Gayo (Scholia) and I) SEMANTiCS poster paper Validating Danish Wikidata lexemes (Scholia) has been available for some time now. The minute madness slides and the poster are now also available. I have added a few of the other papers to Wikidata so that they show up in Scholia’s page for SEMANTiCS 2019.

Update 19 September 2019: Preproceedings were in fact available. Thanks to Harshvardhan J. Pandit for making me aware of the link.

Kvinder uden mænd (Women Without Men) by Shahrnush Parsipur


A novel, first disguised as a short story collection and later revealed as a novel, describes various 1950s Iranian women in bleak and locked-in situations under violent male dominance, where the breakouts happen through fantastical scenes, perhaps expressing an escape from the world brought on by thwarted agency.

Review on LibraryThing. Wikidata.

On the road to joint embedding with Wikidata lexemes?


road-to-joint-embedding

Is it possible to use Wikidata lexemes for joint embedding, i.e., combining word embedding and knowledge graph entity embedding?

You can create on-the-fly text examples for joint embedding with the Wikidata Query Service. This SPARQL query attempts to interpolate a knowledge graph entity identifier into a text using the usage example text (P5831):

SELECT * {
  # Lexemes with a usage example (P5831) where the demonstrated
  # form is given by the P5830 qualifier
  ?lexeme dct:language ?language ;
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm ?form ;
          p:P5831 [
            ps:P5831 ?text ;
            pq:P5830 ?form
          ] .
  # Strip the 31-character http://www.wikidata.org/entity/ prefix
  BIND(SUBSTR(STR(?form), 32) AS ?entity)

  ?form ontolex:representation ?word .
  # Replace the word with its form identifier in the example text
  BIND(REPLACE(?text, STR(?word), ?entity) AS ?interpolated_text)
}

The result is here.

The interpolations are not perfect: there is a problem with capitalization at the beginning of a sentence, and short words may be interpolated into the middle of longer words (I have not been able to get a regular expression with the word separator “\b” working). Alternatively, the SPARQL query result may be downloaded and the interpolation performed in a language that supports more advanced regular expression patterns, as sketched below.
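For example, in Python, where “\b” works, the interpolation could be done like this (a sketch mirroring the query above, with lemma and language dropped for brevity):

import re
import requests

query = """
SELECT ?form ?word ?text WHERE {
  ?lexeme ontolex:lexicalForm ?form ;
          p:P5831 [ ps:P5831 ?text ; pq:P5830 ?form ] .
  ?form ontolex:representation ?word .
}
LIMIT 100
"""
response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in response.json()["results"]["bindings"]:
    entity = row["form"]["value"].replace("http://www.wikidata.org/entity/", "")
    word, text = row["word"]["value"], row["text"]["value"]
    # \b ensures only whole-word matches are replaced
    interpolated = re.sub(r"\b%s\b" % re.escape(word), entity, text)
    print(interpolated)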

The number of annotated usage examples in Wikidata across languages is ridiculously small compared to the corpora typically applied in successful word embedding.

Update:

You can also interpolate the sense identifier: Here is the Wikidata Query Service result.