
Some problems with Danish and Wikidata lexemes

Ordia-professortiltrædelsesforelæsning

Is it at all possible to describe natural languages in a structured way? There are many special cases and oddities of the Danish language that one continuously discovers when entering Danish lexemes in Wikidata.

  1. What do we do with øl (beer)? It can be both a neuter (neutrum) and a common gender (utrum) word, and the semantics of the two versions differ. In Wikidata they can both appear under the same lexeme, but how one then keeps track of which form is associated with which sense is less clear. Qualifiers may be brought in to help; there is, however, to my knowledge no property that can currently be used for this.
  2. Is hænge (hang) one or two lexemes? There is a transitive and an intransitive version with a slight semantic difference. Den Danske Ordbog (DDO) has only one entry for hænge and then spends some words explaining the form-sense complexity. Wikidata currently has L45348 and L45349.
  3. Is the “an” in “ankomme”, “ankomst”, “anslå”, “anvende”, etc. a dedicated prefix, or should it be regarded as the adverb “an” attached to a verb or a noun? The problem with regarding “an” as a prefix is that many other units that attach before komst are adverbs: “bort”, “frem”, “ind”, “op”, “sammen”, “til”, and these units do not look prefix-like to me.
  4. It is sometimes not clear what a part of a compound should be linked to. For instance, tilskadekomst (injury) can be decomposed into “til”, “skade” and “komst”. The “til” could be regarded as a preposition or an adverb. For indeklima (indoor climate), inde could be the adverb inde or the adverb ind plus an -e- interfix.
  5. Should briller (glasses) and bukser (trousers) be plurale tantum? In common language briller and bukser are plurale tantum, but among professional salespeople you find the singular forms brille and buks. How would you indicate that? Note that compounds with these words may use the singular forms, e.g., bukseopslag and brilleetui.
  6. For some nouns the singular forms may be so prevalent and the plural forms so rare that it becomes a question whether the word is a singulare tantum or a countable noun, e.g., tillid (trust) may in rare instances be found as tillider, but does that make it a countable noun?
  7. What kind of word is komst? Is it a suffix? Then what about the word genkomst: it has the prefix “gen-” and the suffix komst, so where is the root? Maybe it is better to regard it as part of a tyttebærord (“cranberry word”), where a word once recognized as an independent word has “lost its independence”. Komst has an entry in the old Danish dictionary, but not in newer Danish dictionaries.
  8. Following Grammatik over det Danske Sprog (GDS), some nouns have been annotated as “nexual nouns” or “innexual nouns”. The precise delineation of these classes is not clear, e.g., where would agent words such as woman, carpenter and cat be placed? They are not nexual, as far as I can see, but does that make them innexual? There is a range of Danish words where I am unsure: landskab (landscape), musikrum (music room), etc. So far I have left such words unannotated.
  9. Where do beslutte and similar words derive from? According to Den Danske Ordbog (DDO), it derives from the Middle Low German “besluten”, but it could also be regarded as derived from the “be-” prefix and the verb “slutte”. It is only partially possible to represent both derivation paths in Wikidata.
  10. Wikidata has the “lexical category” field. For an affix it is not clear what the category should be: affix, suffix/prefix, or perhaps something else.
  11. A particular class of words at the intersection of nouns and verbs is what has been termed centaurs. They might be inflections of verbs, or they might be derivations from verbs to nouns. Examples are råben (shouting as a noun), løben (running) and skrigen (screaming). They do not seem to have any inflections themselves, so should they then be regarded as just an inflection of a verb and entered as a form under the corresponding verbal lexeme, e.g., råbe? On the other hand, DDO has råben as an independent entry, and I have also added råben as an independent lexeme in Wikidata. In Wikidata, this enables a more straightforward link to the synonym råberi.
  12. Which lexical category should the parts of a compound be linked to? Some compounds may be analyzed as arising from a noun or a verb (or possibly other lexical categories), e.g., springvand has the parts spring and vand. It is not – at least to me – clear whether spring should be regarded as linked to the noun spring or to the root form of the verb springe.
  13. Should the s-genitive forms of Danish nouns be recorded under forms? The naïve approach is to add the s-genitive forms, doubling the number of Danish noun forms. Modern linguists seem to think (if I understand them correctly) that the appended “s” is an enclitic and the s-genitive thus not a form, much like in English where the ’s genitive is not recorded as a form. For English the apostrophe separates the s from the rest of the word, so there it is natural not to include the genitive form.
  14. Hjem, hjemme and hjemad are three words and possibly one, two or three lexemes. If they are three lexemes, then how can we link them?
  15. When is a noun derived from a verb and not the other way around? It is particularly a question for (possible) root derivations, where the noun is shorter than the verb. For the noun bijob and the verb bijobbe it seems that the noun forms the basis for the derivation to the verb.
  16. The word genforhandling (renegotiation) can (in my opinion) be derived along at least two paths: gen+forhandling (re+negotiation) and genforhandl+ing (renegotiate+ion). The derived from property can contain both, but the combines property is not suitable for this case.
  17. Professortiltrædelsesforelæsning is another word where I am uncertain how best to decompose it: professor+tiltrædelsesforelæsning or professortiltrædelse+s+forelæsning?
  18. What sort of word is politiker? A nomen agentis is (usually) derived from a verb, and with the analysis of the word as politik+er, politiker would not be a nomen agentis. But Den Store Danske notes that a nomen agentis can also be derived from other “nomens”, e.g., skuespiller (actor) and skuespil (acting or play). So is it OK to regard politiker as a nomen agentis?
  19. Some words denote items that are a singular concept as a Wikidata lexeme but a collective among Wikidata’s Q-items, e.g., the lexeme dansker (Dane) corresponds to danskere (Danes) in Wikidata’s Q-items. Connecting the entities via P5137 (item for this sense) is a bit of a stretch. The same issue may arise for animals and species.

Read also my paper Danish in Wikidata lexemes (Scholia) and perhaps also Validating Danish Wikidata lexemes (Scholia).

 

Photo in graph: Reading (7052753377).jpg, Moyan Breen, CC-BY 2.0.

NeurIPS in Wikidata


scholia-neurips-2019-co-authors.png
Co-authors in the NeurIPS 2019 conference based on data in Wikidata. Screenshot based on Scholia at https://tools.wmflabs.org/scholia/event/Q61582928.

The machine learning and neuroinformatics conference NeurIPS 2019 (NIPS 2019) takes place in the middle of December 2019. The conference series has always had a high standing and has grown considerably in reputation in recent years.

All papers from the conference are available online at papers.nips.cc. There is – to my knowledge – little structured metadata associated with the papers, though the website is consistently formatted in HTML, so the metadata can be scraped relatively easily. There are no consistent identifiers that I know of that identify the papers on the site: no DOI, no ORCID iD or anything else. A few papers may be indexed here and there on third-party sites.

In the Scholia Python package, I have made a module for scraping the papers.nips.cc website and converting the metadata to the QuickStatements format for Magnus Manske’s web application that submits the data to Wikidata. The entry of the basic metadata about the papers from NeurIPS is more or less complete; a check is needed to see whether everything has been entered. One issue that the Python code attempts to counter is the case where a scraped paper has already been entered in Wikidata. Given that there are no identifiers, the matching is somewhat heuristic.
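For the curious, here is a minimal, hypothetical sketch of such a scrape-and-convert step. It is not the actual Scholia module: the proceedings URL and the HTML selector are assumptions about the page layout, and only the QuickStatements output (CREATE/LAST lines with P31, P1476, P577, P2093 and a P1545 ordinal qualifier) follows the real syntax. In practice one would also fetch each paper page for the author list and check for existing Wikidata items before emitting CREATE.

# A hypothetical sketch, not the actual Scholia module: scrape a NeurIPS
# proceedings listing and print QuickStatements (v1) commands.
import requests
from bs4 import BeautifulSoup

# Assumed URL for the NeurIPS 2019 proceedings listing; the real layout may differ.
BOOK_URL = ("https://papers.nips.cc/book/"
            "advances-in-neural-information-processing-systems-32-2019")


def paper_to_quickstatements(title, authors, year=2019):
    """Convert scraped metadata for one paper to QuickStatements v1 lines."""
    lines = ["CREATE",
             'LAST\tLen\t"{}"'.format(title),                   # English label
             "LAST\tP31\tQ13442814",                            # instance of: scholarly article
             'LAST\tP1476\ten:"{}"'.format(title),              # title
             "LAST\tP577\t+{}-00-00T00:00:00Z/9".format(year)]  # publication date, year precision
    for ordinal, author in enumerate(authors, start=1):
        # author name string with a series ordinal qualifier
        lines.append('LAST\tP2093\t"{}"\tP1545\t"{}"'.format(author, ordinal))
    return "\n".join(lines)


soup = BeautifulSoup(requests.get(BOOK_URL).text, "html.parser")
for link in soup.select("ul li a"):          # assumed structure: one link per paper
    title = link.get_text(strip=True)
    print(paper_to_quickstatements(title, authors=[]))  # author scraping omitted here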

Authors have separate webpages on papers.nips.cc with listings of the papers they published at the conference. This is quite well curated, though I have discovered authors that have several webpages associated with them: the Bayesian Carl Edward Rasmussen is found under http://papers.nips.cc/author/carl-edward-rasmussen-1254, https://papers.nips.cc/author/carl-e-rasmussen-2143 and http://papers.nips.cc/author/carl-edward-rasmussen-6617. Joshua B. Tenenbaum is also split.

Authors are not resolved with the code from Scholia; they are just represented as strings. The Author Disambiguator tool that Arthur P. Smith has built from a tool by Magnus Manske can semi-automatically resolve authors, i.e., associate the author of a paper with a specific Wikidata item representing a human. The Scholia website has particular pages (“missing”) that make contextual links to the Author Disambiguator. For the NeurIPS 2019 proceedings, the links can be seen at https://tools.wmflabs.org/scholia/venue/Q68600639/missing. There are currently over 1,400 authors that need to be resolved. Some of these are not easy: multiple authors may share the same name, e.g., the European name Andreas Krause, and I have difficulty judging how unique East Asian names are. So far only 50 authors from the NeurIPS conference have been resolved.

There is no citation information when the data is first entered with the Scholia and QuickStatements tools, and there are currently no means to enter that information automatically. The NeurIPS proceedings are – as far as I know – not available through CrossRef.

Since there is little editorial control of the format of the references, they come in various shapes and may need “interpretation”. For instance, “Semi-Supervised Learning in Gigantic Image Collections” claims a citation to “[3] Y. Bengio, O. Delalleau, N. L. Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. In NIPS, pages 2197–2219, 2004.” But that is unlikely to be a NIPS paper, and the reference should probably go to Neural Computation.

The Wikidata ontology for annotating what papers are about is not necessarily good. Some concepts in the cognitive sciences, including psychology, machine learning and neuroscience, become split or merged. Reinforcement learning is an example where the English Wikipedia article focuses on the machine learning aspect, while Wikidata also tags neuroscience-oriented articles with the concept. For many papers I find it difficult to link to the ontology because the topic of the paper is so specialized that it is hard to identify an appropriate Wikidata item.

With the data in Wikidata, it is possible to examine many aspects of the data with the Wikidata Query Service and Scholia. For instance,

  1. Who has the most papers at NeurIPS 2019? A panel on a Scholia page readily shows this to be Sergey Levine, Francis Bach, Pieter Abbeel and Yoshua Bengio.
  2. The heuristically computed topic scores on the event page for NeurIPS 2019 show that reinforcement learning, generative adversarial networks, deep learning, machine learning and meta-learning are central topics this year. (Here one needs to keep in mind that the annotation in Wikidata is incomplete.)
  3. Which Danish researcher has been listed as an author on the most NeurIPS papers through time? This can be asked with a query to the Wikidata Query Service: https://w.wiki/DTp. It depends on what is meant by “Danish”; here it is based on employment/affiliation and gives Carl Edward Rasmussen, Ole Winther, Ricardo Henao, Yevgeny Seldin, Lars Kai Hansen and Anders Krogh. A sketch of this kind of query is shown below the list.
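Below is a hedged sketch of a query along these lines, restricted for simplicity to the NeurIPS 2019 proceedings item (Q68600639) and modelling “Danish” via the employer’s country (P108/P17); the actual query behind https://w.wiki/DTp covers the whole conference series and may be formulated differently. The snippet simply sends the SPARQL to the Wikidata Query Service from Python.

# A sketch, not the exact query from the post: authors with a Danish employer,
# counted over papers in the NeurIPS 2019 proceedings (Q68600639).
import requests

QUERY = """
SELECT ?author ?authorLabel (COUNT(DISTINCT ?work) AS ?count) WHERE {
  ?work wdt:P1433 wd:Q68600639 ;      # published in: NeurIPS 2019 proceedings
        wdt:P50 ?author .             # author
  ?author wdt:P108 ?employer .        # employer ...
  ?employer wdt:P17 wd:Q35 .          # ... with country: Denmark
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da". }
}
GROUP BY ?author ?authorLabel
ORDER BY DESC(?count)
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "neurips-wikidata-example/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["count"]["value"], row["authorLabel"]["value"])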

HACK4DK 2019: Lydmaleri


Lydmaleri
Screenshot from Lydmaleri, my HACK4DK 2019 contribution.

At HACK4DK, an annually recurring event in autumn in Copenhagen, museums, libraries, archives and whatever other institutions there are bring their open data so that enthusiasts in the form of programmers, data scientists, designers and the like can build things, typically a computer program with a visualization.

I believe I have participated since 2013. In any case, my blog has the image remixes I made: Gammelstrand remixed and Jailhouse remixed, and later Kulturvet remixed and Fishy fishmongers of Fischer. Newer HACK4DK contributions can be found at https://fnielsen.github.io/. Last year the result was an analysis of Danish films with data from Det Danske Filminstitut, via the data that mainly Steen Thomassen has transferred to Wikidata.

HACK4DK 2019 played out at SMK over just one and a half days, Friday and Saturday, 15–16 November 2019. The result was quite a good range of projects and visualizations. While in previous years it has been hit and miss whether people were able to get something useful out of deep learning, this time several projects landed quite well with this technique.

One project combined StyleGANs, DeOldify and GPT-2 text generation on old photos from Kolding, a classic dataset in HACK4DK contexts. A glimpse of the result is shown in one of Andreas Refsgaard’s tweets. Here the images are apparently sampled in a latent subspace and colorized via DeOldify. Perhaps the most interesting part was the fake biographies that could be created from the text accompanying the photos, GPT-2 and a bit of help in the form of text beginnings. Runway ML was used.

Several participants used, as far as I understood, pretrained JavaScript versions of style-transfer networks, such as arbitrary-image-stylization-tfjs.

Albin Larsson constructed an interactive visualization of SMK paintings, as far as I understand based on Christopher Pietsch’s VIKUS Viewer.

For Lydmaleri I used paintings and painting data from Danish collections as they are represented in Wikidata and Wikimedia Commons; the web user is presented with a painting with regions tied to relevant sounds. For example, the painting Et selskab af danske kunstnere i Rom (SMK, Wikipedia, Wikimedia Commons, Wikidata) shows a dog in the right corner. When users click the dog in Lydmaleri, a woof sounds. The user can also click on to other paintings. To fetch the data I use a SPARQL query sent to the Wikidata Query Service. The result is handled by the web page’s JavaScript, which plays the sound when the click falls inside the sound rectangle and swaps the painting and sounds when the user browses on.

Lydmaleri uses no machine learning. Instead, the object recognition in the image is based on information explicitly entered in Wikidata. For Et selskab af danske kunstnere i Rom it is thus specified that a dog is depicted, and with a so-called qualifier percent coordinates can indicate where in the image the dog is located. Lucas Werkmeister’s wd-image-positions web application can be used for entering the coordinates. This is quite a time-consuming process, which can, however, be done collaboratively on the Internet.
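As a hedged sketch of the kind of data Lydmaleri builds on, the query below asks the Wikidata Query Service for paintings whose depicts statements (P180) carry the relative position within image qualifier (P2677) with the percent coordinates. This is not the app’s actual query, and the Python wrapper is just one way of fetching the result.

# A sketch: paintings with "depicts" statements that have percent-coordinate
# regions, i.e., the kind of data Lydmaleri uses. Not the app's actual query.
import requests

QUERY = """
SELECT ?painting ?paintingLabel ?image ?depicted ?depictedLabel ?region WHERE {
  ?painting wdt:P31 wd:Q3305213 ;     # instance of: painting
            wdt:P18 ?image ;          # image on Wikimedia Commons
            p:P180 ?statement .       # a "depicts" statement
  ?statement ps:P180 ?depicted ;
             pq:P2677 ?region .       # relative position within image (percent coordinates)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en". }
}
LIMIT 20
"""

rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "lydmaleri-example/0.1"},
).json()["results"]["bindings"]

for row in rows:
    print(row["paintingLabel"]["value"], "|",
          row["depictedLabel"]["value"], "|", row["region"]["value"])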

As a front-end developer wannabe, my CSS and JavaScript skills sometimes fall short. The image loading could be faster, and the positioning of the images and the click regions could also be improved. There are challenges with certain systems: Apple’s Safari browser apparently will not play the sounds, presumably because the OGG audio format is not supported. My Ubuntu Firefox and Ubuntu Google Chrome have no such problems. Android systems may have trouble displaying some of the page’s components the way I intended.

On the road to joint embedding with Wikidata lexemes?


road-to-joint-embedding

Is it possible to use Wikidata lexemes for joint embedding, i.e., combining word embedding and knowledge graph entity embedding?

You can create on-the-fly text examples for joint embedding with the Wikidata Query Service. This SPARQL query attempts to interpolate a knowledge graph entity identifier into a text using the usage example text (P5831):

 SELECT * {
  ?lexeme dct:language ?language ;
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm ?form ;
          p:P5831 [
            ps:P5831 ?text ;
            pq:P5830 ?form 
          ] .
  BIND(SUBSTR(STR(?form), 32) AS ?entity)

  ?form ontolex:representation ?word .
  BIND(REPLACE(?text, STR(?word), ?entity) AS ?interpolated_text)
}

The result is here.

The interpolations are not perfect: there is a problem with capitalization at the beginning of a sentence, and short words may be interpolated into the middle of longer words (I have not been able to get a regular expression with the word boundary “\b” working). Alternatively, the SPARQL query result may be downloaded and the interpolation performed in a language that supports more advanced regular expression patterns.
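As a sketch of that alternative, the snippet below downloads a trimmed version of the query result from the Wikidata Query Service and performs the interpolation in Python, where the “\b” word boundary is available; the case-insensitive flag is a crude way around the sentence-initial capitalization.

# Client-side interpolation sketch: fetch usage examples and replace the word
# with the form identifier using a word-boundary regular expression.
import re
import requests

QUERY = """
SELECT ?entity ?word ?text WHERE {
  ?lexeme ontolex:lexicalForm ?form ;
          p:P5831 [ ps:P5831 ?text ; pq:P5830 ?form ] .
  ?form ontolex:representation ?word .
  BIND(SUBSTR(STR(?form), 32) AS ?entity)   # strip the entity URI prefix
}
LIMIT 100
"""

rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "lexeme-interpolation-example/0.1"},
).json()["results"]["bindings"]

for row in rows:
    word = row["word"]["value"]
    entity = row["entity"]["value"]
    text = row["text"]["value"]
    # \b keeps short forms from matching inside longer words; IGNORECASE handles
    # sentence-initial capitalization (at the cost of matching other casings too).
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), flags=re.IGNORECASE)
    print(pattern.sub(entity, text))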

The number of annotated usage examples in Wikidata across languages is ridiculously small compared to the corpora typically applied in successful word embedding.

Update:

You can also interpolate the sense identifier: Here is the Wikidata Query Service result.

Danish public domain authors publishing after the spelling reform of 1948


One annoying issue when finding Danish language usage examples for Wikidata lexemes is the spelling reform of 1948 combined with Wikidata’s requirement of the Creative Commons Zero license.

The spelling reform of 1948 means that old public domain works in Danish, e.g., by Søren Kierkegaard and Hans Christian Andersen, use an old spelling with capitalized first letters for common nouns, “aa” instead of the modern “å”, and certain other spelling variations.

Works in Danish published after 1948 might have the new spelling (but verbatim reprints/republications of, e.g., Hans Christian Andersen’s works might still have the old spelling). Unfortunately the copyright law requires the author to have been dead for more than 70 years before his/her works fall into the public domain and we can use them in Wikidata (it is unclear – to me at least – whether the use of short excerpts, e.g., a subsentence from a copyrighted work, can be regarded as public domain). Given that we are now more than 70 years away from 1948, we might begin to be “lucky” enough to see works published after the spelling reform whose authors have died, e.g., in 1949. Such works will soon fall into the public domain and we could use them in various contexts in Wikidata, particularly for the usage examples in Wikidata lexemes. Can we find such works?

My idea was to turn to Wikidata and formulate a SPARQL query against Wikidata Query Service for works published after 1948 and where the author has a death date. Here is one attempt:

SELECT ?work ?workLabel ?author ?authorLabel ?death_date WHERE {
  ?work wdt:P50 ?author .
  ?work wdt:P407 wd:Q9035 .
  ?work wdt:P577 ?publication_date .
  ?author wdt:P570 ?death_date .
  FILTER (YEAR(?publication_date) > 1948)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],da,en". }
}
ORDER BY ?death_date
LIMIT 100

The result is available here. Works by Steen Steensen Blicher, Søren Kierkegaard, H.C. Andersen, Meïr Aron Goldschmidt and Ludvig Mylius-Erichsen are in the public domain, and some of the works have been published after 1948. Some of Ludvig Mylius-Erichsen’s works are available on Wikisource, e.g., Julegæster fra havet. The version on Wikisource uses modern Danish spelling. It has been used a bit for Wikidata lexemes; see the Ordia page for Julegæster fra havet: https://tools.wmflabs.org/ordia/reference/Q22084925.

Wikidata lexemes and Ordia


ordia-danish-lexical-categories

In 2018, Wikidata gained the ability to represent lexemes (dictionary entries), including their forms (i.e., inflections) and senses. The Wikidata pages for lexemes differ from the ordinary item pages on Wikidata: there are dedicated fields for specifying language, lexical category (part of speech) and grammatical features, and the senses have glosses. The idea is to make Wikidata function as a structured and machine-readable counterpart to Wiktionary.

Since Wikidata, and thereby Wikidata’s lexemes, are under the Creative Commons Zero license, it is not straightforward to find good lexicographic resources, and the lexemes are more or less entered manually. A few online tools ease the entry: Lucas Werkmeister’s forms and Alicia Fagerving’s senses. English lexemes are, perhaps not surprisingly, the ones there are currently most of. French, Swedish, Norwegian Nynorsk, Polish and German are also well represented. For Danish, I have entered well over 1,000 lexemes with their inflections and a good number of senses. Many are linked to the Danish wordnet known as DanNet. Some senses, particularly for the nouns, are linked to Wikidata’s ordinary items. From there one can “walk around” the knowledge graph and get hyponyms, hypernyms, synonyms and translations.

The coverage of Wikidata’s lexemes, both in terms of the number of lexemes and of the interlinking, is still somewhat weak, and the various dictionaries one could build from the data (an etymological dictionary, a translation dictionary, a thesaurus, a spelling dictionary) are for the moment rather meagre.

In parallel with the entry of lexemes, I have developed, and continue to develop, a web application for displaying Wikidata’s lexemes: Ordia. It is available from Wikimedia’s compute cloud Toolforge. Since Ordia uses the Wikidata Query Service, it is possible to create pages in Ordia that gather information from different Wikidata pages. In Ordia one can, for example, get a list of all verbs of motion or all nouns. Ordia also has a text-to-lexemes feature where one can enter a text: the web application will extract the words from the text, make a query against the Wikidata Query Service with the words and show the matched lexeme forms and their senses.
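A hedged sketch of such a text-to-lexemes lookup is shown below: split a Danish sentence into lower-cased words, put them in a VALUES block, and ask the Wikidata Query Service for Danish lexeme forms (language Q9035) with a matching representation. Ordia’s own query is more elaborate (it also fetches senses), so treat this purely as an illustration; the example sentence is made up.

# Text-to-lexemes sketch: look up which Danish lexeme forms match the words
# of a short text. Not Ordia's actual query.
import re
import requests

text = "Hunden løber i haven"
words = sorted({w.lower() for w in re.findall(r"\w+", text)})
values = " ".join('"{}"@da'.format(w) for w in words)

QUERY = """
SELECT ?word ?lexeme ?lemma WHERE {{
  VALUES ?word {{ {values} }}
  ?lexeme dct:language wd:Q9035 ;       # Danish
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?word .
}}
""".format(values=values)

rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "text-to-lexemes-example/0.1"},
).json()["results"]["bindings"]

for row in rows:
    print(row["word"]["value"], "->", row["lexeme"]["value"],
          "({})".format(row["lemma"]["value"]))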

There are still many unclear elements and open questions in the annotation of the lexemes. For example, is the way we indicate that a verb is a placement verb useful? Should the transitive and intransitive versions of the verb “hængte” be one or two lexemes? Should we indicate a translation for each individual sense? Should the Danish s-genitive be recorded in Wikidata? Can we specify grammar with Wikidata, so that it would eventually be possible to build a grammar checker? What can the Wikidata lexemes be used for at all?

Ordia: Suggestion for a lightning talk at WikidataCon 2019


Ordia is a Wikidata front-end running on the Wikimedia Toolforge: https://tools.wmflabs.org/ordia/. Ordia displays information about the lexemes of Wikidata, including their forms and senses. It makes use of the Wikidata Query Service and can thus aggregate information from various Wikidata pages. For instance, the language aspect shows statistics on the number of lexemes, forms and senses with respect to languages. Ordia also shows overviews of lexical categories, grammatical features, properties and the use of references. If a user inputs a text into a specific input field, Ordia can extract the individual words and query for them. This talk will demonstrate the various uses of Ordia and briefly discuss the status of Wikidata lexemes.

Ideal number of attendees: 20

Take away: Attendees will know how to use Ordia and the limitations of Ordia and Wikidata lexemes.