How to quickly generate word analogy datasets with Wikidata

Posted on Updated on

One popular task in computational linguistics/natural language processing is the word analogy task: Copenhagen is to Denmark as Berlin is to …?

With queries to Wikidata Query Service (WDQS) it is reasonably easy to generate word analogy datasets in whatever (Wikidata-supported) language you like. For instance, for capitals and countries, a WDQS SPARQL query that returns results in Danish could go like this:

select
  ?country1Label ?capital1Label
  ?country2Label ?capital2Label
where { 
  ?country1 wdt:P36 ?capital1 .
  ?country1 wdt:P463 wd:Q1065 .
  ?country1 wdt:P1082 ?population1 .
  filter (?population1 > 5000000)
  ?country2 wdt:P36 ?capital2 .
  ?country2 wdt:P463 wd:Q1065 .
  ?country2 wdt:P1082 ?population2 .
  filter (?population2 > 5000000)
  filter (?country1 != ?country2)
  service wikibase:label
    { bd:serviceParam wikibase:language "da". }  
} 
limit 1000

Follow this link to get to the query and press “Run” to get the results. It is possible to download the table as CSV-formatted (see under “Download”). One issue to note that you have multiple entries for countries with multiple capital cities, e.g., Sydafrika (South Africa) is listed with Pretoria, Kapstaden (Cape Town) and Bloemfontein.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s