First experiments with the T0 Hugging Face language model

The T0 models were released in October 2021, available via Hugging Face (see bigscience/T0pp) and described in the paper Multitask Prompted Training Enables Zero-Shot Task Generalization (Scholia). The researchers behind the model claim that “The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.”

The small language model, T0_3B, contains 3 billion parameters and fills up 11 gigabytes of disk space at ~/.cache/huggingface/transformers/a80e28…

After setting up protobuf, torch, and transformers, the model is downloaded automatically and a test can be run. On the Hugging Face webpage, there are a few lines of Python code with a sentiment analysis example, here converted to use the small model and slightly edited:

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy", return_tensors="pt"))[0]))
<pad> Positive</s>
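
As a small sanity check of the figures above, the parameter count and the in-memory size of the loaded model can be computed directly; a minimal sketch (roughly three billion float32 parameters at four bytes each is consistent with the roughly 11 gigabytes seen on disk):

>>> sum(p.numel() for p in model.parameters())  # number of parameters, roughly 3 billion
>>> sum(p.numel() * p.element_size() for p in model.parameters()) / 1e9  # approximate size in gigabytes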

It is unclear to me how well these large pre-trained language models handle languages other than English, and my knowledge of prompt engineering is also limited, so the examples below are my naive first-shot attempts:

>>> print(tokenizer.decode(model.generate(tokenizer.encode("Hvem er statsminister i Danmark?", return_tensors="pt"))[0]))
<pad> <unk>ystein Svensson</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("English: Copenhagen is the capital. Danish:", return_tensors="pt"))[0]))
<pad> Copenhagen is the capital of Denmark.</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("English: Copenhagen is the capital. Translate to French:", return_tensors="pt"))[0]))
<pad> Copenhagen is the capital of Denmark.</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("What city is the second largest in Denmark?", return_tensors="pt"))[0]))
<pad> Copenhagen</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote Der er et yndigt land?", return_tensors="pt"))[0]))
<pad> Theodore Roosevelt</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote the song 'She loves you'?", return_tensors="pt"))[0]))
<pad> John Lennon and Paul McCartney</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Who wrote 'Der er et yndigt land'?", return_tensors="pt"))[0]))
<pad> <unk>sgeir <unk>sgeirsson</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Question: Who wrote 'Der er et yndigt land'? Answer:", return_tensors="pt"))[0]))
<pad> Henrik Ibsen</s>
>>> print(tokenizer.decode(model.generate(tokenizer.encode("Question: What city is the second largest in Denmark? Answer:", return_tensors="pt"))[0]))
<pad> Copenhagen</s>

Each of these answers took over 20 seconds to complete on the CPU-based system I initially used.
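
For reference, a minimal sketch of how a single generation can be timed, and how the model could be moved to a GPU if one is available (not tested here; the device choice and any timings are assumptions that will vary with hardware):

>>> import time
>>> import torch
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model = model.to(device)
>>> input_ids = tokenizer.encode("Question: What city is the second largest in Denmark? Answer:", return_tensors="pt").to(device)
>>> start = time.time()
>>> output = model.generate(input_ids)
>>> print(tokenizer.decode(output[0]), "-", time.time() - start, "seconds")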

On the non-Danish question “Who wrote the song ‘She loves you’?” the model gets the answer right, but for the Danish questions it fails.

For the question “Who wrote Der er et yndigt land?”, i.e., who wrote the Danish national anthem, T0_3B incorrectly answers Theodore Roosevelt or Henrik Ibsen, depending on the prompt, while the Google search engine returns “Adam Oehlenschläger” for me.

The question can be converted to SPARQL for submission to the Wikidata Query Service:

SELECT ?who {
  # Any work with this Danish label
  ?work rdfs:label 'Der er et yndigt land'@da ;
        # author, lyrics by, composer, screenwriter, author of foreword, author of afterword
        ( wdt:P50 | wdt:P676 | wdt:P86 | wdt:P58 | wdt:P2679 | wdt:P2680 ) / rdfs:label ?who .
  FILTER (LANG(?who) = 'en')
}

The result of the query is:

Adam Oehlenschläger
Morten Arnfred
Jørgen Ljungdalh
Hans Ernst Krøyer

Oehlenschläger is the author of the text, Krøyer the composer, and Arnfred and Ljungdalh are the screenwriters of a Danish film with the same title as the anthem.
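
For completeness, the query can also be submitted to the Wikidata Query Service programmatically; a minimal sketch assuming the public https://query.wikidata.org/sparql endpoint and the requests library (the User-Agent string is just a placeholder):

import requests

# Same query as above, sent to the public Wikidata Query Service endpoint
query = """
SELECT ?who {
  ?work rdfs:label 'Der er et yndigt land'@da ;
        ( wdt:P50 | wdt:P676 | wdt:P86 | wdt:P58 | wdt:P2679 | wdt:P2680 ) / rdfs:label ?who .
  FILTER (LANG(?who) = 'en')
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "t0-experiment-blog-example"},
)
for binding in response.json()["results"]["bindings"]:
    print(binding["who"]["value"])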

Simon Razniewski (Scholia), Gerhard Weikum (Scholia), and colleagues have recently published their DL4KG 2021 paper Language Models As or For Knowledge Bases (Scholia), where they contemplate the limitations and advantages of language models versus knowledge bases/graphs. They have had access to the GPT-3 language model:

Example: GPT-3 does not have tangible knowledge that Alan Turing was born in London; it merely assigns this a high confidence of 83%. Yann LeCun, on the other hand, is given medium confidence in being a citizen of France and Canada (67% and 26%), but he actually has French and USA citizenship, not Canadian. The LM assigns USA a very low score. The Wikidata KB, on the other hand, only states his French citizenship, not USA. Wikidata is incomplete, but it does not contain any errors.

Language Models As or For Knowledge Bases, page 2