Part-of-speech tagging with NLTK in Python and "Jensen and Sons"

An old joke tells of Mr. Jensen, who runs a company with his two sons. Mr. Jensen orders a new sign to put up in front of his shop, reading “Jensen and Sons”. The sign painter sends Mr. Jensen a draft. However, Mr. Jensen is not satisfied with the spacing of the words, and he writes back to the painter: “I would like to have more space between Jensen and and and and and Sons”.

Our course in Python programming (Technical University of Denmark course 02820) is presently running. One of the course topics is text and web mining, and the associated slides are available.

The Natural Language Toolkit (NLTK) Python module provides convenient support for some text mining approaches. Part-of-speech tagging (POS tagging) is part of NLTK, and it is fairly straightforward to use. Applied to the joke, the Python code goes like this:

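>>> import nltk
>>> s = "I would like to have more space between Jensen and and and and and Sons"
>>> # Split the sentence into word tokens
>>> words = nltk.word_tokenize(s)
>>> # Tag each token with its part of speech (Penn Treebank tag set)
>>> nltk.pos_tag(words)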
[('I', 'PRP'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'), ('have', 'VB'), ('more', 'JJR'), ('space', 'NN'), ('between', 'IN'), ('Jensen', 'NNP'), ('and', 'CC'), ('and', 'CC'), ('and', 'CC'), ('and', 'CC'), ('and', 'CC'), ('Sons', 'NNPS')]

Every ‘and’ is tagged CC, a ‘coordinating conjunction’, even though two of the five are mentions of the word on the sign rather than conjunctions. I am not sure NLTK sees the joke there.
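
To make the tagger's view of the sentence a little more explicit, the tagged list can be summarized with the standard library alone. This is only a minimal sketch: the list is copied from the session above rather than recomputed, so it runs without NLTK, and the names `tagged`, `tag_counts`, `runs` and `repeated` are my own, chosen for illustration.

```python
from collections import Counter
from itertools import groupby

# Tagged output copied from the session above (not recomputed here).
tagged = [('I', 'PRP'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'),
          ('have', 'VB'), ('more', 'JJR'), ('space', 'NN'), ('between', 'IN'),
          ('Jensen', 'NNP'), ('and', 'CC'), ('and', 'CC'), ('and', 'CC'),
          ('and', 'CC'), ('and', 'CC'), ('Sons', 'NNPS')]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Collapse back-to-back repeats of the same (word, tag) pair into runs.
runs = [(pair, len(list(group))) for pair, group in groupby(tagged)]

# Keep only the pairs that occur more than once in a row.
repeated = [(pair, n) for pair, n in runs if n > 1]

print(tag_counts.most_common(2))
print(repeated)
```

The only run longer than one is the five consecutive ('and', 'CC') pairs. A plain coordinating conjunction would rarely follow itself, so a run like this is a hint that some of the tokens are being mentioned rather than used.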
