Not to match with regular expression?


Apparently, it is not particularly easy to use Perl-style regular expressions to match text that does not contain a given combination of characters.

One case where this is relevant is Twitter, where you want to find “mentions” in the tweet text. A mention is indicated with “@user”. If you do not want to count retweets as mentions, you need to exclude them. Retweets are usually indicated with “RT @user”, so you want to find instances of “@user” that are not preceded by “RT” or any of its variants, e.g., “RT: ”. The problem occurs in the article Want to be retweeted? Large scale Analytics on factors impacting retweet in Twitter network. See also my previous post Twitter retweet analysis.

My first attempts at the non-matching problem with the Python re module are here:

>>> import re
>>> re.findall(r"(?:\bRT:?\s*){0}@(\w+)", "@anders RT @bjarne")
['anders', 'bjarne']
>>> re.findall(r"(?:RT:?\s*){0,0}@(\w+)", "@anders RT @bjarne")
['anders', 'bjarne']

Here the task is to match “anders” but not “bjarne”, and the attempts fail: a group repeated zero times matches the empty string, so it excludes nothing. The perlre manual turns out to be of some help. It describes the “zero-width negative look-ahead”, written “(?!pattern)”. What you want here, however, is a negative look-behind, written “(?<!pattern)”. Unfortunately, look-behind only works with fixed-width patterns. So you could write the following regular expression, which is not perfect but covers a good percentage of tweets:

>>> re.findall(r"(?<!\bRT )@(\w+)", "@anders RT @bjarne")
['anders']
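
The fixed-width restriction can be seen directly in Python. The following is a minimal sketch; the helper name compiles is only for illustration. It checks whether a pattern is accepted by re.compile, which raises re.error for a look-behind whose width can vary:

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised, among other cases, when a look-behind is not fixed-width.
        return False

# Fixed width ("RT " is always three characters): accepted.
print(compiles(r"(?<!\bRT )@(\w+)"))      # True
# Variable width (optional colon, any amount of whitespace): rejected.
print(compiles(r"(?<!\bRT:?\s*)@(\w+)"))  # False
```

So the natural way of writing the pattern, with “:?” and “\s*” inside the look-behind, is not available in the re module.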

It is not easy to circumvent the fixed-width problem. The following two examples won't work:

>>> re.findall(r"(?<!\bRT)\s*@(\w+)", "@anders RT @bjarne")
['anders', 'bjarne']
>>> re.findall(r"(?:(?<!\bRT )|(?<!\bRT: ))@(\w+)", "@anders RT @bjarne")
['anders', 'bjarne']
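
The alternation fails because it asks for “not preceded by A or not preceded by B”, which is almost always true. One possible workaround, sketched here under the assumption that only a handful of retweet prefixes need to be covered, is to put several fixed-width look-behinds back to back, so that all of them must hold. The names MENTION and mentions_chained are just for illustration:

```python
import re

# Consecutive negative look-behinds are combined with AND:
# the "@" must not be preceded by "RT ", nor "RT: ", nor "RT:", nor "RT".
MENTION = re.compile(r"(?<!\bRT )(?<!\bRT: )(?<!\bRT:)(?<!\bRT)@(\w+)")

def mentions_chained(text):
    """Return user names in text that do not directly follow a retweet marker."""
    return MENTION.findall(text)

print(mentions_chained("@anders RT @bjarne"))         # ['anders']
print(mentions_chained("RT: @bjarne hello @anders"))  # ['anders']
```

This still does not handle an arbitrary amount of whitespace, e.g. two spaces after “RT”; every variant needs its own look-behind.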

Inspired by the perlre manual and its suggestion “if (/bar/ && $` !~ /foo$/)” you can do something similar with two regular expressions:

>>> [s[1:] for s in re.findall(r"((?:\bRT:?\s*)?@\w+)", "@anders RT @bjarne") if not re.match(r"^RT", s)]
['anders']
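
A somewhat more literal rendering of the perlre suggestion could look like the sketch below, where the text before each match plays the role of Perl's prematch variable. The names MENTION, RETWEET_PREFIX and mentions_two_step are made up for this example:

```python
import re

MENTION = re.compile(r"@(\w+)")
# Applied to the text *before* a mention, so variable width is no problem.
RETWEET_PREFIX = re.compile(r"\bRT:?\s*$")

def mentions_two_step(text):
    """Return user names whose "@" is not directly preceded by an RT marker."""
    found = []
    for m in MENTION.finditer(text):
        before = text[:m.start()]  # corresponds to Perl's $` (prematch)
        if not RETWEET_PREFIX.search(before):
            found.append(m.group(1))
    return found

print(mentions_two_step("@anders RT @bjarne"))     # ['anders']
print(mentions_two_step("RT:   @bjarne @anders"))  # ['anders']
```

Because the second expression is an ordinary search anchored with “$”, it can use “:?” and “\s*” freely, which the look-behind could not.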

Not necessarily pretty.
