xkcd 1313: Regex Golf

I found that the hover text, “/bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls/ matches the last names of elected US presidents but not their opponents.”, contains a confusing contradiction. There are several last names (like “Nixon”) that denote both elected presidents and opponents. So no regular expression could both match and not match “Nixon”. I could only assume that Randall meant for these names to be winners and not losers (and in fact he later confirmed that was the correct interpretation).

So that got me thinking: can I come up with an algorithm to find a short regex that covers the winners and not the losers?

I started by finding a page that lists winners and losers of US presidential elections through 2000. Adding the 2004-2012 results I get:  …

via xkcd 1313: Regex Golf
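
One way to attack the question raised in the quoted post is a greedy set cover: generate small regex pieces from the winners, discard any piece that matches a loser, then repeatedly pick the piece that covers the most still-unmatched winners for its length. The sketch below is only an illustration under those assumptions, not necessarily the method the linked post ends up with. The names `candidate_parts`, `find_regex`, and `verify`, the scoring weight of 4, and the tiny `WINNERS`/`LOSERS` lists are all made up for the example (the real election data is elided above).

```python
import re

# Hypothetical sample data, NOT the real election lists.
# "adams" appears on both sides to mirror the "Nixon" problem.
WINNERS = ["washington", "adams", "jefferson", "madison"]
LOSERS = ["clinton", "pinckney", "king", "adams"]


def matches(regex, names):
    """Return the subset of names that the regex matches anywhere."""
    return {name for name in names if re.search(regex, name)}


def candidate_parts(winners, max_len=4):
    """Build candidate pieces: each whole name anchored, plus short substrings."""
    parts = set()
    for w in winners:
        parts.add("^" + re.escape(w) + "$")
        for n in range(1, max_len + 1):
            for i in range(len(w) - n + 1):
                parts.add(re.escape(w[i:i + n]))
    return parts


def find_regex(winners, losers):
    """Greedily OR together pieces until every winner is matched."""
    winners = set(winners)
    losers = set(losers) - winners  # a name on both lists counts as a winner
    pool = [p for p in candidate_parts(winners) if not matches(p, losers)]
    uncovered = set(winners)
    chosen = []
    while uncovered:
        # Keep only pieces that make progress; the anchored whole name of any
        # uncovered winner is always in the pool, so this list is never empty.
        useful = [(len(matches(p, uncovered)), p) for p in pool]
        useful = [(n, p) for n, p in useful if n > 0]
        # Assumed scoring: reward coverage, penalise length, break ties by text.
        _, best = max(useful, key=lambda np: (4 * np[0] - len(np[1]), np[1]))
        chosen.append(best)
        uncovered -= matches(best, uncovered)
    return "|".join(chosen)


def verify(regex, winners, losers):
    """Check that the regex matches every winner and no pure loser."""
    losers = set(losers) - set(winners)
    return (all(re.search(regex, w) for w in winners)
            and not any(re.search(regex, l) for l in losers))


if __name__ == "__main__":
    regex = find_regex(WINNERS, LOSERS)
    print(regex, verify(regex, WINNERS, LOSERS))
```

Greedy selection does not guarantee the shortest possible regex; it only guarantees a valid one, since the anchored full names are always available as a fallback.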

Apparently there is a Regex Golf game.

Type a regex in the box. You get ten points per correct match. Hit Enter to go to the next ‘level’.

Natural Language Toolkit — NLTK 2.0 documentation

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

via Natural Language Toolkit — NLTK 2.0 documentation.

From: http://www.cloudera.com/blog/2010/03/natural-language-processing-with-hadoop-and-python/

NLP is a highly interdisciplinary field of study comprising concepts and ideas from Mathematics, Computer Science and Linguistics. Naturally occurring instances of human language, be it text or speech, are growing at an exponential rate given the popularity of the Web and social media. In addition, people are becoming increasingly reliant on internet services to search, filter, process and, in some cases, even understand the subset of such instances they encounter in their daily lives.

NLP = Natural Language Processing

Python e-book error

Python will turn your everyday binary strings into Unicode strings when necessary. But things get trickier if you put non-ASCII characters in byte strings.

via http://lobstertech.com/python_unicode.html

The ebook reader on Linux was written in Python and chokes on a lot of unusual characters, and I think the above is the reason why. This could also be a problem with the Nook and other ebook readers, but I’m not sure, and it’s really not a priority to find out. I do know the Nook will not accept an EPUB with a malformed stylesheet.css, even though the Python-based Linux e-reader will.

It will be interesting to find out how some of the bigger tablets handle this.
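
The kind of failure the quote describes can be sketched in a few lines. The quote is about Python 2, which silently decoded byte strings as ASCII when they were mixed with Unicode strings. The sketch below uses Python 3, which refuses to mix the two types at all, and reproduces the ASCII decoding error explicitly. Whether this is what actually trips up the ebook reader is still only my guess; `to_text` is a hypothetical helper, not something taken from that program.

```python
# Bytes as they might come out of an EPUB file: "café" encoded as UTF-8.
raw = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'


def to_text(data, encoding="utf-8"):
    """Decode bytes with an explicit encoding; pass str through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Decoding as ASCII fails as soon as a non-ASCII byte shows up.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Python 3 will not silently combine bytes and text.
try:
    raw + " au lait"
    mix_error = None
except TypeError as exc:
    mix_error = exc

if __name__ == "__main__":
    print(repr(ascii_error))
    print(repr(mix_error))
    print(to_text(raw) + " au lait")
```

Decoding once at the boundary with a known encoding, and working with text everywhere else, avoids both errors.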