A recent xkcd cartoon has started some deep academic thinking. When AI expert Peter Norvig gets involved you know the algorithms are going to fly.
Xkcd is our favourite comic and regular expression are definitely our favourite way to spend the time waiting for the next installment - so what could be better than an xkcd cartoon on Regex Golf.
Code Golf is a reasonably well known sport of trying to code an algorithm in the shortest possible code. Regex Golf is similar, but in general the aim is to create a regular expression that accepts the strings in one list and rejects the strings in a second list.
This should have been enough, but clearly we are programmers and we like recursion and a regex is a string after all and a regex can process a string:
Yes my friend, it's regexes all the way down!
Mind boggling enough, but the hover over text, a key feature of many xkcd funnies, is where the problem really starts to take off and gain a life of its own.
The text gives a regular expression that matches the last names of the elected US presidents, but not the losers.
This started Peter Norvig, the well-known computer scientist, director of research at Google and wearer of brightly colored shirts, thinking about the problem. Is it possible to write a program that would create a regular expression to solve the xkcd problem?
The problem is one level of meta, i.e. it is a program that creates a regular expression, but it is a difficult one that needs AI-like search techniques.
The basic algorithm is to first derive a pool of regexs for short character sequences in the data that match a least one winner but no losers. Then create a bigger regex by or-ing a selection of components together. Vary the selection until you get a regex that matches all of the winners but none of the losers.
This is equivalent to the set cover problem and hence NP hard.
To try to find the smallest regex that does the job the simplest approximate way is to use a greedy selection that picks component regexs that maximally separate the two lists first.
From here the algorithm can be improved by adding a set of rules for generating the initial pool of regexs.
To find out more read the complete description, including Python code, at Peter Norvig's blog post which ends with the challenge:
"I hope you found this interesting, and perhaps you can find ways to improve my algorithm, or more interesting lists to apply it to. I found it was fun to play with, and I hope this page gives you an idea of how to address problems like this."
Or, of course, you could take the problem to one stage more of meta e.g meta-meta-regexgolf - a program that writes programs that writes regexs....
If you would like to play more Regex Golf, with or without the help of a program, visit: http://regex.alf.nu/