The Trick Of The Mind - Regular Little Language |
Written by Mike James | ||||||
Tuesday, 01 July 2025 | ||||||
Page 2 of 2
Does the expression A*B*A* match strings like BBAA, i.e. where there are no starting As? With regular expressions it is often as important to know what they don’t match as what they do match. The asterisk does indeed match zero occurrences of the letter as well as one or more. What this means is that A*B*A* will match BAA, AAAA, AAAABB, BBB and so on. In other words, the regular expression doesn’t really capture the idea of some As followed by some Bs followed by some As. If you only want to match a “true” ABA pattern then you need the plus sign quantifier which demands at least one match. So for example, A+B+C+ needs at least one A, B and C. So the shortest string it matches is ABC and it also matches AABBCC, AAABC and more, but not AACC ,for example, where a letter is missing , i.e. occurs zero times. This difference between zero things and one thing is often important in programming and you will meet many examples of the distinction. Here it is as part of regular expressions, where the asterisk matches any number of items including zero and the plus matches any number of items excluding zero. It all depends what you mean by “any number of items”. It is often said that only a mathematician would include zero in “any number of items”, but clearly programmers also sometimes have to. The final quantifier is the question mark which matches zero or one occurrence of the character. For example, A?B?C? matches ABC and A, B, C, AB, BC and AC, but not AABBCC etc. The three quantifiers *,+ and ? are the original ones defined as part of regular expressions, but today you can write {min,max} to specify an exact range of repeats. For example, A{1,1}B{2,2}A{1,1} matches only ABBA but A{1,2}B{2,2}A{1,2} also matches ABBAA. AABBA and AABBBB. There are many more special characters you can use to specify what matches the string you are looking for and, as regular expressions are a real little language used in many other programming languages, you can look them up in documentation. These characters can be used in combination much the same way that +, - , * and / are used in arithmetic expressions to powerful effect. For example: ^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$ is a regular expression that matches a valid US phone number. You are not expected to understand how this expression works, just to realize that regular expressions can be more sophisticated than our simple examples. This particular regular expression also isn’t foolproof, there are telephone formats that it will fail to match, and this is one of the problems with using regular expressions. As already mentioned, knowing what a particular regular expression doesn’t match can be as important as knowing what it does match. Complex regular expressions are dangerous because they often hide subtle match conditions that don’t become apparent until something goes wrong. They tend to be buggy and error prone. Creating regular expressions to do a particular task is also a fun activity much like doing Sudoku puzzles or other word games and there are recreational websites devoted to them. Why A “Little” Language?So far we have used the term “little language” and suggested that there are things such a language cannot express, but haven’t really explained what these things might be. Let’s look at a particular problem for the regular expression language that it cannot solve. Let’s suppose we want to match a simple palindrome, a string that reads the same left-to-right and right-to-left, like ABA, AABAA, AAABAAA and so on. Such a string is easy enough to define by saying that it has the same number of As before the B as follow it, but there is no regular expression that can match this class of strings. You can match an example from this category. For example A{2,2}BA{2,2} matches AABAA, but it only matches this example. You can write a similar regular expression for any number of As on both sides of the B but there is no way to write an expression that simply checks that both sides of the B have the same number of As. The reason that this cannot be done is that regular expressions don’t carry information from one part of the attempted match to the another. There is no way a regular expression can count the number of As on the left of the B and compare this to the count of the number of As on the right. You can write a regular expression for each possible number of As in turn, i.e. our example tests for two As, but there is no single expression that does the job for any matching number of As. If you wanted to cover all of the possibilities you would need an infinite number of regular expressions, one for each number of As that make this palindrome and this is not practical. It is important to know about such limitations if only to avoid spending time trying to make regular expressions or another little language do more that it can. You can extend the definition of a regular expression to make it do this task but then it would no longer be a little language, but a full programming language and it would hence lose much of its charm and usefulness. Little languages are limited, but they still have more than enough power for their simplicity. Summary
To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.
Comments
or email your comment to: comments@i-programmer.info |
||||||
Last Updated ( Tuesday, 01 July 2025 ) |