The Trick Of The Mind - Regular Little Language
Written by Mike James   
Tuesday, 01 July 2025
Article Index
The Trick Of The Mind - Regular Little Language
star operator

Does the expression A*B*A* match strings like BBAA, i.e. where there are no starting As? With regular expressions it is often as important to know what they don’t match as what they do match. The asterisk does indeed match zero occurrences of the letter as well as one or more. What this means is that A*B*A* will match BAA, AAAA, AAAABB, BBB and so on. In other words, the regular expression doesn’t really capture the idea of some As followed by some Bs followed by some As. If you only want to match a “true” ABA pattern then you need the plus sign quantifier which demands at least one match. So for example, A+B+C+ needs at least one A, B and C. So the shortest string it matches is ABC and it also matches AABBCC, AAABC and more, but not AACC ,for example, where a letter is missing , i.e. occurs zero times.

This difference between zero things and one thing is often important in programming and you will meet many examples of the distinction. Here it is as part of regular expressions, where the asterisk matches any number of items including zero and the plus matches any number of items excluding zero. It all depends what you mean by “any number of items”. It is often said that only a mathematician would include zero in “any number of items”, but clearly programmers also sometimes have to.

The final quantifier is the question mark which matches zero or one occurrence of the character. For example, A?B?C? matches ABC and A, B, C, AB, BC and AC, but not AABBCC etc.

The three quantifiers *,+ and ? are the original ones defined as part of regular expressions, but today you can write {min,max} to specify an exact range of repeats. For example, A{1,1}B{2,2}A{1,1} matches only ABBA but A{1,2}B{2,2}A{1,2} also matches ABBAA. AABBA and AABBBB.

There are many more special characters you can use to specify what matches the string you are looking for and, as regular expressions are a real little language used in many other programming languages, you can look them up in documentation. These characters can be used in combination much the same way that +, - , * and / are used in arithmetic expressions to powerful effect. For example:

^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$

is a regular expression that matches a valid US phone number. You are not expected to understand how this expression works, just to realize that regular expressions can be more sophisticated than our simple examples. This particular regular expression also isn’t foolproof, there are telephone formats that it will fail to match, and this is one of the problems with using regular expressions.

As already mentioned, knowing what a particular regular expression doesn’t match can be as important as knowing what it does match. Complex regular expressions are dangerous because they often hide subtle match conditions that don’t become apparent until something goes wrong. They tend to be buggy and error prone.

Creating regular expressions to do a particular task is also a fun activity much like doing Sudoku puzzles or other word games and there are recreational websites devoted to them.

Why A “Little” Language?

So far we have used the term “little language” and suggested that there are things such a language cannot express, but haven’t really explained what these things might be. Let’s look at a particular problem for the regular expression language that it cannot solve.

Let’s suppose we want to match a simple palindrome, a string that reads the same left-to-right and right-to-left, like ABA, AABAA, AAABAAA and so on. Such a string is easy enough to define by saying that it has the same number of As before the B as follow it, but there is no regular expression that can match this class of strings.

You can match an example from this category. For example A{2,2}BA{2,2} matches AABAA, but it only matches this example. You can write a similar regular expression for any number of As on both sides of the B but there is no way to write an expression that simply checks that both sides of the B have the same number of As.

The reason that this cannot be done is that regular expressions don’t carry information from one part of the attempted match to the another. There is no way a regular expression can count the number of As on the left of the B and compare this to the count of the number of As on the right. You can write a regular expression for each possible number of As in turn, i.e. our example tests for two As, but there is no single expression that does the job for any matching number of As. If you wanted to cover all of the possibilities you would need an infinite number of regular expressions, one for each number of As that make this palindrome and this is not practical.

It is important to know about such limitations if only to avoid spending time trying to make regular expressions or another little language do more that it can.

You can extend the definition of a regular expression to make it do this task but then it would no longer be a little language, but a full programming language and it would hence lose much of its charm and usefulness. Little languages are limited, but they still have more than enough power for their simplicity.

Summary

  • Little languages often occur as part of complete computer languages and they are interesting to find out about. Noticing that something is a little language is often all you need to realize how to master it.

  • Arithmetic is a sophisticated little language and its priority rules are the reason so many students struggle to understand it.

  • Rules like BODMAS, especially brackets, help get arithmetic right.

  • In the early days of programming it wasn’t obvious how to convert an arithmetic expression into an equivalent set of instructions. The first computer language to do this was FORTRAN in the late 1950s.

  • The reason for the complicated priority system is so as to make the distributive law easy to work with.

  • Regular expressions are another very common little language found in most big programming languages.

  • A regular expression is a specification for a target string that you are looking for in a larger string.

  • Quantifiers *, + and ? are used to control the number of times a character can match.

  • Regular expressions can become very complex and difficult to understand.

  • They form a little language because they do not allow details of earlier matches to affect later matches and this means that there are patterns for which it is impossible to create regular expressions.

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


Linux Foundation Launches Agent2Agent Project
24/06/2025

The Linux Foundation has launched the Agent2Agent (A2A) project, an open protocol created by Google for secure agent-to-agent communication and collaboration.



Windows 11 Overtakes Windows 10 - But Not In Europe
08/07/2025

With the end of support of Windows 10 just three months away, Windows 11 has finally edged ahead of Windows 10 in terms of  Desktop Windows Version Market Share on a Worldwide Basis. In Europe, h [ ... ]


More News

pico book

 

Comments




or email your comment to: comments@i-programmer.info



Last Updated ( Tuesday, 01 July 2025 )