.NET Regular Expressions in depth
Written by Mike James
Tuesday, 15 April 2014
Of course we now have the problem that it isn’t unreasonable for an ISBN to be written as ISBN: 9 or ISBN:9 with perhaps even more than one space after the colon.
We clearly need a way to specify the number of repeats that are allowed in a matching string.
To do this we make use of “quantifiers” following the specification to be repeated.
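The most commonly used quantifiers are:
  *      zero or more
  +      one or more
  ?      zero or one
  {n}    exactly n
  {n,}   at least n
  {n,m}  at least n and at most m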
In many ways this is the point at which regular expression use starts to become interesting and inevitably more complicated. You could even say that the use of * and + is what makes a regular expression regular in the sense of a regular grammar.
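A quick way to get a feel for the basic quantifiers is to try them against a few short strings. The following is only a sketch: the class name, the Whole helper and the a/b/c test strings are invented for illustration, and the ^ and $ anchors are added so that the whole input has to match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierBasics
{
    // Hypothetical helper: ^ and $ anchor the pattern so that the
    // whole input has to match, not just part of it.
    public static bool Whole(string pattern, string input)
    {
        return Regex.IsMatch(input, "^" + pattern + "$");
    }

    public static void Main()
    {
        // b* zero or more, b+ one or more, b? zero or one, b{2} exactly two
        string[] patterns = { "ab*c", "ab+c", "ab?c", "ab{2}c" };
        string[] inputs = { "ac", "abc", "abbc" };

        foreach (string p in patterns)
            foreach (string s in inputs)
                Console.WriteLine("{0,-7} {1,-5} {2}", p, s, Whole(p, s));
    }
}
```

Only ab*c accepts all three inputs; ab+c rejects "ac", ab?c rejects "abbc" and ab{2}c accepts only "abbc".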
Simple examples are not hard to find. For example:
  ISBN:\s*\d
matches “ISBN:” followed by any number of white-space characters, including none at all, followed by a digit. Similarly:
  ISBN:?\s*\d
matches “ISBN” followed by an optional colon, any number of white-space characters, including none, followed by a digit.
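The two ISBN expressions described above can be tried out with a few lines of C#. This is a minimal sketch: the class name, field names and sample strings are made up for the purpose, and the two patterns are written out from the descriptions in the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsbnQuantifierDemo
{
    // "ISBN:" then zero or more white-space characters, then one digit.
    public static readonly Regex ColonRequired = new Regex(@"ISBN:\s*\d");

    // As above, but the colon itself is optional.
    public static readonly Regex ColonOptional = new Regex(@"ISBN:?\s*\d");

    public static void Main()
    {
        // Illustrative inputs only
        string[] samples = { "ISBN:9", "ISBN: 9", "ISBN:   9", "ISBN 9", "ISBN9" };

        foreach (string s in samples)
        {
            Console.WriteLine("{0,-12} colon required: {1,-5} colon optional: {2}",
                "\"" + s + "\"", ColonRequired.IsMatch(s), ColonOptional.IsMatch(s));
        }
    }
}
```

The first expression accepts the three samples that contain a colon, however many spaces follow it; the second accepts all five.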
Quantifiers are easy but there is a subtlety that often goes unnoticed.
Quantifiers, by default, are “greedy”.
That is, they match as many entities as they can, even when you might think that the regular expression provides a better match a little further on. The easiest way to follow this is with a simple example.
Suppose you need a regular expression to parse some HTML tags:
  <div>hello</div>
If you want to match just a pair of opening and closing tags you might well try the following regular expression:
  <div>.*</div>
which seems to say “the string starts with <div> then any number including zero of other characters followed by </div>”. If you try this out on the example given above you will find that it matches:
  <div>hello</div>
However if you now try it out against a string containing two such pairs, for example:
  <div>hello</div><div>world</div>
you will discover that the match is to the entire string.
That is, the final </div> in the regular expression is matched to the final </div> in the string even though there is an earlier occurrence of the same substring.
This is because the quantifiers are greedy by default and attempt to find the longest possible match. In this case the .* matches everything including the first </div>. So why doesn’t it also match the final </div>? The reason is that if it did the entire regular expression would fail to match anything because there would be no closing </div>.
What happens is that the quantifiers continue to match until the regular expression fails, then the regular expression engine backtracks in an effort to find a match.
Notice that all of the standard quantifiers are greedy and will match more than you might expect based on what follows in the regular expression.
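Greedy matching is easy to confirm for yourself. The sketch below assumes a sample string made of two <div> pairs side by side; the string, the class name and the First helper are illustrative choices, not anything fixed by the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: text of the first match, or "" if there is none.
    public static string First(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Assumed sample text: two <div> pairs one after the other.
        string input = "<div>hello</div><div>world</div>";

        // .* first runs to the end of the string, then the engine backtracks
        // just far enough for the closing </div> in the pattern to match.
        Console.WriteLine(First("<div>.*</div>", input));
    }
}
```

The program prints the whole input string, not just the first pair of tags.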
If you don’t want greedy quantifiers the solution is to use “lazy” quantifiers, which are formed by following a standard quantifier with a question mark.
To see this in action, change the previous regular expression to read:
  <div>.*?</div>
With this change in place the result of matching against a string containing two pairs, such as:
  <div>hello</div><div>world</div>
is just the first pair of <div> tags, that is <div>hello</div>.
Notice that all of the quantifiers, including ?, have a lazy version and yes you can write ?? to mean a lazy “zero or one” occurrence.
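The difference between the greedy and lazy forms shows up clearly if you run both against the same input. Again this is only a sketch under assumptions: the two-pair sample string, the "ISBN: 9" string and the names used are invented for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Hypothetical helper: text of the first match, or "" if there is none.
    public static string First(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Assumed sample text: two <div> pairs one after the other.
        string input = "<div>hello</div><div>world</div>";

        Console.WriteLine(First("<div>.*</div>", input));   // greedy: both pairs
        Console.WriteLine(First("<div>.*?</div>", input));  // lazy: first pair only

        // With the lazy form, Regex.Matches finds each pair separately.
        Console.WriteLine(Regex.Matches(input, "<div>.*?</div>").Count);

        // ? prefers one occurrence of the colon, ?? prefers none.
        Console.WriteLine(First("ISBN:?", "ISBN: 9"));
        Console.WriteLine(First("ISBN:??", "ISBN: 9"));
    }
}
```

Note that the lazy ?? at the end of a pattern never needs to take the colon at all, which is why a lazy quantifier is usually only useful when something follows it.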
The distinction between greedy and lazy quantifiers is perhaps the biggest reason for a reasonably well-tested regular expression to go wrong when used against a wider range of example strings.
Always remember that a standard greedy quantifier will match as many times as possible while still allowing the regular expression to match, and its lazy version will match as few times as possible while still allowing the regular expression to match.
Grouping and alternatives
Regular strings often have alternative forms. For example the ISBN designator could be simply ISBN: or it could be ISBN-13: or any of many other reasonable variations. You can specify an either/or situation using the vertical bar |, the alternation operator, as in x|y which will match an x or a y. For example:
  ISBN:|ISBN-13:
matches either ISBN: or ISBN-13:. This is easy enough but what about:
  ISBN:|ISBN-13:\s*\d
At first glance this seems to match either ISBN: or ISBN-13: followed by any number of white space characters and a single digit, but it doesn’t.
The | operator has the lowest priority and the alternative matches are everything to the left and everything to the right, i.e. either ISBN: or ISBN-13:\s*\d.
To match the white space and digit in both forms of the ISBN designator we would have to write:
  ISBN:\s*\d|ISBN-13:\s*\d
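You can see the low priority of | at work by comparing what the two expressions actually match. This is a sketch only; the class name, the First helper and the sample strings are assumptions made for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationPrecedenceDemo
{
    // Hypothetical helper: text of the first match, or "" if there is none.
    public static string First(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // | splits the whole expression: "ISBN:" OR "ISBN-13:\s*\d"
        string loose = @"ISBN:|ISBN-13:\s*\d";

        // The common tail repeated on both sides of the |
        string repeated = @"ISBN:\s*\d|ISBN-13:\s*\d";

        Console.WriteLine("[" + First(loose, "ISBN: 9") + "]");       // just the prefix
        Console.WriteLine("[" + First(loose, "ISBN-13: 9") + "]");    // prefix, space and digit
        Console.WriteLine("[" + First(repeated, "ISBN: 9") + "]");    // prefix, space and digit
        Console.WriteLine(Regex.IsMatch("ISBN: none", loose));        // True, no digit needed
        Console.WriteLine(Regex.IsMatch("ISBN: none", repeated));     // False
    }
}
```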
Clearly having to repeat everything that is in common on either side of the alternation operator is going to make things difficult and this is where grouping comes in. Anything grouped between parentheses is treated as a single unit – and grouping has a higher priority than the alternation operator.
So for example:
  (ISBN:|ISBN-13:)\s*\d
matches either form of the ISBN designator followed by any number of white space characters and a single digit, because the parentheses limit the range of the alternation operator to the substrings to the left and right within the group.
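The grouped form is simple to check in code. The following sketch uses made-up sample strings and names, and the pattern is written out from the description above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // The parentheses confine the | to the two designators, so the
    // white space and digit are required after either of them.
    public static readonly Regex Grouped = new Regex(@"(ISBN:|ISBN-13:)\s*\d");

    public static void Main()
    {
        // Illustrative inputs only
        string[] samples = { "ISBN: 9", "ISBN-13:9", "ISBN-13:   9", "ISBN: none" };

        foreach (string s in samples)
        {
            Match m = Grouped.Match(s);
            Console.WriteLine("{0,-14} {1}", s, m.Success ? "[" + m.Value + "]" : "no match");
        }
    }
}
```

The first three samples match in full, while "ISBN: none" is rejected because no digit follows the designator.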
A related issue applies to the alternation operator, which tries its alternatives from left to right and stops at the first one that succeeds. For example, suppose you try to match the previous un-grouped expression but without the colon:
  ISBN|ISBN-13
In this case the first pattern, i.e. “ISBN”, will match even if the string is “ISBN-13”. It doesn’t matter that the second expression is a “better” match. No amount of grouping will help with this problem because the shorter match will be tried and succeed first.
In this case the solution is to either swap the order of the sub-expressions so that the longer comes first or include something that always marks the end of the target string. For example, in this case if we add the colon then the
  ISBN:
subexpression cannot possibly match the ISBN-13: string.
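Both fixes, swapping the order and adding a terminating colon, can be compared with the problem expression in a short test. As before this is a sketch, with the class name, helper and sample strings chosen purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Hypothetical helper: text of the first match, or "" if there is none.
    public static string First(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Shorter alternative first: it succeeds and the longer is never tried.
        Console.WriteLine(First("ISBN|ISBN-13", "ISBN-13"));

        // Fix 1: put the longer alternative first.
        Console.WriteLine(First("ISBN-13|ISBN", "ISBN-13"));

        // Fix 2: end each alternative with the colon, so "ISBN:" cannot
        // match at the start of "ISBN-13:".
        Console.WriteLine(First("ISBN:|ISBN-13:", "ISBN-13:"));
    }
}
```

The first line printed is ISBN, the second ISBN-13 and the third ISBN-13: with its colon.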
Last Updated (Tuesday, 15 April 2014)