.NET Regular Expressions in depth
Wednesday, 18 August 2010 00:00

Anchors

As well as characters and character sets you can also use location matches or anchors.

For example, the ^ (caret) only matches the start of the string, so:

 @"^ISBN:"

will only match if the string starts with ISBN: and doesn’t match if the same substring occurs anywhere else. The most useful anchors are:

  • ^          start of string
  • $          end of string
  • \b         word boundary – i.e. between a \w and \W
  • \B        anywhere but a word boundary

So for example: 

 @"^\d+$"

specifies a string consisting of nothing but digits. Compare this to

 @"^\d*$"

which would also accept an empty string.
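The anchors above can be tried out with a small console sketch. The class and method names here (AnchorDemo, IsAllDigits, IsDigitsOrEmpty) and the sample strings are illustrative assumptions, not part of the original article; only Regex.IsMatch is the real .NET call.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to illustrate the anchors.
public static class AnchorDemo
{
    // True only if the whole input is one or more digits.
    public static bool IsAllDigits(string input)
    {
        return Regex.IsMatch(input, @"^\d+$");
    }

    // As above, but the empty string is accepted too.
    public static bool IsDigitsOrEmpty(string input)
    {
        return Regex.IsMatch(input, @"^\d*$");
    }

    public static void Main()
    {
        Console.WriteLine(IsAllDigits("12345"));      // True
        Console.WriteLine(IsAllDigits("12a45"));      // False
        Console.WriteLine(IsAllDigits(""));           // False
        Console.WriteLine(IsDigitsOrEmpty(""));       // True

        // ^ pins the match to the start of the string.
        Console.WriteLine(Regex.IsMatch("ISBN: 9", @"^ISBN:"));     // True
        Console.WriteLine(Regex.IsMatch("My ISBN: 9", @"^ISBN:"));  // False

        // \b needs a word boundary on each side; \B needs the opposite.
        Console.WriteLine(Regex.IsMatch("the cat sat", @"\bcat\b")); // True
        Console.WriteLine(Regex.IsMatch("concatenate", @"\bcat\b")); // False
        Console.WriteLine(Regex.IsMatch("concatenate", @"\Bcat\B")); // True
    }
}
```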

One subtle point only emerges when you consider strings with line breaks.

In this case, by default, the ^ and $ match only the very start and end of the string (strictly speaking, $ also matches just before a final newline at the very end of the string).

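If you want them to match line beginnings and endings you have to specify the RegexOptions.Multiline option, which is the .NET equivalent of the /m option in other regular expression dialects and can also be written inline as (?m). It’s also worth knowing about the \G anchor, which only matches at the point where the previous match ended. It is mainly useful when used with the NextMatch method or the Matches method, where it forces all matches to be contiguous.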
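A minimal sketch of both points, assuming a console program: the helper CountMatches and the sample strings are made up for illustration, while Regex.Matches and RegexOptions.Multiline are the standard .NET members. The line breaks are plain \n characters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to illustrate Multiline and \G.
public static class LineAnchorDemo
{
    // Counts the matches of a pattern in the input under the given options.
    public static int CountMatches(string input, string pattern, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    public static void Main()
    {
        string lines = "one\ntwo\nthree";

        // Default: ^ and $ refer to the whole string, so nothing matches.
        Console.WriteLine(CountMatches(lines, @"^\w+$", RegexOptions.None));      // 0
        // Multiline: ^ and $ also match at each line break.
        Console.WriteLine(CountMatches(lines, @"^\w+$", RegexOptions.Multiline)); // 3

        string mixed = "123abc456";
        // Without \G every digit in the string is found.
        Console.WriteLine(CountMatches(mixed, @"\d", RegexOptions.None));   // 6
        // With \G each match must start where the previous one ended,
        // so matching stops at the first non-digit.
        Console.WriteLine(CountMatches(mixed, @"\G\d", RegexOptions.None)); // 3
    }
}
```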

Quantifiers

Of course we now have the problem that it isn’t unreasonable for an ISBN to be written as ISBN: 9 or ISBN:9 with perhaps even more than one space after the colon.

We clearly need a way to specify the number of repeats that are allowed in a matching string.

To do this we make use of “quantifiers” following the specification to be repeated.

The most commonly used quantifiers are:

  • *            zero or more
  • +           one or more
  • ?           zero or one
  • {n}        exactly n times
  • {n,}       n or more times
  • {n,m}    at least n at most m times
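The counted forms in the list above are perhaps easiest to see side by side. The following is only a sketch; the class name, method names and test strings are assumptions chosen for illustration, and the patterns are anchored with ^ and $ so that the whole string has to fit the count.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to illustrate counted quantifiers.
public static class RepeatCountDemo
{
    // {n}: exactly three digits.
    public static bool IsExactlyThreeDigits(string s)
    {
        return Regex.IsMatch(s, @"^\d{3}$");
    }

    // {n,}: two or more digits.
    public static bool IsTwoOrMoreDigits(string s)
    {
        return Regex.IsMatch(s, @"^\d{2,}$");
    }

    // {n,m}: at least two and at most four digits.
    public static bool IsTwoToFourDigits(string s)
    {
        return Regex.IsMatch(s, @"^\d{2,4}$");
    }

    public static void Main()
    {
        Console.WriteLine(IsExactlyThreeDigits("123"));   // True
        Console.WriteLine(IsExactlyThreeDigits("1234"));  // False
        Console.WriteLine(IsTwoOrMoreDigits("1"));        // False
        Console.WriteLine(IsTwoOrMoreDigits("12345"));    // True
        Console.WriteLine(IsTwoToFourDigits("123"));      // True
        Console.WriteLine(IsTwoToFourDigits("12345"));    // False
    }
}
```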

In many ways this is the point at which regular expression use starts to become interesting and inevitably more complicated.

Simple examples are not hard to find. For example:

 @"ISBN:\s*\d"

matches “ISBN:” followed by any number of white-space characters including none at all followed by a digit. Similarly:

 @"ISBN:?\s*\d"

matches “ISBN” followed by an optional colon, any number of white-space characters including none followed by a digit.
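To check the two ISBN patterns against a few inputs, something like the following could be used. The helper Find and the sample inputs are hypothetical; Regex.Match and Match.Value are the standard .NET members, and Value is an empty string when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to illustrate the ISBN patterns.
public static class IsbnPrefixDemo
{
    // Returns the matched text, or an empty string when there is no match.
    public static string Find(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string strict = @"ISBN:\s*\d";    // colon required
        string relaxed = @"ISBN:?\s*\d";  // colon optional

        Console.WriteLine("[" + Find("ISBN:9", strict) + "]");     // [ISBN:9]
        Console.WriteLine("[" + Find("ISBN:   9", strict) + "]");  // [ISBN:   9]
        Console.WriteLine("[" + Find("ISBN 9", strict) + "]");     // []
        Console.WriteLine("[" + Find("ISBN 9", relaxed) + "]");    // [ISBN 9]
    }
}
```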

Quantifiers are easy but there is a subtlety that often goes unnoticed.

Quantifiers, by default, are “greedy”.

That is, they match as many characters as they can, even when you might think that the regular expression provides a better match a little further on. The easiest way to follow this is with a simple example.

Suppose you need a regular expression to parse some HTML tags:

<div>hello</div>

If you want to match just a pair of opening and closing tags you might well try the following regular expression:

Regex ex2= new Regex(@"<div>.*</div>");

which seems to say “the string starts with <div> then any number including zero of other characters followed by </div>”. If you try this out on the example given above you will find that it matches:

MessageBox.Show(
    ex2.Match(@"<div>hello</div>").ToString());

However if you now try it out against the string:

<div>hello</div><div>world</div> 

as in:

MessageBox.Show(
    ex2.Match(@"<div>hello</div><div>world</div>").ToString());

you will discover that the match is to the entire string.

That is the final </div> in the regular expression is matched to the final </div> in the string even though there is an earlier occurrence of the same substring.

This is because the quantifiers are greedy by default and attempt to find the longest possible match. In this case the .* matches everything including the first </div>. So why doesn’t it also match the final </div>? The reason is that if it did the entire regular expression would fail to match anything because there would be no closing </div>.

What happens is that the quantifiers continue to match until the regular expression fails, then the regular expression engine backtracks in an effort to find a match.

Notice that all of the standard quantifiers are greedy and will match more than you might expect based on what follows in the regular expression. If you don’t want greedy quantifiers the solution is to use “lazy” quantifiers which are formed by following the standard quantifiers by a question mark.

To see this in action, change the previous regular expression to read:

Regex ex2= new Regex(@"<div>.*?</div>");

With this change in place the result of matching to

@"<div>hello</div><div>world</div>"

is just the first pair of <div> tags, that is <div>hello</div>.
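The difference shows up clearly if you collect every match rather than just the first. This is a sketch that assumes a console program instead of the MessageBox calls used earlier; the AllMatches helper is hypothetical, and Regex.Matches is the standard .NET call.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to compare greedy and lazy matching.
public static class GreedyLazyDemo
{
    // Returns the text of every match of the pattern in the input.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        string html = "<div>hello</div><div>world</div>";

        // Greedy: .* runs to the end and backtracks to the last </div>,
        // so there is a single match covering the whole string.
        foreach (string s in AllMatches(html, @"<div>.*</div>"))
        {
            Console.WriteLine(s);
        }

        // Lazy: .*? stops at the first </div> each time,
        // so each pair of tags is matched separately.
        foreach (string s in AllMatches(html, @"<div>.*?</div>"))
        {
            Console.WriteLine(s);
        }
    }
}
```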

Notice that all of the quantifiers, including ?, have a lazy version and yes you can write ?? to mean a lazy “zero or one” occurrence.
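As a rough illustration of the lazy forms, the sketch below compares each greedy quantifier with its lazy partner on the same input. The helper FirstMatch and the sample strings are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to compare greedy and lazy quantifiers.
public static class LazyQuantifierDemo
{
    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // + takes every digit it can; +? is satisfied with one.
        Console.WriteLine(FirstMatch("12345", @"\d+"));      // 12345
        Console.WriteLine(FirstMatch("12345", @"\d+?"));     // 1

        // {2,4} takes four digits; {2,4}? stops at the minimum of two.
        Console.WriteLine(FirstMatch("12345", @"\d{2,4}"));  // 1234
        Console.WriteLine(FirstMatch("12345", @"\d{2,4}?")); // 12

        // ? takes the optional b; ?? prefers to leave it out.
        Console.WriteLine(FirstMatch("ab", @"ab?"));         // ab
        Console.WriteLine(FirstMatch("ab", @"ab??"));        // a
    }
}
```

In each lazy case nothing follows the quantifier in the pattern, so the engine has no reason to extend the match beyond the minimum.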

The distinction between greedy and lazy quantifiers is perhaps the biggest reason for a reasonably well-tested regular expression to go wrong when used against a wider range of example strings. Always remember that a standard greedy quantifier will match as many times as possible while still allowing the regular expression to match, and its lazy version will match as few times as possible while still allowing the regular expression to match.


Last Updated ( Wednesday, 18 August 2010 12:39 )
 
 

   
Copyright © 2013 i-programmer.info. All Rights Reserved.