.NET Regular Expressions in depth
Wednesday, 18 August 2010 00:00

Back references

So far so good but what can you use captures for?

The answer is two-fold – more sophisticated regular expressions and replacements.

Let’s start with their use in building more sophisticated regular expressions.

Using the default numbering system described above, you can refer to an earlier capture from within the regular expression itself.

That is, if you write \n, where n is the number of a capture group, that part of the expression matches whatever text the group most recently captured. Confused?

It’s easy once you have seen it in action.

Consider the task of checking that HTML tags occur in the correct opening and closing pairs. That is, if you find a <div> tag, the next closing tag to the right should be a </div>. You can already write a regular expression to detect this condition, but captures and back references make it much easier.

If you start the regular expression with a subexpression that captures the string within the brackets, you can check that the same word occurs within the closing tag using a back reference to the capture group:

Regex ex2= new Regex(@"<(div)></\1>");

Notice that the \1 in the final part of the expression tells the regular expression engine to retrieve the last match of the first capture group. If you try this out you will find that it matches <div></div> but not <div></pr>, say.

You could have done the same thing without using a back reference, but with one it’s easy to extend the expression to cope with additional tags.

For example:

Regex ex2= new Regex(
@"<(div|pr|span|script)></\1>");

matches correctly closed div, pr, span and script tags.
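
To see the back reference at work, here is a minimal sketch that wraps the expression above in a small helper class. The class and method names (TagPairDemo, HasMatchingPair, MatchedTag) are illustrative assumptions and not part of the original article.

```csharp
using System.Text.RegularExpressions;

public static class TagPairDemo
{
    // Group 1 captures the opening tag name; \1 must repeat that exact text.
    private static readonly Regex Pair =
        new Regex(@"<(div|pr|span|script)></\1>");

    // True when the input contains an empty, correctly closed tag pair.
    public static bool HasMatchingPair(string input)
    {
        return Pair.IsMatch(input);
    }

    // The tag name captured by group 1, or an empty string if nothing matches.
    public static string MatchedTag(string input)
    {
        Match m = Pair.Match(input);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Calling TagPairDemo.HasMatchingPair("<div></div>") should succeed, while "<div></pr>" should not, because \1 can only match the text that group 1 actually captured.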

If you are still not convinced of the power of capture and back reference try and write a regular expression that detects repeated words without using them. 

The solution using a back reference is almost trivial:

Regex ex2= new Regex(@"\b(\w+)\s+\1\b");

The first part of the expression simply matches a word by the following process: start at a word boundary, capture as many word characters as you can, then allow one or more white space characters. Finally, check to see if the next word is the same as the capture.

The only tricky bit is remembering to put the word boundary at the end. Without it you will match cases where the second word merely starts with the first, as in “the theory”.
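
The effect of the trailing word boundary can be checked with a short sketch. It compares the expression above with a variant that omits the final \b; the class and method names are hypothetical and chosen only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class RepeatedWordDemo
{
    // The article's expression, with the trailing word boundary.
    private static readonly Regex WithBoundary =
        new Regex(@"\b(\w+)\s+\1\b");

    // The same expression without the trailing \b, for comparison only.
    private static readonly Regex WithoutBoundary =
        new Regex(@"\b(\w+)\s+\1");

    // The first repeated word, or an empty string if there is none.
    public static string FirstRepeat(string text)
    {
        Match m = WithBoundary.Match(text);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    // True when the boundary-less variant reports a repeat.
    public static bool NaiveHasRepeat(string text)
    {
        return WithoutBoundary.IsMatch(text);
    }
}
```

With the input "This is is a test", FirstRepeat returns "is". With "the theory" it finds nothing, whereas NaiveHasRepeat reports a false positive because "theory" begins with "the".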

As well as anonymous captures you can also create named captures using:

(?<name>regex)

or

(?'name'regex)

You can then refer to the capture by name using the syntax

\k<name>

or

\k'name'

Using a named capture our previous duplicate word regular expression can be written as:

@"\b(?<word>\w+)\s+\k<word>\b"

If you need to process named captures outside of a regular expression, i.e. using the Match and Group classes, you can look a group up by name, for example match.Groups["word"]. If you use capture numbers instead, you need to know that named captures are numbered left to right and outer to inner after all the unnamed captures have been numbered.
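
As a rough illustration of working with a named capture from code, the sketch below reads the group by name and also asks the Regex object which number a named group received. The class name, method names and the small (?<first>a)(b) test pattern are assumptions made up for this example; \k<word> is the documented .NET form of a named back reference.

```csharp
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Duplicate-word expression using a named group and a named back reference.
    private static readonly Regex Repeat =
        new Regex(@"\b(?<word>\w+)\s+\k<word>\b");

    // Reads the capture by name rather than by number.
    public static string RepeatedWord(string text)
    {
        Match m = Repeat.Match(text);
        return m.Success ? m.Groups["word"].Value : string.Empty;
    }

    // In (?<first>a)(b) the unnamed group (b) is numbered first,
    // so the named group 'first' receives the next number.
    public static int NumberOfNamedGroup()
    {
        Regex mixed = new Regex(@"(?<first>a)(b)");
        return mixed.GroupNumberFromName("first");
    }
}
```

NumberOfNamedGroup returns 2 here even though the named group appears first in the pattern, which is the numbering rule in action.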

If you need to group items together but don’t want to make use of a capture you can use:

(?:regex)

This groups exactly as plain parentheses would, but the group is left out of the list of capture groups. This can make a regular expression slightly more efficient, although that is rarely an issue.
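
The difference shows up in the Groups collection of a match. The following sketch counts the groups for a capturing and a non-capturing version of the same pattern; the helper names and the (ab)+(c) pattern are assumptions invented for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class NonCapturingDemo
{
    // Number of groups in the match, including group 0 (the whole match).
    public static int GroupCount(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Groups.Count;
    }

    // Text held by group number 1 for the given pattern and input.
    public static string FirstGroup(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }
}
```

Against the input "ababc", the pattern (ab)+(c) yields three groups and group 1 holds "ab", whereas (?:ab)+(c) yields two groups and group 1 holds "c".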

Advanced capture

There are other capture group constructs, but these are far less useful and, because they are even more subtle, have a reputation for introducing bugs. The balancing group is, however, worth knowing about as it gives you the power to balance brackets and other constructs. First, though, we need to know about a few of the other less common groupings: the assertions.

There are four of these and the final three are fairly obvious variations on the first. They all serve to impose a condition on the match without affecting what is captured.

Zero-width positive lookahead assertion

(?=regex)

This continues the match only if the regex matches immediately to the right of the current position, but the text it matches is not consumed and does not become part of the match. For example,

\w+(?=\d)

matches a run of word characters that is followed by a digit, but the digit is not included in the match. That is, given Paris9 it returns Paris as group 0. In other words, you can use it to assert a pattern that must follow a matched subexpression.
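
A quick way to confirm this behaviour is to run the expression against an input with and without a trailing digit. This is a minimal sketch; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Word characters that must be followed by a digit; the digit is not consumed.
    private static readonly Regex WordBeforeDigit = new Regex(@"\w+(?=\d)");

    // The matched text, or an empty string when there is no match.
    public static string Find(string input)
    {
        Match m = WordBeforeDigit.Match(input);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Find("Paris9") returns "Paris", while Find("Paris") returns an empty string because no position in the input is followed by a digit.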

Zero-width negative lookahead assertion

(?!regex)

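This works like the positive lookahead assertion, but the regex has to fail to match immediately to the right. For example:

\w+(?!\d)

matches a run of word characters that is not followed by a digit. Be aware that this is weaker than it looks: because \w also matches digits, the expression matches the whole of Paris9, since nothing follows the final 9.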
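
It is worth trying the negative lookahead in code, because \w also matches digits and the greedy \w+ can simply swallow a trailing digit. The sketch below contrasts the expression above with a stricter variant, Paris(?!\d), which is an assumption added for this illustration, as are the class and method names.

```csharp
using System.Text.RegularExpressions;

public static class NegativeLookaheadDemo
{
    // The article's expression: word characters not followed by a digit.
    private static readonly Regex Greedy = new Regex(@"\w+(?!\d)");

    // Illustrative variant: the literal word must not be followed by a digit.
    private static readonly Regex Literal = new Regex(@"Paris(?!\d)");

    // Text matched by the greedy expression, or an empty string.
    public static string GreedyMatch(string input)
    {
        Match m = Greedy.Match(input);
        return m.Success ? m.Value : string.Empty;
    }

    // True when the literal variant finds a match.
    public static bool LiteralMatches(string input)
    {
        return Literal.IsMatch(input);
    }
}
```

GreedyMatch("Paris9") returns "Paris9" in full, whereas LiteralMatches("Paris9") is false and LiteralMatches("Paris, France") is true. A negative lookahead is most useful when the part before it cannot itself absorb the forbidden text.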

Zero-width positive lookbehind assertion

(?<=regex)

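Again this works like the positive lookahead assertion, but the regex has to match immediately to the left.

For example:

(?<=\d)\w+

matches a run of word characters that is immediately preceded by a digit; the digit itself is not part of the match. Given 9Paris it returns Paris.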
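
The lookbehind can be checked in the same way as the lookahead. In this small sketch the class and method names are hypothetical; the expression is the one shown above.

```csharp
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Word characters that must be immediately preceded by a digit.
    private static readonly Regex AfterDigit = new Regex(@"(?<=\d)\w+");

    // The matched text, or an empty string when there is no match.
    public static string Find(string input)
    {
        Match m = AfterDigit.Match(input);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Find("9Paris") returns "Paris", with the leading digit excluded, and Find("Paris") returns an empty string because no position in it has a digit to its left.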

Zero-width negative lookbehind assertion

(?<!regex)

This is just the negation of the Zero-width positive lookbehind assertion.

For example:

(?<!\d)\w+

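matches a run of word characters that does not start immediately after a digit. Note that, given 9Paris, it matches the whole of 9Paris, because nothing precedes the 9.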
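
As with the negative lookahead, it helps to test what the negative lookbehind really rejects. The sketch below runs the expression above and, for contrast, a variant (?<!\d)Paris that is an assumption added for illustration; the class and method names are also hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NegativeLookbehindDemo
{
    // The article's expression: the match must not start right after a digit.
    private static readonly Regex General = new Regex(@"(?<!\d)\w+");

    // Illustrative variant: the literal word must not be preceded by a digit.
    private static readonly Regex Literal = new Regex(@"(?<!\d)Paris");

    // Text matched by the general expression, or an empty string.
    public static string GeneralMatch(string input)
    {
        Match m = General.Match(input);
        return m.Success ? m.Value : string.Empty;
    }

    // True when the literal variant finds a match.
    public static bool LiteralMatches(string input)
    {
        return Literal.IsMatch(input);
    }
}
```

GeneralMatch("9Paris") returns "9Paris", since the match simply starts at the digit. LiteralMatches("9Paris") is false, while LiteralMatches("in Paris") is true.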

Now that we have seen the assertions we can move on to consider the balancing group:

(?<name1-name2>regex)

This works by deleting the current capture from the capture collection for name2 and storing the text between that capture and the current position in the capture collection for name1. If there is no current capture for name2 then the group fails and backtracking occurs; if no alternative succeeds, the expression fails. The name1 part is optional: (?<-name2>regex) just deletes the capture.
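
One common use of the balancing group is checking that parentheses are balanced. The following is a sketch under stated assumptions rather than a definitive recipe: the group names open and content, the class and method names, and the closing conditional (?(open)(?!)), which forces a failure if any opening bracket is left unmatched, are all choices made for this illustration and do not come from the article.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BalancedDemo
{
    // (?<open>\()          records each opening bracket.
    // (?<content-open>\))  removes the latest 'open' capture and stores the
    //                      text between the brackets in 'content'.
    // (?(open)(?!))        fails the match if an 'open' capture remains.
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<open>\()|(?<content-open>\)))*(?(open)(?!))$");

    // True when every opening bracket has a matching closing bracket.
    public static bool IsBalanced(string input)
    {
        return Balanced.IsMatch(input);
    }

    // The text inside each bracket pair, in the order the pairs were closed.
    public static List<string> Contents(string input)
    {
        var result = new List<string>();
        Match m = Balanced.Match(input);
        if (m.Success)
        {
            foreach (Capture c in m.Groups["content"].Captures)
            {
                result.Add(c.Value);
            }
        }
        return result;
    }
}
```

For the input "(a(b)c)" IsBalanced is true and Contents yields "b" followed by "a(b)c". Inputs such as "(a(b)c" or "a)b(" are rejected, the first because an open capture is left over and the second because there is nothing to delete when the closing bracket is reached.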


Last Updated ( Wednesday, 18 August 2010 12:39 )