.NET Regular Expressions In Depth

Written by Mike James

Thursday, 16 July 2020


Back references

So far so good but what can you use captures for?

The answer is two-fold – more sophisticated regular expressions and replacements.

Let’s start with their use in building more sophisticated regular expressions.

Using the default numbering system described above you can refer to a previous capture in the regular expression.

That is, if you write a backslash followed by the number of a capture group, for example \1 or \2, that point in the expression matches the same text that the capture group most recently matched – confused?

It’s easy once you have seen it in action.

Consider the task of checking that HTML tags occur in the correct opening and closing pairs. That is, if you find a <div> tag the next closing tag to the right should be a </div>. You can already write a regular expression to detect this condition but captures and back references make it much easier.

If you start the regular expression with a sub expression that captures the string within the brackets then you can check that the same word occurs within the closing bracket using a back reference to the capture group:

Regex ex2 = new Regex(@"<(div)></\1>");

Notice that the \1 in the final part of the expression tells the regular expression engine to match whatever the first capture group last captured. If you try this out you will find that it matches <div></div> but not <div></pr>, say.

You could have done the same thing without using a back reference, but with one it’s easy to extend the expression to cope with additional tags.

For example:

Regex ex2= new Regex( @"<(div|pr|span|script)></\1>");

matches correctly closed div, pr, span and script tags.
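To see the back reference at work, here is a minimal sketch that runs the expression above against a matching and a mismatched pair of tags. The variable names (tagPair, matched and so on) are illustrative and not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

// \1 must repeat whatever the first group captured in the opening tag.
var tagPair = new Regex(@"<(div|pr|span|script)></\1>");

// The closing tag repeats the captured name, so this matches.
bool matched = tagPair.IsMatch("<div></div>");

// The closing tag differs from the captured name, so this does not match.
bool mismatched = tagPair.IsMatch("<div></pr>");

// Group 1 holds the tag name that the back reference had to repeat.
string captured = tagPair.Match("<span></span>").Groups[1].Value;

Console.WriteLine($"{matched} {mismatched} {captured}");
```

Group 0 is always the whole match, so the tag name is found in Groups[1].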

If you are still not convinced of the power of capture and back reference, try to write a regular expression that detects repeated words without using them.

The solution using a back reference is almost trivial:

Regex ex2= new Regex(@"\b(\w+)\s+\1\b");

The first part of the expression simply matches a word by the following process: start at a word boundary, capture as many word characters as you can, then allow one or more white space characters. Finally, check to see if the next word is the same as the capture.

The only tricky bit is remembering to put the word boundary at the end. Without it you will also match a word that is repeated as the start of the next word, as in “the theory”.
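The effect of the trailing word boundary is easy to check. The following sketch compares the expression with and without the final \b; the sample sentences and variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// With the trailing \b the repeated text must be a complete word.
var repeated = new Regex(@"\b(\w+)\s+\1\b");

// Without it, the repeat only has to be the start of the next word.
var noBoundary = new Regex(@"\b(\w+)\s+\1");

// A genuine repeated word is found, and the match covers both copies.
string found = repeated.Match("Paris in the the spring").Value;

// "theory" merely starts with "the", so the strict version rejects it...
bool strictOnPrefix = repeated.IsMatch("the theory");

// ...but the version without the boundary accepts it.
bool looseOnPrefix = noBoundary.IsMatch("the theory");

Console.WriteLine($"'{found}' {strictOnPrefix} {looseOnPrefix}");
```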

As well as anonymous captures you can also create named captures using:

(?<name>regex)

(?'name'regex)

You can then refer to the capture by name using the syntax

\k<name>

\k'name'

Using a named capture our previous duplicate word regular expression can be written as:

@"\b(?<word>\w+)\s+\k<word>\b"

If you need to process named captures outside of a regular expression, i.e. using the Group and Capture classes, you can look them up by name, for example match.Groups["word"]. Named captures also have numbers, and if you use these you need to know that named captures are numbered left to right and outer to inner after all the unnamed captures have been numbered.
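As a rough illustration of how named captures look from code, the sketch below reads a named group through the Groups collection and asks the Regex object which number it was given. It uses the documented \k<word> form of the named back reference; the second pattern (mixed) and all the variable names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Duplicate-word detector using a named capture and a named back reference.
var dup = new Regex(@"\b(?<word>\w+)\s+\k<word>\b");
Match m = dup.Match("it was was late");

// Named groups can be read by name from the Groups collection.
string byName = m.Groups["word"].Value;

// Every named group also has a number, which can be looked up.
int wordNumber = dup.GroupNumberFromName("word");

// Unnamed groups are numbered first, so here (b) is group 1
// and the named group "first" becomes group 2.
var mixed = new Regex(@"(?<first>a)(b)");
int firstNumber = mixed.GroupNumberFromName("first");

Console.WriteLine($"{byName} {wordNumber} {firstNumber}");
```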

If you need to group items together but don’t want to make use of a capture you can use:

(?:regex)

This works exactly as it would without the ?: but the bracket is left out of the list of capture groups. This can improve the efficiency of a regular expression but this usually isn’t an issue.
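A quick way to see the difference is to count the groups in a match. This is a small sketch with an invented pattern and input, not an example from the original text.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern twice: once with a capturing group around "ab",
// once with a non-capturing group.
var capturing = new Regex(@"(ab)+(c)");
var nonCapturing = new Regex(@"(?:ab)+(c)");

Match withCapture = capturing.Match("ababc");
Match withoutCapture = nonCapturing.Match("ababc");

// Groups.Count includes group 0, the whole match.
int capturingCount = withCapture.Groups.Count;
int nonCapturingCount = withoutCapture.Groups.Count;

// With (?:...) the (c) group moves up to become group 1.
string firstGroup = withoutCapture.Groups[1].Value;

Console.WriteLine($"{capturingCount} {nonCapturingCount} {firstGroup}");
```

Both expressions match the same text; only the list of capture groups changes.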


Reduction

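.NET’s balancing group construct has the form (?<name1-name2>regex). When regex matches, the most recent capture is removed from the capture set of name2 and a capture is stored in name1; if name2 has no captures left, the group fails to match. In many cases all you are doing is trying to reduce the capture count for name2, and in this case you can leave out any reference to name1 and write (?<-name2>regex).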

This sounds complicated but in practice it isn’t too difficult.

For example, let’s write an expression that matches any number of As followed by the same number of Bs:

Regex ex3 = new Regex( @"^(?<COUNT>A)+(?<-COUNT>B)+");

This works, up to a point, in that it matches equal numbers of As and Bs starting from the beginning of the string, but it doesn’t reject a string like AABBB, which it simply matches as AABB.

Each time the first capture group matches an A it adds a capture to COUNT’s capture set, so in this case there are two captures when the second group reaches the first B. The first B reduces the capture set to one and the second B reduces it to zero. When the third B is encountered there is nothing left to remove, so the second group fails, the repeat stops after the second B and the match succeeds with AABB.

To make the entire match fail we also have to include the condition that we should now be at the end of the string.

Regex ex3 = new Regex( @"^(?<COUNT>A)+(?<-COUNT>B)+$");

This now fails on AABBB but it matches AAABB because in this case the second capture group doesn’t fail before we reach the end of the string.
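The behaviour of the two expressions so far can be checked with a short sketch like the following. The variable names are illustrative; the test strings are the ones discussed in the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Without the end anchor the match simply stops when COUNT runs out.
var open = new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+");

// With the end anchor the whole string has to be consumed.
var anchored = new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+$");

// Only two Bs can be paired with the two As.
string partial = open.Match("AABBB").Value;

// The third B has no A to cancel, so the anchored version fails.
bool tooManyBs = anchored.IsMatch("AABBB");

// Too many As still matches: nothing checks that COUNT ended up empty.
Match tooManyAs = anchored.Match("AAABB");

// One A was never cancelled by a B, so one capture is left over.
int leftOver = tooManyAs.Groups["COUNT"].Captures.Count;

Console.WriteLine($"{partial} {tooManyBs} {tooManyAs.Success} {leftOver}");
```

The leftover capture in COUNT is exactly what still needs to be tested for.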

We really need a test that amounts to

“at the end of the string/match the count capture group should be empty”

To do this we need some sort of conditional test on the capture and .NET provides just this:

(?(name)regex1|regex2)

will use regex1 if the capture is non-empty and regex2 if it is empty.

In fact this conditional is more general than this in that name can be a general regular expression. You can leave regex2 out if you want an “if then” rather than an “if then else”.

With this our new expression is:

Regex ex3 = new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+(?(COUNT)^.)$");

The ^. can never match at this point, because ^ demands the start of the string and we are already past it, and so it forces the match to fail if the capture group isn’t empty.

A more symmetrical if…then…else form of the same expression is:

Regex ex3 = new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+(?(COUNT)^.|(?=$))");

In this case the else part of the conditional asserts that we are at the end of the string.
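As a final check, this sketch runs both conditional forms against the strings used earlier. The patterns are written without any space before the conditional, since a space in the pattern would have to match a literal space unless RegexOptions.IgnorePatternWhitespace is used. The variable names and the helper Describe are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// "if then" form: fail if COUNT still holds a capture, then require end of string.
var ifThen = new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+(?(COUNT)^.)$");

// "if then else" form: the else branch asserts the end of the string.
var ifThenElse = new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+(?(COUNT)^.|(?=$))");

// Hypothetical helper that reports how both forms treat one input.
string Describe(string input) =>
    $"{input}: {ifThen.IsMatch(input)} {ifThenElse.IsMatch(input)}";

// Equal numbers of As and Bs are accepted.
Console.WriteLine(Describe("AABB"));
Console.WriteLine(Describe("AB"));

// Too many Bs or too many As are both rejected.
Console.WriteLine(Describe("AABBB"));
Console.WriteLine(Describe("AAABB"));
```

Both forms should agree on every input, accepting only strings with the same number of As and Bs.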


Last Updated ( Thursday, 16 July 2020 )