Reduction
Often all you want to do is reduce the capture count for name2, and in that case you can leave out any reference to name1.
This sounds complicated but in practice it isn’t too difficult.
For example, let’s write an expression that matches any number of As followed by the same number of Bs:
Regex ex3 = new Regex( @"^(?<COUNT>A)+(?<-COUNT>B)+");
This works, up to a point, in that it matches equal numbers of As and Bs starting from the beginning of the string, but it doesn’t reject a string like AABBB, which it simply matches as AABB.
Each time the first capture group hits an A it adds a capture to the capture set, so in this case there are two captures when the second capture group hits the first B. This reduces COUNT’s capture set to one, and then to zero when the second B is encountered. When the third B is encountered there is no capture left to remove, so the second group fails; the match backs off to the end of the second B and succeeds there.
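To see this behaviour directly, here is a minimal console sketch. The class and method names are invented for illustration, and it uses Console.WriteLine rather than MessageBox so that it runs without a UI. It reports what the expression actually matched and how many COUNT captures are left over afterwards.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original article.
public static class CountDemo
{
    // The expression from the text: add a capture per A, remove one per B.
    private static readonly Regex Ex3 =
        new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+");

    // Returns the text that was matched, or null if there was no match.
    public static string MatchedText(string input)
    {
        Match m = Ex3.Match(input);
        return m.Success ? m.Value : null;
    }

    // Returns how many COUNT captures remain after the match (-1 if no match).
    public static int RemainingCount(string input)
    {
        Match m = Ex3.Match(input);
        return m.Success ? m.Groups["COUNT"].Captures.Count : -1;
    }

    public static void Main()
    {
        Console.WriteLine(MatchedText("AABB"));     // AABB
        Console.WriteLine(MatchedText("AABBB"));    // AABB (third B is left out)
        Console.WriteLine(RemainingCount("AABBB")); // 0
        Console.WriteLine(RemainingCount("AAAB"));  // 2 (too many As still match)
    }
}
```

The last line shows the other half of the problem: with more As than Bs the expression still matches and simply leaves captures behind.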
To make the entire match fail we also have to include the condition that we should now be at the end of the string.
Regex ex3 = new Regex(
@"^(?<COUNT>A)+(?<-COUNT>B)+$");
This now fails on AABBB but it matches AAABB, because in this case the second capture group doesn’t fail before we reach the end of the string.
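A quick check of the anchored expression, again as a sketch with made-up class and method names, shows both results and the capture that is left over in the AAABB case:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original article.
public static class AnchoredCountDemo
{
    // Same expression as before, but anchored to the end of the string.
    private static readonly Regex Ex3 =
        new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+$");

    public static bool Matches(string input)
    {
        return Ex3.IsMatch(input);
    }

    // Number of COUNT captures remaining after a match (-1 if no match).
    public static int LeftOver(string input)
    {
        Match m = Ex3.Match(input);
        return m.Success ? m.Groups["COUNT"].Captures.Count : -1;
    }

    public static void Main()
    {
        Console.WriteLine(Matches("AABB"));   // True
        Console.WriteLine(Matches("AABBB"));  // False
        Console.WriteLine(Matches("AAABB"));  // True, although the counts differ
        Console.WriteLine(LeftOver("AAABB")); // 1
    }
}
```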
We really need a test that amounts to
“at the end of the string/match the count capture group should be null”
To do this we need some sort of conditional test on the capture and .NET provides just this:
(?(name)regex1|regex2)
will use regex1 if the capture is non-empty and regex2 if it is empty.
In fact this conditional is more general than this in that name can be a general regular expression. You can leave regex2 out if you want an “if then” rather than an “if then else”.
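As a small illustration of the expression form of the test, the sketch below uses a lookahead as the condition: if the next character is a digit, the input must be exactly four digits; otherwise it must be one or more lowercase letters. The pattern and the class name are invented for this example, and writing the condition as an explicit (?=...) lookahead is a choice made here to avoid any confusion with a group name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original article.
public static class ConditionalExpressionDemo
{
    // If a digit comes next use \d{4}, otherwise use [a-z]+.
    private static readonly Regex Ex =
        new Regex(@"^(?(?=\d)\d{4}|[a-z]+)$");

    public static bool Matches(string input)
    {
        return Ex.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(Matches("2008")); // True  (condition true, four digits)
        Console.WriteLine(Matches("abc"));  // True  (condition false, letters)
        Console.WriteLine(Matches("20ab")); // False (condition true, not four digits)
    }
}
```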
With this our new expression is:
Regex ex3 = new Regex( @"^(?<COUNT>A)+(?<-COUNT>B)+(?(COUNT)^.)$");
The ^. can never match, because by this point we are no longer at the start of the string, and so it forces the match to fail if the capture group isn’t empty.
A more symmetrical if…then…else form of the same expression is:
Regex ex3 = new Regex( @"^(?<COUNT>A)+(?<-COUNT>B)+(?(COUNT)^.|(?=$))");
In this case the else part of the conditional asserts that we are at the end of the string.
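The two forms can be compared side by side. This is only a sketch, with invented class and method names, and both patterns are written without any embedded spaces because a space in the pattern would be taken as a literal character to match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original article.
public static class BalancedABDemo
{
    // "If then" form: fail if any COUNT capture is left, then require the end.
    private static readonly Regex IfThen =
        new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+(?(COUNT)^.)$");

    // "If then else" form: fail if COUNT is non-empty, else assert the end.
    private static readonly Regex IfThenElse =
        new Regex(@"^(?<COUNT>A)+(?<-COUNT>B)+(?(COUNT)^.|(?=$))");

    public static bool IsBalanced(string input)
    {
        return IfThen.IsMatch(input);
    }

    public static bool IsBalancedSymmetric(string input)
    {
        return IfThenElse.IsMatch(input);
    }

    public static void Main()
    {
        string[] inputs = { "AB", "AABB", "AABBB", "AAABB", "AAB", "BA" };
        foreach (string input in inputs)
        {
            // Only AB and AABB are accepted, by both forms.
            Console.WriteLine(input + ": " + IsBalanced(input)
                + " / " + IsBalancedSymmetric(input));
        }
    }
}
```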
Replacements
So far we have created regular expressions with the idea that we can use them to test that a string meets a specification or to extract a substring.
These are the two conventional uses of regular expressions. However you can also use them to perform some very complicated string editing and rearrangements.
The whole key to this idea is the notion that you can use the captures as part of the specified replacement string. The only slight problem is that substitution strings use a slightly different syntax from regular expressions.
The Replace method:
ex1.Replace(input,substitution)
simply takes every match of the associated regular expression and performs the substitution specified. Notice that it performs the substitution on every match and the result returned is the entire string with the substitutions made.
There are other versions of the Replace method but they all work in more or less the same way.
For example, if we define the regular expression:
Regex ex1 = new Regex(@"(ISBN-13|ISBN)");
and apply the following replacement:
MessageBox.Show( ex1.Replace(@"ISBN: 978-1871962406", "ISBN-13"));
then the ISBN prefix will be replaced by ISBN-13. Notice that an ISBN-13 prefix will also be replaced by ISBN-13, so making all ISBN strings consistent. For this to work the longer alternative has to come first: alternatives are tried from left to right, so (ISBN|ISBN-13) would match only the ISBN part of ISBN-13 and produce ISBN-13-13. Also notice that if there are multiple ISBNs within the string they will all be matched and replaced. There are versions of the method that allow you to restrict the number of matches that are replaced.
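As a sketch of the difference between replacing every match and replacing only a limited number, the following uses the Replace overload that takes a count. The class name, method names and the placeholder numbers in the test string are all invented for illustration, and the pattern assumes the longer alternative is listed first.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original article.
public static class ReplaceCountDemo
{
    // ISBN-13 is listed first so that it is matched as a whole.
    private static readonly Regex Ex1 = new Regex(@"(ISBN-13|ISBN)");

    // Replace every match.
    public static string NormalizeAll(string input)
    {
        return Ex1.Replace(input, "ISBN-13");
    }

    // Replace at most 'count' matches, working from the left.
    public static string NormalizeFirst(string input, int count)
    {
        return Ex1.Replace(input, "ISBN-13", count);
    }

    public static void Main()
    {
        // Placeholder numbers, not real ISBNs.
        string text = "ISBN: 111 and ISBN: 222 and ISBN-13: 333";

        Console.WriteLine(NormalizeAll(text));
        // ISBN-13: 111 and ISBN-13: 222 and ISBN-13: 333

        Console.WriteLine(NormalizeFirst(text, 1));
        // ISBN-13: 111 and ISBN: 222 and ISBN-13: 333
    }
}
```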
This is easy enough to follow and works well as long as you have defined your regular expression precisely enough. More sophisticated is the use of capture groups within the substitution string.
You can use:
@"$n"
to refer to capture group n or:
@"${name}"
to refer to a capture group by name. There is a range of other substitution strings, but these are fairly obvious in use.
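A few of these substitutions in action, as a minimal sketch: numbered and named groups are used to swap two words, $& stands for the whole match, and $$ produces a literal dollar sign. The patterns, inputs and names here are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original article.
public static class SubstitutionDemo
{
    // Swap two words using numbered groups.
    public static string SwapNumbered(string input)
    {
        return Regex.Replace(input, @"(\w+)\s(\w+)", "$2 $1");
    }

    // Swap two words using named groups.
    public static string SwapNamed(string input)
    {
        return Regex.Replace(input,
            @"(?<first>\w+)\s(?<second>\w+)", "${second} ${first}");
    }

    // Wrap each run of digits in brackets; $& is the entire match.
    public static string BracketNumbers(string input)
    {
        return Regex.Replace(input, @"\d+", "[$&]");
    }

    // Put a dollar sign before each run of digits; $$ is a literal $.
    public static string AddDollar(string input)
    {
        return Regex.Replace(input, @"\d+", "$$$&");
    }

    public static void Main()
    {
        Console.WriteLine(SwapNumbered("hello world")); // world hello
        Console.WriteLine(SwapNamed("hello world"));    // world hello
        Console.WriteLine(BracketNumbers("item 42"));   // item [42]
        Console.WriteLine(AddDollar("cost 5"));         // cost $5
    }
}
```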
As an example of how this all works consider the problem of converting a US format date to a UK format date. First we need a regular expression to match the mm/dd/yyyy format:
Regex ex1 = new Regex( @"(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{4})");
This isn’t a particularly sophisticated regular expression: it allows one or two digits for the month and day numbers but insists on four for the year number. You can write a more interesting and flexible regular expression for use with real data. Notice that we have three named capture groups corresponding to month, day and year.
To create a European style date all we have to do is assemble the capture groups in the correct order in a substitution string:
MessageBox.Show(ex1.Replace( @"10/2/2008", "${day}/${month}/${year}"));
This substitutes the day, month and year capture groups in place of the entire matched string, i.e. the original date.
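Put together as a console sketch (the class and method names are invented, and Console.WriteLine stands in for MessageBox so that it runs without a UI), the conversion looks like this:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original article.
public static class DateSwapDemo
{
    // Matches mm/dd/yyyy with one or two digits for month and day.
    private static readonly Regex UsDate =
        new Regex(@"(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{4})");

    // Rewrites every US format date in the input as dd/mm/yyyy.
    public static string ToUkFormat(string input)
    {
        return UsDate.Replace(input, "${day}/${month}/${year}");
    }

    public static void Main()
    {
        Console.WriteLine(ToUkFormat("10/2/2008"));
        // 2/10/2008

        Console.WriteLine(ToUkFormat("From 10/2/2008 to 12/25/2008"));
        // From 2/10/2008 to 25/12/2008
    }
}
```

Note that the expression only checks the shape of the date, so something like 13/45/2008 is swapped just as readily as a valid date.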
Avoid overuse
Regular expressions are addictive in a way that can ultimately be unproductive.
It isn’t worth spending days crafting a single regular expression that matches all variations on a string when building one or two simpler alternatives and using a wider range of string operations would do the same job as well if not as neatly.
Resist the temptation to write regular expressions that you only just understand, and always make sure you test them with strings that go well outside the range of inputs that you consider correct. Greedy matching and backtracking often result in the acceptance of a wider range of strings than was originally intended.
If you take care, however, regular expressions are a very powerful way of processing and transforming text without the need to move to a complete syntax analysis package.
If you would like to suggest a topic for our Core C# section or if you have any comments contact our C# editor Mike James.