If you think regular expressions are trivial and boring, you've not seen the whole picture. Here we reveal that in .NET they are amazingly powerful and not to be missed.
Regular expressions are addictive.
Playing with these compressed but powerful patterns is better than solving a Sudoku.
If you are wondering what this is all about because, obviously, regular expressions are just the use of “*” and “?”, then read on: the truth is a lot more subtle and the result is a lot more powerful than you might suspect.
If you know the basics of regular expressions then jump to the end of the article where you will find some deeper explanations of less used features.
It all starts with the idea of specifying a grammar for a particular set of strings. All you have to do is find a pattern that matches all of the strings you are interested in and use the pattern.
The simplest sort of pattern is the string literal that matches itself. So, for example, if you want to process ISBN numbers you might well want to match the string “ISBN:” which is its own regular expression in the sense that the pattern “ISBN:” will match exactly one string of the form “ISBN:”.
To actually use this you have to first create a Regex object with the regular expression built into it:
Regex ex1 = new Regex(@"ISBN:");
The use of the “@” at the start of the string is optional but it does make it easier when we start to use the “\” escape character.
Recall that strings starting with “@” are represented “as is” without any additional processing or conversion by C#.
To actually use the regular expression we need one of the methods offered by the Regex object.
The Match method applies the expression to a specified string and returns a Match object.
The Match object contains a range of useful properties and methods that let you track the operation of applying the regular expression to the string.
For example, if there was a match the Success property of the returned Match object is set to true.
The Index property gives the position of the match in the search string.
For a search string that begins with “ISBN:” it returns zero, indicating that the match is at the start of the string.
To return the actual match in the target string you can use the ToString method. Of course in this case the result is going to be identical to the regular expression but in general this isn’t the case.
Notice that the Match method returns only the first match of the regular expression. To find the next one, call the NextMatch method on the returned Match object, which gives you another Match object.
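The properties and methods described above can be sketched in a few lines. This is a minimal console example, not the article's own listing: the class name, the target string and the use of Console.WriteLine are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchBasics
{
    // Hypothetical target string: the literal "ISBN:" occurs twice.
    public const string Input = "ISBN: 9 and ISBN: 8";

    public static void Main()
    {
        Regex ex1 = new Regex(@"ISBN:");

        // Match returns the first match only.
        Match first = ex1.Match(Input);
        Console.WriteLine(first.Success);     // True
        Console.WriteLine(first.Index);       // 0
        Console.WriteLine(first.ToString());  // ISBN:

        // NextMatch continues the search from where the last match ended.
        Match second = first.NextMatch();
        Console.WriteLine(second.Success);    // True
        Console.WriteLine(second.Index);      // 12

        // When nothing is left, Success is simply false.
        Console.WriteLine(second.NextMatch().Success); // False
    }
}
```

A failed match is still a Match object, so testing Success is the usual way to end a NextMatch loop.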
If this were all there was to regular expressions they wouldn’t be very interesting.
The reason they are so useful is that you can specify patterns that spell out the regularities in a type of data.
For example following the ISBN: we expect to find a digit – any digit.
This can be expressed as “ISBN:\d” where \d is a character class that means “a digit”.
If you try this out you will discover that you don’t get a match with the example string because there is a space following the colon. However “ISBN:\s\d” does match as \s means “any white-space character” and:
Regex ex1 = new Regex(@"ISBN:\s\d");
when matched against the example string and converted with ToString, displays “ISBN: 9”.
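To see the difference between the two patterns side by side, here is a small sketch. The target string and the Describe helper are hypothetical, chosen only so that a space follows the colon as in the example discussed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitAfterColon
{
    // Hypothetical helper: returns the matched text or a placeholder.
    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.ToString() : "(no match)";
    }

    public static void Main()
    {
        // Hypothetical target string with a space after the colon.
        string input = "ISBN: 9781234567897";

        // \d demands a digit immediately after the colon, so this fails.
        Console.WriteLine(Describe(@"ISBN:\d", input));    // (no match)

        // \s consumes the space, then \d matches the first digit.
        Console.WriteLine(Describe(@"ISBN:\s\d", input));  // ISBN: 9
    }
}
```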
There’s a range of useful character classes and you can look them up in the documentation. The most useful are:
- . (i.e. a single dot) matches any character.
- \d digit
- \s white-space
- \w any “word” character including digits
There is also the convention that capital letters match the inverse set of characters:
- \D any non-digit
- \S any non-white space
- \W any non-word character
Notice that the inverse sets can behave unexpectedly unless you are very clear about what they mean.
For example, \D also matches white space and hence
@"ISBN:\D\d"
matches ISBN: 9.
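The behavior of the inverse classes on a space character can be checked directly. The following is a sketch with a hypothetical target string, using the static Regex.Match and Regex.IsMatch methods rather than a Regex instance.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InverseClasses
{
    public static void Main()
    {
        // Hypothetical target string.
        string input = "ISBN: 9";

        // The space is a non-digit, so \D matches it.
        Console.WriteLine(Regex.Match(input, @"ISBN:\D\d").ToString()); // ISBN: 9

        // A single space tested against each inverse class.
        Console.WriteLine(Regex.IsMatch(" ", @"\D")); // True  (not a digit)
        Console.WriteLine(Regex.IsMatch(" ", @"\W")); // True  (not a word character)
        Console.WriteLine(Regex.IsMatch(" ", @"\S")); // False (it is white space)
    }
}
```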
You can also make up your own character group by listing the set of characters between square brackets.
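So for example, [0-9] matches the same characters as \d in ordinary ASCII text. Negating a character set is also possible: [^0-9] matches anything but these digits and plays the same role as \D. Strictly speaking, \d in .NET also matches decimal digits from other Unicode scripts unless the RegexOptions.ECMAScript option is set, so the two are only identical for ASCII input.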
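A short sketch of custom character sets follows. The target string is hypothetical, and the Arabic-Indic digit is included only to show where [0-9] and \d part company in .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterSets
{
    public static void Main()
    {
        // Hypothetical target string.
        string input = "ISBN: 9";

        // First character in the set, then first character not in it.
        Console.WriteLine(Regex.Match(input, @"[0-9]").ToString());  // 9
        Console.WriteLine(Regex.Match(input, @"[^0-9]").ToString()); // I

        // U+0663 ARABIC-INDIC DIGIT THREE is a Unicode decimal digit.
        string arabicThree = "\u0663";
        Console.WriteLine(Regex.IsMatch(arabicThree, @"\d"));    // True
        Console.WriteLine(Regex.IsMatch(arabicThree, @"[0-9]")); // False

        // With the ECMAScript option \d is restricted to 0-9.
        Console.WriteLine(
            Regex.IsMatch(arabicThree, @"\d", RegexOptions.ECMAScript)); // False
    }
}
```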
There are also character sets that refer to Unicode but these are obvious enough in use not to need additional explanation.
As well as characters and character sets you can also use location matches or anchors.
For example, the ^ (caret) matches only at the start of the string, so @"^ISBN:"
will only match if the string starts with ISBN: and doesn’t match if the same substring occurs anywhere else. The most useful anchors are:
- ^ start of string
- $ end of string
- \b word boundary – i.e. between a \w and \W
- \B anywhere but a word boundary
So for example:
@"^\d+$"
specifies a string consisting of nothing but digits. Compare this to
@"^\d*$"
which would also accept a null string.
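The effect of the anchors is easy to confirm with a few test strings. This is a sketch only: the helper names and sample strings are hypothetical, and the + and * quantifiers are used here in their usual sense of “one or more” and “zero or more”.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // One or more digits and nothing else.
    public static bool AllDigits(string s)
    {
        return Regex.IsMatch(s, @"^\d+$");
    }

    // Zero or more digits, so the empty string is accepted too.
    public static bool AllDigitsOrEmpty(string s)
    {
        return Regex.IsMatch(s, @"^\d*$");
    }

    public static void Main()
    {
        Console.WriteLine(AllDigits("12345"));    // True
        Console.WriteLine(AllDigits("12a45"));    // False
        Console.WriteLine(AllDigits(""));         // False
        Console.WriteLine(AllDigitsOrEmpty(""));  // True

        // \b keeps "cat" from matching inside "concat".
        string text = "cat concat cat";
        Console.WriteLine(Regex.Matches(text, @"cat").Count);     // 3
        Console.WriteLine(Regex.Matches(text, @"\bcat\b").Count); // 2
    }
}
```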
One subtle point only emerges when you consider strings with line breaks.
In this case by default the ^ and $ match only the very start and end of the string.
If you want them to match line beginnings and endings you have to specify the RegexOptions.Multiline option (or the inline (?m) construct). It’s also worth knowing about the \G anchor, which only matches at the point where the previous match ended. It is mainly useful with the NextMatch method, where it forces all matches to be contiguous.
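Both points can be illustrated with a brief sketch. The sample strings are hypothetical, and Regex.Matches is used as a shortcut for looping with NextMatch; the lines are separated by "\n" only, to keep the example simple.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineAndG
{
    public static void Main()
    {
        // Hypothetical three-line string.
        string lines = "123\nabc\n456";

        // Default: ^ and $ anchor to the whole string, so nothing matches.
        Console.WriteLine(Regex.Matches(lines, @"^\d+$").Count); // 0

        // Multiline: ^ and $ anchor to each line, so "123" and "456" match.
        Console.WriteLine(
            Regex.Matches(lines, @"^\d+$", RegexOptions.Multiline).Count); // 2

        // Hypothetical data with a non-numeric token in the middle.
        string data = "12 34 x 56";

        // Without \G the search skips over "x " and finds all three numbers.
        Console.WriteLine(Regex.Matches(data, @"\d+\s*").Count);   // 3

        // With \G each match must start where the previous one ended,
        // so matching stops at "x".
        Console.WriteLine(Regex.Matches(data, @"\G\d+\s*").Count); // 2
    }
}
```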