Regular expressions are addictive.
Playing with these compressed but powerful patterns is better than solving a Sudoku.
If you think that regular expressions are just a matter of using “*” and “?”, then read on, because the truth is a lot more subtle and the result is a lot more powerful than you might suspect.
Regular expressions are also something that you will find in more than just C#. They are useful in JavaScript, Perl, Java, Ruby and even in applications such as word processors.
Regular fundamentals
It all starts with the idea of specifying a grammar for a particular set of strings. All you have to do is find a pattern that matches all of the strings you are interested in and use the pattern.
The simplest sort of pattern is the string literal that matches itself. So, for example, if you want to process ISBN numbers you might well want to match the string “ISBN:” which is its own regular expression in the sense that the pattern “ISBN:” will match exactly one string of the form “ISBN:”.
To actually use this you have to first create a Regex object with the regular expression built into it:
Regex ex1 = new Regex(@"ISBN:");
The use of the “@” at the start of the string is optional, but it does make things easier when we start to use the “\” escape character. Recall that strings starting with “@” are verbatim strings: they are represented “as is”, without any escape-sequence processing or conversion by C#.
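To see what the “@” saves you, here is a minimal sketch that writes the same pattern both ways. The class name VerbatimDemo is purely illustrative, and the pattern uses \d, the digit class that is introduced a little later.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    public static void Main()
    {
        // Regular literal: every backslash has to be doubled for C#.
        string escaped = "ISBN:\\d";
        // Verbatim literal: the backslash reaches the regex engine unchanged.
        string verbatim = @"ISBN:\d";

        // Both literals produce exactly the same pattern string.
        Console.WriteLine(escaped == verbatim);                      // True
        Console.WriteLine(new Regex(verbatim).IsMatch("ISBN:9"));    // True
    }
}
```

With only one backslash the saving looks small, but patterns with many escapes quickly become hard to read without the “@”.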
To actually use the regular expression we need one of the methods offered by the Regex object.
The Match method applies the expression to a specified string and returns a Match object.
The Match object contains a range of useful properties and methods that let you track the operation of applying the regular expression to the string.
For example, if there was a match the Success property is set to true as in:
MessageBox.Show(ex1.Match(@"ISBN:978-1871962406").Success.ToString());
The Index property gives the position of the match in the search string:
MessageBox.Show(ex1.Match(@"ISBN: 978-1871962406").Index.ToString());
which in this case returns zero to indicate that the match is at the start of the string.
To return the actual matched text in the target string you can use the ToString method or, equivalently, the Value property. Of course in this case the result is going to be identical to the regular expression, but in general this isn’t the case.
Notice that the Match method returns only the first match to the regular expression. To find later matches, call the NextMatch method of the Match object, which returns another Match object for the next match.
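As a rough sketch of how Match, Success, Index and NextMatch fit together, the following console program walks through every match in a string. The helper FindAll and the input string with the ISBN repeated twice are assumptions made for the example, not part of the Regex API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchDemo
{
    // Hypothetical helper: collects the start index of every match.
    public static List<int> FindAll(Regex regex, string input)
    {
        var indexes = new List<int>();
        Match m = regex.Match(input);   // the first match, if there is one
        while (m.Success)
        {
            indexes.Add(m.Index);
            m = m.NextMatch();          // search on from the end of this match
        }
        return indexes;
    }

    public static void Main()
    {
        Regex ex1 = new Regex(@"ISBN:");
        // Illustrative input: the same ISBN twice, separated by ", ".
        string input = "ISBN:978-1871962406, ISBN:978-1871962406";
        Console.WriteLine(string.Join(", ", FindAll(ex1, input)));   // 0, 21
    }
}
```

When there are no more matches NextMatch still returns a Match object, but its Success property is false, which is what ends the loop.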
Pattern matching
If this is all there was to regular expressions they wouldn’t be very interesting.
The reason they are so useful is that you can specify patterns that spell out the regularities in a type of data.
For example, following the “ISBN:” we expect to find a digit – any digit.
This can be expressed as “ISBN:\d” where \d is a character class indicator which means “a digit”.
If you try this out you will discover that you don’t get a match with the example string because there is a space following the colon. However “ISBN:\s\d” does match as \s means “any white-space character” and:
Regex ex1 = new Regex(@"ISBN:\s\d");
MessageBox.Show(ex1.Match(@"ISBN: 978-1871962406").ToString());
displays “ISBN: 9”.
There’s a range of useful character classes and you can look them up in the documentation. The most useful are:
- . (i.e. a single dot) matches any character except, by default, a newline.
- \d digit
- \s white-space
- \w any “word” character including digits
There is also the convention that capital letters match the inverse set of characters:
- \D any non-digit
- \S any non-white space
- \W any non-word character
Notice that the inverse sets can behave unexpectedly unless you are very clear about what they mean.
For example, \D also matches white space and hence
@"ISBN:\D\d"
matches “ISBN: 9”.
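The three patterns discussed so far can be tried side by side. This is only a sketch using the static Regex.IsMatch and Regex.Match methods on the article's example string, and the class name ClassDemo is made up for the purpose.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClassDemo
{
    public static void Main()
    {
        string input = "ISBN: 978-1871962406";

        // No match: the character after the colon is a space, not a digit.
        Console.WriteLine(Regex.IsMatch(input, @"ISBN:\d"));          // False
        // \s consumes the space and \d the first digit.
        Console.WriteLine(Regex.Match(input, @"ISBN:\s\d").Value);    // ISBN: 9
        // \D means "anything that is not a digit", so the space qualifies.
        Console.WriteLine(Regex.Match(input, @"ISBN:\D\d").Value);    // ISBN: 9
    }
}
```

Note that the \D version would also accept “ISBN:x9”, which is probably not what you want when validating an ISBN.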
You can also make up your own character group by listing the set of characters between square brackets.
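So, for example, [0-9] matches any one of the digits 0 to 9, which for ordinary ASCII text is the same as \d. Negating a character set is also possible: [^0-9] matches anything but those digits and, again for ASCII text, is the same thing as \D. The qualification matters because, by default, .NET’s \d also matches decimal digits from other Unicode scripts.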
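A short sketch of custom sets in action follows. It also shows one caveat that is easy to miss: by default .NET’s \d accepts decimal digits from any Unicode script, whereas [0-9] accepts only the ten ASCII digits. The class name SetDemo and the choice of the Arabic-Indic digit three (U+0663) as a test character are just for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SetDemo
{
    public static void Main()
    {
        string input = "ISBN: 978-1871962406";

        // First character that is in the set 0 to 9.
        Console.WriteLine(Regex.Match(input, @"[0-9]").Value);    // 9
        // First character that is NOT in the set 0 to 9.
        Console.WriteLine(Regex.Match(input, @"[^0-9]").Value);   // I

        // U+0663 is a decimal digit in Unicode but not an ASCII digit.
        string arabicThree = "\u0663";
        Console.WriteLine(Regex.IsMatch(arabicThree, @"\d"));      // True
        Console.WriteLine(Regex.IsMatch(arabicThree, @"[0-9]"));   // False
        // RegexOptions.ECMAScript restricts \d to the ASCII digits.
        Console.WriteLine(
            Regex.IsMatch(arabicThree, @"\d", RegexOptions.ECMAScript)); // False
    }
}
```

If you only ever want the ASCII digits, writing [0-9] explicitly is the least surprising choice.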
There are also character sets that refer to Unicode but these are obvious enough in use not to need additional explanation.