Grouping and alternatives
Regular strings often have alternative forms. For example, the ISBN designator could be simply ISBN: or it could be ISBN-13: or any of many other reasonable variations. You can specify an either/or situation using the vertical bar |, the alternation operator, as in x|y, which will match an x or a y.
For example:
@"ISBN:|ISBN-13:"
matches either ISBN: or ISBN-13:. This is easy enough but what about:
@"ISBN:|ISBN-13:\s*\d"
At first glance this seems to match either ISBN: or ISBN-13: followed by any number of white space characters and a single digit, but it doesn’t.
The | operator has the lowest priority, so the alternatives are everything to its left and everything to its right, i.e. either ISBN: on its own or ISBN-13:\s*\d.
To match the white space and digit in both forms of the ISBN prefix we would have to write:
@"ISBN:\s*\d|ISBN-13:\s*\d"
Clearly having to repeat everything that is in common on either side of the alternation operator is going to make things difficult and this is where grouping comes in. Anything grouped between parentheses is treated as a single unit – and grouping has a higher priority than the alternation operator.
So for example:
@"(ISBN:|ISBN-13:)\s*\d"
matches either form of the ISBN prefix followed by any number of white space characters and a single digit, because the parentheses limit the range of the alternation operator to the substrings to its left and right within the parentheses.
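The difference between the ungrouped and grouped patterns can be checked with a short console sketch. The class name AlternationScope, the helper FirstMatch and the test string "ISBN: 9" are assumptions made for illustration only; they are not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationScope
{
    // Hypothetical helper: returns the text of the first match,
    // or an empty string if the pattern does not match at all.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "ISBN: 9"; // assumed test string

        // Ungrouped: the alternatives are "ISBN:" and "ISBN-13:\s*\d",
        // so the bare "ISBN:" is all that is matched.
        Console.WriteLine(FirstMatch(@"ISBN:|ISBN-13:\s*\d", input));

        // Grouped: \s*\d now follows whichever alternative matched.
        Console.WriteLine(FirstMatch(@"(ISBN:|ISBN-13:)\s*\d", input));
    }
}
```

With this input the first call should return just ISBN: while the second returns ISBN: 9, which is the behavior described above.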
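An issue similar to the greedy/lazy situation also applies to the alternation operator, because the alternatives are tried in order from left to right and the first one that succeeds is used. For example, suppose you try to match the previous un-grouped expression but without the colon: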
@"ISBN|ISBN-13"
In this case the first pattern, i.e. “ISBN”, will match even if the string is “ISBN-13”. It doesn’t matter that the second expression is a “better” match. No amount of grouping will help with this problem because the shorter alternative is tried first and succeeds. The solution is either to swap the order of the sub-expressions so that the longer one comes first, or to include something that always marks the end of the target string. For example, if we add the colon then the ISBN: subexpression cannot possibly match the ISBN-13: string.
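As a rough check of the ordering behavior, the following sketch compares the two orderings and the colon-terminated form. The class name AlternationOrder and the helper FirstMatch are hypothetical names introduced here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrder
{
    // Hypothetical helper: text of the first match, or "" if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Shorter alternative first: it succeeds before the longer one is tried.
        Console.WriteLine(FirstMatch(@"ISBN|ISBN-13", "ISBN-13"));

        // Longer alternative first: the whole designator is matched.
        Console.WriteLine(FirstMatch(@"ISBN-13|ISBN", "ISBN-13"));

        // With the colon as an end marker, "ISBN:" fails at the '-' and
        // the engine falls through to the second alternative.
        Console.WriteLine(FirstMatch(@"ISBN:|ISBN-13:", "ISBN-13:"));
    }
}
```

The three calls should return ISBN, ISBN-13 and ISBN-13: respectively.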
Capture and backreference
Now that we have explored grouping it is time to introduce the most sophisticated and useful aspect of regular expressions – the idea of “capture”.
You may think that brackets are just about grouping together items that should be matched as a group, but there is more.
A subexpression, i.e. something between brackets, is said to be “captured” if it matches and captured expressions are remembered by the engine during the match. Notice that a capture can occur before the entire expression has finished matching – indeed a capture can occur even if the entire expression eventually fails to match at all.
The .NET regular expression classes make captures available via the Captures property and the CaptureCollection. Each capture group, i.e. each sub-expression surrounded by brackets, can be associated with one or more captured strings. To be clear, the expression:
@"(<div>)(</div>)"
has two capture groups which by default are numbered from left-to-right with capture group 1 being the (<div>) and capture group 2 being the (</div>). The entire expression can be regarded as capture group 0 as its results are returned first by the .NET framework.
If we try out this expression on a suitable string and get the GroupCollection result of the match using the Groups property:
Regex ex2 = new Regex(@"(<div>)(</div>)");
GroupCollection Grps = ex2.Match(@"<div></div><div></div><div></div>").Groups;
Then, in this case, we have three capture groups – the entire expression returned as Grps[0], the first bracket i.e. capture group 1 is returned as Grps[1] and the final bracket i.e. capture group 2 as Grps[2]. The first group, i.e. the entire expression, is reported as matching only once at the start of the test string – after all we only asked for the first match.
Getting the first capture group and displaying its one and only capture demonstrates this:
CaptureCollection Caps = Grps[0].Captures;
MessageBox.Show(Caps[0].Index.ToString() + " " + Caps[0].Length.ToString() + " " + Caps[0].ToString());
which displays 0 11 <div></div> corresponding to the first match of the complete expression.
Capture group 1 was similarly captured only once, at the first <div>, and:
CaptureCollection Caps = Grps[1].Captures;
MessageBox.Show(Caps[0].Index.ToString() + " " + Caps[0].Length.ToString() + " " + Caps[0].ToString());
displays 0 5 <div> to indicate that it was captured by the first <div> in the string.
The final capture group, capture group 2, was also captured only once, by the first </div>, and:
CaptureCollection Caps = Grps[2].Captures;
MessageBox.Show(Caps[0].Index.ToString() + " " + Caps[0].Length.ToString() + " " + Caps[0].ToString());
displays 5 6 </div>.
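The three results can also be listed in one pass. This is only a sketch: it writes to the console instead of using MessageBox, and the class name DivGroups and the helper Describe are assumptions introduced for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivGroups
{
    // Hypothetical helper: formats the first capture of a group as
    // "index length text". It assumes the group took part in the match.
    public static string Describe(Group g)
    {
        Capture c = g.Captures[0];
        return c.Index + " " + c.Length + " " + c.Value;
    }

    public static void Main()
    {
        Regex ex = new Regex(@"(<div>)(</div>)");
        GroupCollection grps = ex.Match("<div></div><div></div><div></div>").Groups;

        // Group 0 is the whole match; groups 1 and 2 are the two brackets.
        for (int i = 0; i < grps.Count; i++)
        {
            Console.WriteLine(Describe(grps[i]));
        }
    }
}
```

The lines written should correspond to the three values quoted above, one per group.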
Now consider the same argument over again but this time with the expression:
Regex ex2=new Regex(@"((<div>)(</div>))*");
In this case there are four capture groups including the entire expression.
Capture group 0 is the expression ((<div>)(</div>))* and this is captured once starting at 0 matching the entire string of three repeats, i.e. length 33.
The next capture group is the first, i.e. outer, bracket ((<div>)(</div>)) and it is captured three times, corresponding to the three repeats.
If you try
GroupCollection Grps = ex2.Match(@"<div></div><div></div><div></div>").Groups;
CaptureCollection Caps = Grps[1].Captures;
for (int i = 0; i <= Caps.Count - 1; i++)
{
    MessageBox.Show(Caps[i].Index.ToString() + " " + Caps[i].Length.ToString() + " " + Caps[i].ToString());
}
you will find the captures are at 0, 11 and 22.
The two remaining capture groups each record three captures: the <div> at 0, 11 and 22 and the </div> at 5, 16 and 27.
Notice that a capture is stored each time the bracket contents match.
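To see every capture of every group at once, a sketch along the following lines can be used. The class name RepeatedCaptures and the helper CaptureIndexes are hypothetical, and console output is assumed in place of MessageBox.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedCaptures
{
    // Hypothetical helper: start index of every capture stored for one group.
    public static int[] CaptureIndexes(Match m, int groupNumber)
    {
        CaptureCollection caps = m.Groups[groupNumber].Captures;
        int[] indexes = new int[caps.Count];
        for (int i = 0; i < caps.Count; i++)
        {
            indexes[i] = caps[i].Index;
        }
        return indexes;
    }

    public static void Main()
    {
        Match m = Regex.Match("<div></div><div></div><div></div>",
                              @"((<div>)(</div>))*");

        // One line per group: the group number, then its capture positions.
        for (int g = 0; g < m.Groups.Count; g++)
        {
            Console.WriteLine(g + ": " + string.Join(", ", CaptureIndexes(m, g)));
        }
    }
}
```

One point to keep in mind when a group is repeated like this is that the properties of the Group object itself, such as Index and Value, reflect only the last capture, so m.Groups[1].Index is 22 here; the earlier captures are reachable only through the Captures collection.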