Master JavaScript Regular Expressions
Written by Ian Elliot   
Thursday, 13 July 2017

Regular expressions can seem complex but the biggest reason for this is that most programmers don't take them seriously enough. Spend just a little time finding out how they work and you can do amazing things.


Some programmers hate regular expressions because they get them wrong and can't quite figure out why they work, let alone why they fail.

Other programmers reach for a good regular expression to solve almost any problem.

Why the difference?

The simple answer is that a regular expression is more powerful than you might imagine. A regular expression is a grammar that defines a set of strings: the strings that the grammar accepts, i.e. the ones that match the regular expression.

If you take regular expressions seriously as a sort of mini or "little" programming language, then you will discover their power.

Don't underestimate a regular expression.

Regular fundamentals

It all starts with the idea of specifying a grammar for a particular set of strings.

All you have to do is find a pattern that matches all of the strings you are interested in and use the pattern.

The simplest sort of pattern is the string literal that matches itself.

So, for example, if you want to process ISBN numbers you might well want to match the string “ISBN:” which is its own regular expression in the sense that the pattern “ISBN:” will match exactly one string of the form “ISBN:”.

To actually use this you have to first create a RegExp object with the regular expression built into it:

var ex1= new RegExp("ISBN:");

You can also specify the same RegExp object as a literal:

var ex1= /ISBN:/;

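The only difference is when the expression is compiled. A literal is compiled when the script is loaded, whereas the constructor compiles the expression at run time. Use the constructor if you know that the expression is going to change or if it is supplied in a variable.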
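As a rough sketch of the case where the constructor is the natural choice, the following builds a pattern from a value that is only known at run time. The helper name makePrefixMatcher and the sample strings are made up for illustration, and the prefix is assumed to contain no characters that have a special meaning in a regular expression.

```javascript
// Literal form: the pattern is fixed in the source code.
const literal = /ISBN:/;

// Constructor form: the pattern is assembled from a string at run time.
// makePrefixMatcher is a hypothetical helper, not part of any library.
// It assumes prefix contains no regular expression metacharacters.
function makePrefixMatcher(prefix) {
  return new RegExp(prefix + ":");
}

const built = makePrefixMatcher("ISBN");

// Both objects hold the same pattern text.
console.log(literal.source === built.source); // true

// The same helper can produce a different expression on demand.
console.log(makePrefixMatcher("ISSN").test("ISSN:1234-5678")); // true
```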

To actually use the regular expression we need one of the two RegExp methods, test or exec, or one of the many string functions that accept a regular expression. This article concentrates on the form of the regular expression you can use rather than the methods that use them. However, we do need some methods to test things out.

Let's start with one of the standard RegExp methods - test.

The test method simply returns true or false depending on whether or not the expression matches the string. For example:

var test=ex1.test("ISBN: 978-1871962406");

sets test to true and:

var test=ex1.test("ISBxN:978-1871962406");

sets test to false.

Notice that the test method tells you nothing at all about the nature of the match - just whether or not there is one.

A more useful and informative method is the RegExp exec method, which returns a lot of information, but for the moment you can consider it as returning the first match in the string specified.

If you are in any doubt simply look the methods up in the documentation. For the simple expressions used here, the string function match works like exec. The string function search returns the position of the match or -1 if there is no match. The replace and split functions are slightly more complicated.
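A short sketch of what these methods return may help. The sample string is invented for illustration, and the expression has no flags, which is the case where match behaves like exec.

```javascript
const ex1 = /ISBN:/;
const text = "Ref ISBN:978-1871962406"; // made-up sample string

// exec returns an array holding the matched text, with the position
// of the match in its index property, or null if there is no match.
const result = ex1.exec(text);
console.log(result[0]);    // "ISBN:"
console.log(result.index); // 4

// Without the g flag, the string function match gives the same kind of result.
const viaMatch = text.match(ex1);
console.log(viaMatch[0]);    // "ISBN:"
console.log(viaMatch.index); // 4

// search returns only the position, or -1 if there is no match.
console.log(text.search(ex1));           // 4
console.log("no code here".search(ex1)); // -1
console.log(ex1.exec("no code here"));   // null
```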

Pattern matching

If this is all there was to regular expressions they wouldn’t be very interesting. The reason they are so useful is that you can specify patterns that spell out the regularities in the data. They allow you to specify what a valid telephone number, password or serial number looks like.

For example, following the ISBN: we expect to find a digit – any digit. This can be expressed as “ISBN:\d” where \d is a character class indicator which means “a digit”.

Now we come to an irritation.

A lot of the RegExp symbols start with a backslash. If you are using the RegExp constructor note that the backslash has to be doubled up. The reason is that it is also the string escape character and to enter a single \ you have to enter \\.

However, if you use the RegExp literal form you don't have to double the backslash.

For example:

var ex1= new RegExp("ISBN:\\d");

and

var ex1=/ISBN:\d/;

create the same RegExp object.
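One way to check this, and to see what goes wrong when the backslash is not doubled, is to look at the source property of each object. This is only a sketch using the “ISBN:\d” pattern discussed above; the variable names are chosen for illustration.

```javascript
// "\\" inside the string is a single backslash in the pattern.
const fromString = new RegExp("ISBN:\\d");
const fromLiteral = /ISBN:\d/;

console.log(fromString.source);                        // ISBN:\d
console.log(fromString.source === fromLiteral.source); // true

// With a single backslash the string escape "\d" is just the letter d,
// so the pattern looks for a literal d rather than a digit.
const mistaken = new RegExp("ISBN:\d");
console.log(mistaken.source);         // ISBN:d
console.log(mistaken.test("ISBN:9")); // false
console.log(fromString.test("ISBN:9")); // true
```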

The only problem is when you need to include a / character in the literal form - something that occurs a lot when working with HTML. Because / marks the end of the literal you have to escape it with a backslash. For example, to search for the / in </div> you would have to enter:

var ex1=/<\/div>/;

For simplicity from now on we will use the literal form to avoid doubling up on backslashes as much as possible.

To sum up:

  • remember to double up on the backslash in quoted strings.
  • remember to use \/ if you want to search for a / in a RegExp literal.
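The two rules can be seen side by side in a small sketch. The variable names and the sample HTML string are made up for illustration.

```javascript
const html = "<div>text</div>"; // made-up sample

// Literal form: the / has to be escaped because / also ends the literal.
const closeTagLiteral = /<\/div>/;

// Constructor form: the / needs no escape because the quotes delimit
// the pattern, but any backslash would have to be doubled.
const closeTagString = new RegExp("</div>");

console.log(closeTagLiteral.test(html)); // true
console.log(closeTagString.test(html));  // true
console.log(html.search(closeTagLiteral)); // 9
```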

If you try the expression involving just \d you will discover that you don’t get a match with the example string because there is a space following the colon.

However “ISBN:\s\d” does match as \s means “any white-space character” and:

var ex1=/ISBN:\s\d/;
var t=ex1.test("ISBN: 978-1871962406");

sets t to true.

There’s a range of symbols that match useful character classes and you can look them up in the documentation. The most useful are:

. (i.e. a single dot) matches any character except a line terminator
\d digit
\s white-space
\w any “word” character, i.e. a letter, a digit or an underscore

 

There is also the convention that capital letters match the inverse set of characters:

\D any non-digit
\S any non white-space
\W any non-word character 
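To get a feel for these classes you can test single characters against each one. The function classify below is a hypothetical helper written for this sketch, not a built-in.

```javascript
// Report which of the shorthand classes match a single character.
// classify is a made-up helper for illustration only.
function classify(ch) {
  return {
    digit: /\d/.test(ch),
    space: /\s/.test(ch),
    word: /\w/.test(ch),
    nonDigit: /\D/.test(ch),
    nonSpace: /\S/.test(ch),
    nonWord: /\W/.test(ch)
  };
}

console.log(classify("7")); // digit, word and nonSpace are true
console.log(classify(" ")); // space, nonDigit and nonWord are true
console.log(classify("_")); // word, nonDigit and nonSpace are true

// The dot matches any of these characters, but not a newline.
console.log(/./.test(" "));  // true
console.log(/./.test("\n")); // false
```

Each character falls into exactly one of each pair, so \d and \D, \s and \S, and \w and \W never both match the same character.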

 

Notice that the inverse sets can behave unexpectedly unless you are very clear about what they mean.

For example, \D also matches white space and hence

/ISBN:\D\d/

matches ISBN: 9. 
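A quick check of this behavior, with sample strings invented for illustration:

```javascript
const ex = /ISBN:\D\d/;

// The space counts as a non-digit, so this matches.
console.log(ex.test("ISBN: 9")); // true

// Any other non-digit works just as well.
console.log(ex.test("ISBN:x9")); // true

// Here the character after the colon is a digit, so \D fails.
console.log(ex.test("ISBN:99")); // false
```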

You can also make up your own character group by listing the set of characters between square brackets.

So for example, [0-9] is the same as \d. Negating a character set is also possible and [^0-9] matches anything but the digits and is the same thing as \D.
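The following sketch compares the bracketed sets with their shorthand equivalents. The hexDigit set is an extra example that is not in the text above, added only to show a set with no shorthand.

```javascript
const digitSet = /[0-9]/;
const notDigitSet = /[^0-9]/;

// [0-9] and \d agree on a digit.
console.log(digitSet.test("5"), /\d/.test("5")); // true true

// [^0-9] and \D agree as well.
console.log(notDigitSet.test("5"), /\D/.test("5")); // false false
console.log(notDigitSet.test("x"), /\D/.test("x")); // true true

// An illustrative set of our own: a single hexadecimal digit.
const hexDigit = /[0-9a-fA-F]/;
console.log(hexDigit.test("c")); // true
console.log(hexDigit.test("g")); // false
```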

There are also character sets that refer to Unicode but these are obvious enough in use not to need additional explanation.
