2013-08-08

The flag /g of JavaScript’s regular expressions

This blog post describes when and how to use regular expressions whose flag /g is set and what can go wrong.
(If you want to read a more general introduction to regular expressions, consult [1].)

The flag /g of regular expressions

Sometimes, a regular expression should match the same string multiple times. Then the regular expression object needs to be created with the flag /g set (either via a regular expression literal or via the constructor RegExp). As a result, the property global of the regular expression object is true and several operations behave differently.
    > var regex = /x/g;
    > regex.global
    true
The property lastIndex is used to keep track of where in the string matching should continue, as we shall see in a moment.
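As a minimal sketch of the two ways of setting the flag mentioned above, the following compares a literal with the constructor RegExp. The variable names are only illustrative.

```javascript
// Two ways of creating a regular expression with the flag /g set
var viaLiteral = /x/g;
var viaConstructor = new RegExp('x', 'g');
// For comparison: the same pattern without the flag
var withoutFlag = /x/;

console.log(viaLiteral.global);      // true
console.log(viaConstructor.global);  // true
console.log(withoutFlag.global);     // false

// A freshly created regular expression starts matching at index 0
console.log(viaConstructor.lastIndex); // 0
```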

RegExp.prototype.test(): determining whether there is a match

Regular expressions have the method
    RegExp.prototype.test(str)
Without the flag /g, the method test() of regular expressions simply checks whether there is a match somewhere in str:
    > var str = '_x_x';

    > /x/.test(str)
    true
With the flag /g set, test() returns true as many times as there are matches in the string. lastIndex contains the index after the last match.
    > var regex = /x/g;
    > regex.lastIndex
    0
    > regex.test(str)
    true
    > regex.lastIndex
    2
    > regex.test(str)
    true
    > regex.lastIndex
    4
    > regex.test(str)
    false
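The interaction above stops at the first false. As a small sketch of what happens afterwards: a failed match resets lastIndex to 0, so the next call starts over at the beginning of the string. The loop bound of 5 is an arbitrary choice for illustration.

```javascript
var str = '_x_x';
var regex = /x/g;

// Record the result of test() together with lastIndex after each call
var results = [];
for (var i = 0; i < 5; i++) {
    results.push(regex.test(str) + ' ' + regex.lastIndex);
}

console.log(results.join(', '));
// true 2, true 4, false 0, true 2, true 4
```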

String.prototype.search(): finding the index of a match

Strings have the method
    String.prototype.search(regex)
This method ignores the properties global and lastIndex of regex. It returns the index of the first match, or -1 if there is no match.
    > '_x_x'.search(/x/)
    1
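To make the claim about global and lastIndex concrete, here is a short sketch that calls search() with a /g regular expression whose lastIndex was moved past the first match. The starting value 2 is arbitrary.

```javascript
var regex = /x/g;
regex.lastIndex = 2; // would make test() or exec() skip the first 'x'

// search() still reports the first match in the string
console.log('_x_x'.search(regex)); // 1

// ...and leaves lastIndex as it found it
console.log(regex.lastIndex); // 2

// No match is signaled via -1
console.log('abc'.search(regex)); // -1
```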

RegExp.prototype.exec(): capturing groups, optionally repeatedly

Regular expressions have the method
    RegExp.prototype.exec(str)
If the flag /g is not set then this method always returns the match object [1] for the first match:
    > var str = '_x_x';
    > var regex1 = /x/;

    > regex1.exec(str)
    [ 'x', index: 1, input: '_x_x' ]
    > regex1.exec(str)
    [ 'x', index: 1, input: '_x_x' ]
If the flag /g is set, then all matches are returned – the first one on the first invocation, the second one on the second invocation, etc.
    > var regex2 = /x/g;

    > regex2.exec(str)
    [ 'x', index: 1, input: '_x_x' ]
    > regex2.exec(str)
    [ 'x', index: 3, input: '_x_x' ]
    > regex2.exec(str)
    null
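In practice, the repeated invocations are usually written as a loop that runs until exec() returns null. The following sketch collects the index of each match. The names indices and match are illustrative.

```javascript
var str = '_x_x';
var regex = /x/g;

var indices = [];
var match;
// exec() returns the next match object, or null when there are no more matches
while ((match = regex.exec(str)) !== null) {
    indices.push(match.index);
}

console.log(indices);         // [ 1, 3 ]
console.log(regex.lastIndex); // 0 (reset by the final, failed exec())
```

One caveat, stated as a general observation: a regular expression that can match the empty string (such as /a*/g) does not advance lastIndex on an empty match, so a loop like this one would need extra handling for that case.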

String.prototype.match(): capturing groups or returning all matching substrings

Strings have the method
    String.prototype.match(regex)
If the flag /g of regex is not set then this method behaves like RegExp.prototype.exec(). If the flag /g is set then this method returns all matching substrings of the string (every group 0). If there is no match then null is returned.
    > var regex = /x/g;

    > '_x_x'.match(regex)
    [ 'x', 'x' ]
    > 'abc'.match(regex)
    null
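The difference between the two modes is easier to see with a capturing group. The pattern and input below are made up for illustration.

```javascript
var str = '_x1_x2';

// Without /g: like exec(), a match object for the first match,
// including the capturing group and the property index
var first = str.match(/x(\d)/);
console.log(first[0], first[1], first.index); // x1 1 1

// With /g: only the complete matches (group 0), the capturing group is dropped
var all = str.match(/x(\d)/g);
console.log(all); // [ 'x1', 'x2' ]
```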

String.prototype.replace(): search and replace

Strings have the method
    String.prototype.replace(search, replacement)
If search is either a string or a regular expression whose flag /g is not set, then only the first match is replaced. If the flag /g is set, then all matches are replaced.
    > '_x_x'.replace(/x/, 'y')
    '_y_x'
    > '_x_x'.replace(/x/g, 'y')
    '_y_y'
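A brief sketch of the remaining case, a string as search value, plus one observation that is not stated above: with the flag /g set, replace() starts at the beginning of the string regardless of the current lastIndex. The value 3 is arbitrary.

```javascript
// A string as search value: only the first occurrence is replaced
console.log('_x_x'.replace('x', 'y')); // _y_x

// The string is matched literally, not interpreted as a pattern
console.log('a.b.c'.replace('.', '-')); // a-b.c

// With /g, all matches are replaced, even if lastIndex was moved beforehand
var regex = /x/g;
regex.lastIndex = 3;
console.log('_x_x'.replace(regex, 'y')); // _y_y
```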

The problem with the /g flag

Regular expressions whose /g flag is set are problematic if a method working with them must be invoked multiple times to return all results. That’s the case for two methods:
  • RegExp.prototype.test()
  • RegExp.prototype.exec()
Then JavaScript abuses the regular expression as an iterator, as a pointer into the sequence of results. That causes problems:
  • You can’t inline the regular expression when you call those methods. For example:
        // Don’t do that:
        var count = 0;
        while (/a/g.test('babaa')) count++;
    
    The above loop is infinite, because a new regular expression is created for each loop iteration, which restarts the iteration over the results. Therefore, the above code must be rewritten:
        var count = 0;
        var regex = /a/g;
        while (regex.test('babaa')) count++;
    
    Note: not inlining regular expressions is a best practice anyway, but you have to be aware that inlining is not even an option in quick hacks.
  • Code that wants to invoke test() and exec() multiple times must be careful with regular expressions handed to it as parameters: it must ensure that their flag /g is set and it must reset their lastIndex.
The following example illustrates the latter problem.

Example: counting occurrences

The following is a naive implementation of a function that counts how many matches there are for the regular expression regex in the string str.
    // Naive implementation
    function countOccurrences(regex, str) {
        var count = 0;
        while (regex.test(str)) count++;
        return count;
    }
An example of using this function:
    > countOccurrences(/x/g, '_x_x')
    2
The first problem is that this function goes into an infinite loop if the regular expression’s /g flag is not set, e.g.:
    countOccurrences(/x/, '_x_x')
The second problem is that the function doesn’t work correctly if regex.lastIndex isn’t 0. For example:
    > var regex = /x/g;
    > regex.lastIndex = 2;
    2
    > countOccurrences(regex, '_x_x')
    1
The following implementation fixes the two problems:
    function countOccurrences(regex, str) {
        if (! regex.global) {
            throw new Error('Please set flag /g of regex');
        }
        var origLastIndex = regex.lastIndex;  // store
        regex.lastIndex = 0;

        var count = 0;
        while (regex.test(str)) count++;

        regex.lastIndex = origLastIndex;  // restore
        return count;
    }
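To check that both problems are indeed gone, the following sketch repeats the fixed implementation (so that it runs on its own) and exercises the two problematic cases from above.

```javascript
// Copy of the fixed implementation, repeated to keep this sketch self-contained
function countOccurrences(regex, str) {
    if (! regex.global) {
        throw new Error('Please set flag /g of regex');
    }
    var origLastIndex = regex.lastIndex;  // store
    regex.lastIndex = 0;

    var count = 0;
    while (regex.test(str)) count++;

    regex.lastIndex = origLastIndex;  // restore
    return count;
}

// A lastIndex other than 0 no longer affects the result...
var regex = /x/g;
regex.lastIndex = 2;
console.log(countOccurrences(regex, '_x_x')); // 2
// ...and is still intact afterwards
console.log(regex.lastIndex); // 2

// A missing flag /g leads to an exception instead of an infinite loop
try {
    countOccurrences(/x/, '_x_x');
} catch (e) {
    console.log(e.message); // Please set flag /g of regex
}
```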

Using match() to count occurrences

A simpler alternative is to use match():
    function countOccurrences(regex, str) {
        if (! regex.global) {
            throw new Error('Please set flag /g of regex');
        }
        return (str.match(regex) || []).length;
    }
One possible pitfall: str.match() returns null, not an empty array, if the /g flag is set and there are no matches. The code above handles that case by using [] whenever the result of match() is not truthy.
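A short usage sketch of the match()-based version, again repeating the function so that the code runs on its own. It also shows one difference to the test()-based version that is not mentioned above: match() with the flag /g leaves lastIndex at 0 instead of restoring the previous value.

```javascript
// Copy of the match()-based implementation, repeated to keep this sketch self-contained
function countOccurrences(regex, str) {
    if (! regex.global) {
        throw new Error('Please set flag /g of regex');
    }
    return (str.match(regex) || []).length;
}

console.log(countOccurrences(/x/g, '_x_x')); // 2

// No matches: match() returns null, the fallback [] yields 0
console.log('abc'.match(/x/g));             // null
console.log(countOccurrences(/x/g, 'abc')); // 0

// A lastIndex other than 0 does not affect the result,
// but it is not restored afterwards
var regex = /x/g;
regex.lastIndex = 2;
console.log(countOccurrences(regex, '_x_x')); // 2
console.log(regex.lastIndex);                 // 0
```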

Performance considerations

Juan Ignacio Dopazo compared the performance of the two implementations of counting occurrences and found that using test() is faster, presumably because it doesn’t collect the results in an array.

Acknowledgements

Mathias Bynens and Juan Ignacio Dopazo pointed me to match() and test(); Šime Vidas warned me to be careful with match() if there are no matches.

Reference

  1. JavaScript: an overview of the regular expression API
