New regular expression features in ECMAScript 6

[2015-07-29] esnext, dev, javascript

This blog post explains new regular expression features in ECMAScript 6. It helps if you are familiar with ES5 regular expression features and Unicode. If you aren’t, consult the chapters on those two topics in “Speaking JavaScript”.

Overview  

The following regular expression features are new in ECMAScript 6:

  • The new flag /y (sticky) anchors each match of a regular expression to the end of the previous match.

  • The new flag /u (unicode) handles surrogate pairs (such as \uD83D\uDE80) as code points and lets you use Unicode code point escapes (such as \u{1F680}) in regular expressions.

  • The new data property flags gives you access to the flags of a regular expression, just like source already gives you access to the pattern in ES5:

    > /abc/ig.source // ES5
    'abc'
    > /abc/ig.flags // ES6
    'gi'
    
  • You can use the constructor RegExp() to make a copy of a regular expression:

    > new RegExp(/abc/ig).flags
    'gi'
    > new RegExp(/abc/ig, 'i').flags // change flags
    'i'
    

New flag /y (sticky)  

The new flag /y changes two things while matching a regular expression re against a string:

  • Anchored to re.lastIndex: The match must start at re.lastIndex (the index after the previous match). This behavior is similar to the ^ anchor, but with that anchor, matches must always start at index 0.
  • Match repeatedly: If a match was found, re.lastIndex is set to the index after the match. This behavior is similar to the /g flag. Like /g, /y is normally used to match multiple times.
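
To make the comparison with the ^ anchor concrete, here is a minimal sketch (the variable names are only illustrative):

```javascript
// Without the flag /m, `^` always anchors to index 0; lastIndex plays no role
const caret = /^a/;
caret.lastIndex = 2;
const caretResult = caret.test('xxa'); // false: no 'a' at index 0

// /y anchors to lastIndex instead
const sticky = /a/y;
sticky.lastIndex = 2;
const stickyResult = sticky.test('xxa'); // true: 'a' at index 2
const stickyLastIndex = sticky.lastIndex; // 3: the index after the match

console.log(caretResult, stickyResult, stickyLastIndex); // false true 3
```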

The main use case for this matching behavior is tokenizing, where you want each match to immediately follow its predecessor. An example of tokenizing via a sticky regular expression and exec() is given later.

Let’s look at how various regular expression operations react to the /y flag. The following tables give an overview. I’ll provide more details afterwards.

Methods of regular expressions (re is the regular expression that a method is invoked on):

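Method | Flags | Start matching | Anchored to   | Result if match | No match | re.lastIndex
-------|-------|----------------|---------------|-----------------|----------|------------------
exec() | –     | 0              | –             | Match object    | null     | unchanged
       | /g    | re.lastIndex   | –             | Match object    | null     | index after match
       | /y    | re.lastIndex   | re.lastIndex  | Match object    | null     | index after match
       | /gy   | re.lastIndex   | re.lastIndex  | Match object    | null     | index after match
test() | (Any) | (like exec())  | (like exec()) | true            | false    | (like exec())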

Methods of strings (str is the string that a method is invoked on, re is the regular expression parameter):

Method    | Flags            | Start matching           | Anchored to       | Result if match                    | No match | re.lastIndex
----------|------------------|--------------------------|-------------------|------------------------------------|----------|------------------
search()  | –, /g            | 0                        | –                 | Index of match                     | -1       | unchanged
          | /y, /gy          | 0                        | 0                 | Index of match                     | -1       | unchanged
match()   | –                | 0                        | –                 | Match object                       | null     | unchanged
          | /y               | re.lastIndex             | re.lastIndex      | Match object                       | null     | index after match
          | /g               | After prev. match (loop) | –                 | Array with matches                 | null     | 0
          | /gy              | After prev. match (loop) | After prev. match | Array with matches                 | null     | 0
split()   | –, /g, /y, /gy   | After prev. match (loop) | –                 | Array with strings between matches | [str]    | unchanged
replace() | –                | 0                        | –                 | First match replaced               | No repl. | unchanged
          | /y               | re.lastIndex             | re.lastIndex      | First match replaced               | No repl. | index after match
          | /g               | After prev. match (loop) | –                 | All matches replaced               | No repl. | 0
          | /gy              | After prev. match (loop) | After prev. match | All matches replaced               | No repl. | 0

RegExp.prototype.exec(str)  

If /g is not set, matching always starts at the beginning, but skips ahead until a match is found. REGEX.lastIndex is not changed.

const REGEX = /a/;

REGEX.lastIndex = 7; // ignored
const match = REGEX.exec('xaxa');
console.log(match.index); // 1
console.log(REGEX.lastIndex); // 7 (unchanged)

If /g is set, matching starts at REGEX.lastIndex and skips ahead until a match is found. REGEX.lastIndex is set to the position after the match. That means that you receive all matches if you loop until exec() returns null.

const REGEX = /a/g;

REGEX.lastIndex = 2;
const match = REGEX.exec('xaxa');
console.log(match.index); // 3
console.log(REGEX.lastIndex); // 4 (updated)

// No match at index 4 or later
console.log(REGEX.exec('xaxa')); // null

If only /y is set, matching starts at REGEX.lastIndex and is anchored to that position (no skipping ahead until a match is found). REGEX.lastIndex is updated similarly to when /g is set.

const REGEX = /a/y;

// No match at index 2
REGEX.lastIndex = 2;
console.log(REGEX.exec('xaxa')); // null

// Match at index 3
REGEX.lastIndex = 3;
const match = REGEX.exec('xaxa');
console.log(match.index); // 3
console.log(REGEX.lastIndex); // 4

Setting both /y and /g is the same as only setting /y.
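
A small sketch that runs the same steps with /y and with /gy (the variable names are mine). It also shows a detail that the example above does not: a failed sticky match resets lastIndex to 0.

```javascript
const stickyOnly = /a/y;
const stickyGlobal = /a/gy;

// No match at index 2: both return null and reset lastIndex to 0
stickyOnly.lastIndex = 2;
stickyGlobal.lastIndex = 2;
const failY = stickyOnly.exec('xaxa');
const failGY = stickyGlobal.exec('xaxa');
console.log(failY, stickyOnly.lastIndex); // null 0
console.log(failGY, stickyGlobal.lastIndex); // null 0

// Match at index 3: both report index 3 and set lastIndex to 4
stickyOnly.lastIndex = 3;
stickyGlobal.lastIndex = 3;
const matchY = stickyOnly.exec('xaxa');
const matchGY = stickyGlobal.exec('xaxa');
console.log(matchY.index, stickyOnly.lastIndex); // 3 4
console.log(matchGY.index, stickyGlobal.lastIndex); // 3 4
```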

RegExp.prototype.test(str)  

test() works the same as exec(), but it returns true or false (instead of a match object or null) when matching succeeds or fails:

const REGEX = /a/y;

REGEX.lastIndex = 2;
console.log(REGEX.test('xaxa')); // false

REGEX.lastIndex = 3;
console.log(REGEX.test('xaxa')); // true
console.log(REGEX.lastIndex); // 4

String.prototype.search(regex)  

search() ignores the flag /g and lastIndex (which is not changed, either). Starting at the beginning of the string, it looks for the first match and returns its index (or -1 if there was no match):

const REGEX = /a/;

REGEX.lastIndex = 2; // ignored
console.log('xaxa'.search(REGEX)); // 1

If you set the flag /y, lastIndex is still ignored, but the regular expression is now anchored to index 0.

const REGEX = /a/y;

REGEX.lastIndex = 1; // ignored
console.log('xaxa'.search(REGEX)); // -1 (no match)

String.prototype.match(regex)  

match() has two modes:

  • If /g is not set, it works like exec().
  • If /g is set, it returns an Array with the string parts that matched, or null.

If the flag /g is not set, match() captures groups like exec():

{
    const REGEX = /a/;

    REGEX.lastIndex = 7; // ignored
    console.log('xaxa'.match(REGEX).index); // 1
    console.log(REGEX.lastIndex); // 7 (unchanged)
}
{
    const REGEX = /a/y;

    REGEX.lastIndex = 2;
    console.log('xaxa'.match(REGEX)); // null

    REGEX.lastIndex = 3;
    console.log('xaxa'.match(REGEX).index); // 3
    console.log(REGEX.lastIndex); // 4
}

If only the flag /g is set then match() returns all matching substrings in an Array (or null). Matching always starts at position 0.

const REGEX = /a|b/g;
REGEX.lastIndex = 7;
console.log('xaxb'.match(REGEX)); // ['a', 'b']
console.log(REGEX.lastIndex); // 0

If you additionally set the flag /y, then matching is still performed repeatedly, while anchoring the regular expression to the index after the previous match (or 0).

const REGEX = /a|b/gy;

REGEX.lastIndex = 0; // ignored
console.log('xab'.match(REGEX)); // null
REGEX.lastIndex = 1; // ignored
console.log('xab'.match(REGEX)); // null

console.log('ab'.match(REGEX)); // ['a', 'b']
console.log('axb'.match(REGEX)); // ['a']

String.prototype.split(separator, limit)  

The complete details of split() are explained in Speaking JavaScript.

For ES6, it is interesting to see how things change if you use the flag /y.

In the final ES6 specification, they don’t: split() works with a copy of the regular expression that is always sticky (the flag /y is added if it is missing) and advances through the string itself. Therefore, /y has no effect on the result:

> 'x##'.split(/#/y)
[ 'x', '', '' ]
> 'x##'.split(/#/)
[ 'x', '', '' ]
> '##x'.split(/#/y)
[ '', '', 'x' ]
> '#x#'.split(/#/y)
[ '', 'x', '' ]

Because split() works with a copy, the property lastIndex of the original regular expression is neither used nor changed.

As usual, you can use groups to put parts of the separators into the result array:

> '##'.split(/(#)/y)
[ '', '#', '', '#', '' ]

String.prototype.replace(search, replacement)  

Without the flag /g, replace() only replaces the first match:

const REGEX = /a/;

// One match
console.log('xaxa'.replace(REGEX, '-')); // 'x-xa'

If only /y is set, you also get at most one match, but that match is anchored to lastIndex (as with exec()). After a successful match, lastIndex is set to the index after the match; after a failed match, it is reset to 0.

const REGEX = /a/y;

// Anchored to lastIndex (0), no match
console.log('xaxa'.replace(REGEX, '-')); // 'xaxa'
console.log(REGEX.lastIndex); // 0

// Anchored to lastIndex (1), one match
REGEX.lastIndex = 1;
console.log('xaxa'.replace(REGEX, '-')); // 'x-xa'
console.log(REGEX.lastIndex); // 2 (updated)

With /g set, replace() replaces all matches:

const REGEX = /a/g;

// Multiple matches
console.log('xaxa'.replace(REGEX, '-')); // 'x-x-'

With /gy set, replace() replaces all matches, but each match is anchored to the end of the previous match:

const REGEX = /a/gy;

// Multiple matches
console.log('aaxa'.replace(REGEX, '-')); // '--xa'

The parameter replacement can also be a function; consult “Speaking JavaScript” for details.
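
As a rough illustration of a replacement function combined with /gy (the regular expression and the variable names are made up for this sketch): the function is called once per match and receives the matched text, the captured groups and the index of the match.

```javascript
const DIGIT = /([0-9])/gy;
const positions = [];

// Sticky matching stops at 'a', so the digit after it is not replaced
const doubled = '12a3'.replace(DIGIT, (all, digit, index) => {
    positions.push(index);
    return String(Number(digit) * 2);
});

console.log(doubled); // '24a3'
console.log(positions); // [ 0, 1 ]
```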

Example: using sticky matching for tokenizing  

The main use case for sticky matching is tokenizing, turning a text into a sequence of tokens. One important trait about tokenizing is that tokens are fragments of the text and that there must be no gaps between them. Therefore, sticky matching is perfect here.

function tokenize(TOKEN_REGEX, str) {
    let result = [];
    let match;
    while (match = TOKEN_REGEX.exec(str)) {
        result.push(match[1]);
    }
    return result;
}

const TOKEN_GY = /\s*(\+|[0-9]+)\s*/gy;
const TOKEN_G  = /\s*(\+|[0-9]+)\s*/g;

In a legal sequence of tokens, sticky matching and non-sticky matching produce the same output:

> tokenize(TOKEN_GY, '3 + 4')
[ '3', '+', '4' ]
> tokenize(TOKEN_G, '3 + 4')
[ '3', '+', '4' ]

If, however, there is non-token text in the string then sticky matching stops tokenizing, while non-sticky matching skips the non-token text:

> tokenize(TOKEN_GY, '3x + 4')
[ '3' ]
> tokenize(TOKEN_G, '3x + 4')
[ '3', '+', '4' ]

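The behavior of sticky matching during tokenizing helps with error handling: tokenizing stops at the first piece of non-token text, so illegal input can be detected instead of being silently skipped.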
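
One way to turn that into an actual error is sketched below. tokenizeStrict() is a hypothetical variant of tokenize() (it is not part of any API); it remembers the index after the most recent match and complains if the input was not consumed completely.

```javascript
// Hypothetical helper: like tokenize(), but throws on non-token text.
// Assumes that tokenRegex has the flag /y and never matches the empty string.
function tokenizeStrict(tokenRegex, str) {
    const result = [];
    tokenRegex.lastIndex = 0;
    let pos = 0; // index after the most recent match
    let match;
    while (pos < str.length && (match = tokenRegex.exec(str)) !== null) {
        result.push(match[1]);
        pos = tokenRegex.lastIndex;
    }
    if (pos < str.length) {
        throw new SyntaxError('Unexpected text at index ' + pos);
    }
    return result;
}

const TOKEN_Y = /\s*(\+|[0-9]+)\s*/y;

console.log(tokenizeStrict(TOKEN_Y, '3 + 4')); // [ '3', '+', '4' ]
try {
    tokenizeStrict(TOKEN_Y, '3x + 4');
} catch (e) {
    console.log(e.message); // 'Unexpected text at index 1'
}
```

The separate variable pos is needed because a failed sticky match resets lastIndex to 0.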

Example: manually implementing sticky matching  

If you wanted to manually implement sticky matching, you could do it roughly as follows. The function execSticky() works like RegExp.prototype.exec() in sticky mode.

function execSticky(regex, str) {
    // Anchor the regex to the beginning of the string. The non-capturing
    // group makes `^` apply to the whole pattern (think /a|b/).
    // Caveat: with the flag /m, `^` also matches at line starts,
    // so this sketch is only accurate without /m.
    const matchSource = '^(?:' + regex.source + ')';
    // Ensure that instance property `lastIndex` is updated
    let matchFlags = regex.flags; // ES6 feature!
    if (!regex.global) {
        matchFlags = matchFlags + 'g';
    }
    const matchRegex = new RegExp(matchSource, matchFlags);

    // Ensure we start matching `str` at `regex.lastIndex`
    const matchOffset = regex.lastIndex;
    const matchStr = str.slice(matchOffset);
    const match = matchRegex.exec(matchStr);

    // Like exec() in sticky mode: reset `lastIndex` if there is no match
    if (match === null) {
        regex.lastIndex = 0;
        return null;
    }

    // Translate indices from `matchStr` to `str`
    regex.lastIndex = matchRegex.lastIndex + matchOffset;
    match.index = match.index + matchOffset;
    match.input = str;
    return match;
}
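
A different way to get the same effect, sketched here under the hypothetical name execStickyViaGlobal(), is to search with /g starting at lastIndex and to reject any match that does not begin exactly there. It avoids slicing the string, but it may scan ahead needlessly before giving up.

```javascript
// Hypothetical alternative: emulate /y via /g plus an index check
function execStickyViaGlobal(regex, str) {
    const flags = regex.global ? regex.flags : regex.flags + 'g';
    const globalRegex = new RegExp(regex.source, flags);

    const start = regex.lastIndex;
    globalRegex.lastIndex = start;
    const match = globalRegex.exec(str);

    // A sticky match must begin exactly at `start`
    if (match === null || match.index !== start) {
        regex.lastIndex = 0; // like a failed sticky match
        return null;
    }
    regex.lastIndex = globalRegex.lastIndex;
    return match;
}

const plain = /a/;

plain.lastIndex = 2;
console.log(execStickyViaGlobal(plain, 'xaxa')); // null

plain.lastIndex = 3;
const emulated = execStickyViaGlobal(plain, 'xaxa');
console.log(emulated.index); // 3
console.log(plain.lastIndex); // 4
```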

New flag /u (unicode)  

The flag /u switches on a special Unicode mode for a regular expression. That mode has two features:

  1. You can use Unicode code point escape sequences such as \u{1F42A} for specifying characters via code points. Normal Unicode escapes such as \u03B1 only have a range of four hexadecimal digits (which covers the Basic Multilingual Plane).

  2. “Characters” in the regular expression pattern and the string are code points (not UTF-16 code units). Code units are converted into code points.

A later section has more information on escape sequences. I’ll explain the consequences of feature 2 next. Instead of Unicode code point escapes (e.g., \u{1F680}), I’m using two UTF-16 code units (e.g., \uD83D\uDE80). That makes it clear that surrogate pairs are grouped in Unicode mode, and it works in both Unicode mode and non-Unicode mode.

> '\u{1F680}' === '\uD83D\uDE80' // code point vs. surrogate pairs
true

Consequence: lone surrogates in the regular expression only match lone surrogates  

In non-Unicode mode, a lone surrogate in a regular expression is even found inside (surrogate pairs encoding) code points:

> /\uD83D/.test('\uD83D\uDC2A')
true

In Unicode mode, surrogate pairs become atomic units and lone surrogates are not found “inside” them:

> /\uD83D/u.test('\uD83D\uDC2A')
false

Actual lone surrogates are still found:

> /\uD83D/u.test('\uD83D \uD83D\uDC2A')
true
> /\uD83D/u.test('\uD83D\uDC2A \uD83D')
true

Consequence: you can put code points in character classes  

In Unicode mode, you can put code points into character classes and they won’t be interpreted as two characters anymore.

> /^[\uD83D\uDC2A]$/u.test('\uD83D\uDC2A')
true
> /^[\uD83D\uDC2A]$/.test('\uD83D\uDC2A')
false

> /^[\uD83D\uDC2A]$/u.test('\uD83D')
false
> /^[\uD83D\uDC2A]$/.test('\uD83D')
true
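
The same holds for ranges inside character classes. The following sketch uses the Unicode block “Emoticons” (U+1F600 to U+1F64F) as an arbitrary example. Without /u, the pattern is read as code units, the range runs from \uDE00 to \uD83D and the engine rejects it.

```javascript
// With /u: a range of code points
const EMOTICON = /^[\u{1F600}-\u{1F64F}]$/u;
const inRange = EMOTICON.test('\u{1F600}');
const outOfRange = EMOTICON.test('\u{1F680}'); // rocket, outside the block
console.log(inRange, outOfRange); // true false

// Without /u: the same range, written as surrogate pairs, is out of order
let rangeError = null;
try {
    new RegExp('[\\uD83D\\uDE00-\\uD83D\\uDE4F]');
} catch (e) {
    rangeError = e;
}
console.log(rangeError instanceof SyntaxError); // true
```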

Consequence: the dot operator (.) matches code points, not code units  

In Unicode mode, the dot operator matches code points (one or two code units). In non-Unicode mode, it matches single code units. For example:

> '\uD83D\uDE80'.match(/./gu).length
1
> '\uD83D\uDE80'.match(/./g).length
2

Consequence: quantifiers apply to code points, not code units  

In Unicode mode, quantifiers apply to code points (one or two code units). In non-Unicode mode, they apply to single code units. For example:

> /\uD83D\uDE80{2}/u.test('\uD83D\uDE80\uD83D\uDE80')
true

> /\uD83D\uDE80{2}/.test('\uD83D\uDE80\uD83D\uDE80')
false
> /\uD83D\uDE80{2}/.test('\uD83D\uDE80\uDE80')
true

New data property flags  

In ECMAScript 6, regular expressions have the following data properties:

  • The pattern: source
  • The flags: flags
  • Individual flags: global, ignoreCase, multiline, sticky, unicode
  • Other: lastIndex

As an aside, lastIndex is the only instance property now; all other data properties are implemented via internal instance properties and getters such as get RegExp.prototype.global.
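
You can check this yourself; the following lines are a quick sketch of such a check:

```javascript
const regex = /abc/gi;

// The only own property of a regular expression
const ownNames = Object.getOwnPropertyNames(regex);
console.log(ownNames); // [ 'lastIndex' ]

// `global` is a getter in RegExp.prototype
const desc = Object.getOwnPropertyDescriptor(RegExp.prototype, 'global');
console.log(typeof desc.get); // 'function'
console.log(regex.global); // true
```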

The property source (which already existed in ES5) contains the regular expression pattern as a string:

> /abc/ig.source
'abc'

The property flags is new. It contains the flags as a string, with one character per flag (always in alphabetical order, regardless of how the flags were written):

> /abc/ig.flags
'gi'

You can’t change the flags of an existing regular expression (ignoreCase etc. have always been immutable), but flags allows you to make a copy where the flags are changed:

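function copyWithIgnoreCase(regex) {
    // Adding 'i' a second time would throw a SyntaxError
    const flags = regex.ignoreCase ? regex.flags : regex.flags + 'i';
    return new RegExp(regex.source, flags);
}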

The next section explains another way to make modified copies of regular expressions.

RegExp() can be used as a copy constructor  

In ES6 there are two variants of the constructor RegExp() (the second one is new in that ES5 threw a TypeError if you passed a regular expression together with flags):

  • new RegExp(pattern : string, flags = '')
    A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.

  • new RegExp(regex : RegExp, flags = regex.flags)
    regex is cloned. If flags is provided then it determines the flags of the copy.

The following interaction demonstrates the latter variant:

> new RegExp(/abc/ig).flags
'gi'
> new RegExp(/abc/ig, 'i').flags // change flags
'i'

Therefore, the RegExp constructor gives us another way to change flags:

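function copyWithIgnoreCase(regex) {
    // Adding 'i' a second time would throw a SyntaxError
    const flags = regex.ignoreCase ? regex.flags : regex.flags + 'i';
    return new RegExp(regex, flags);
}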
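
A copy made via new RegExp() is a separate object with its own matching state. A small sketch (the variable names are only illustrative):

```javascript
const original = /a/g;
original.exec('aa'); // advances original.lastIndex to 1

const copy = new RegExp(original);
console.log(copy === original); // false
console.log(copy.source, copy.flags); // a g
console.log(original.lastIndex); // 1
console.log(copy.lastIndex); // 0 (lastIndex is not copied)
```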

Escape sequences in JavaScript  

There are three parameterized escape sequences for representing characters in JavaScript:

  • Hex escape (exactly two hexadecimal digits): \xHH

    > '\x7A' === 'z'
    true
    
  • Unicode escape (exactly four hexadecimal digits): \uHHHH

    > '\u007A' === 'z'
    true
    
  • Unicode code point escape (1 or more hexadecimal digits): \u{···}

    > '\u{7A}' === 'z'
    true
    

Unicode code point escapes are new in ES6.

The escape sequences can be used in the following locations:

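Location                    | \uHHHH | \u{···}           | \xHH
----------------------------|--------|-------------------|-----
Identifiers                 | yes    | yes               | no
String literals             | yes    | yes               | yes
Template literals           | yes    | yes               | yes
Regular expression literals | yes    | Only with flag /u | yes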

Identifiers:

  • A 4-digit Unicode escape \uHHHH becomes a single code point.
  • A Unicode code point escape \u{···} becomes a single code point.

> let hello = 123;
> hell\u{6F}
123

String literals:

  • Strings are internally stored as UTF-16 code units.
  • A hex escape \xHH contributes a UTF-16 code unit.
  • A 4-digit Unicode escape \uHHHH contributes a UTF-16 code unit.
  • A Unicode code point escape \u{···} contributes the UTF-16 encoding of its code point (one or two UTF-16 code units).
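
The number of code units that each kind of escape contributes can be observed via length. A short sketch:

```javascript
const lengths = [
    '\x7A'.length,      // hex escape: 1 code unit
    '\u007A'.length,    // 4-digit Unicode escape: 1 code unit
    '\u{7A}'.length,    // code point escape, BMP code point: 1 code unit
    '\u{1F680}'.length, // code point escape, astral code point: 2 code units
];
console.log(lengths); // [ 1, 1, 1, 2 ]

// The two code units of U+1F680 are a surrogate pair
const rocket = '\u{1F680}';
console.log(rocket.charCodeAt(0).toString(16)); // 'd83d'
console.log(rocket.charCodeAt(1).toString(16)); // 'de80'
console.log(rocket.codePointAt(0).toString(16)); // '1f680'
```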

Template literals:

  • In template literals, escape sequences are handled like in string literals.
  • In tagged templates, how escape sequences are interpreted depends on the tag function. It can choose between two interpretations:
    • Cooked: escape sequences are handled like in string literals.
    • Raw: escape sequences are handled as a sequence of characters.

> `hell\u{6F}` // cooked
'hello'
> String.raw`hell\u{6F}` // raw
'hell\\u{6F}'
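
A tag function has access to both interpretations at the same time. The following sketch uses a made-up tag function, cookedAndRaw(), that simply returns them:

```javascript
// Hypothetical tag function: the first parameter holds the cooked template
// strings, its property `raw` holds the raw ones
function cookedAndRaw(templateStrings) {
    return {
        cooked: templateStrings[0],
        raw: templateStrings.raw[0],
    };
}

const both = cookedAndRaw`hell\u{6F}`;
console.log(both.cooked); // hello
console.log(both.raw); // hell\u{6F}
console.log(both.raw.length); // 10
```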

Regular expressions:

  • Unicode code point escapes are only allowed if the flag /u is set, because otherwise \u{3} is interpreted as three times the character u:

    > /^\u{3}$/.test('uuu')
    true
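
For comparison, a sketch of the same kind of pattern with and without the flag /u:

```javascript
// Without /u: \u is an identity escape for u, {3} is a quantifier
const threeUs = /^\u{3}$/.test('uuu');

// With /u: \u{3} is the code point U+0003
const threeUsUnicode = /^\u{3}$/u.test('uuu');
const controlChar = /^\u{3}$/u.test('\x03');

// With /u: code point escapes work for BMP and astral code points
const letterZ = /^\u{7A}$/u.test('z');
const rocketMatch = /^\u{1F680}$/u.test('\uD83D\uDE80');

console.log(threeUs, threeUsUnicode, controlChar, letterZ, rocketMatch);
// true false true true true
```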
    

Escape sequences in the ES6 spec  

Various information:

  • The spec treats source code as a sequence of Unicode code points: “Source Text”

  • Unicode escape sequences in identifiers: “Names and Keywords”

  • Strings are internally stored as sequences of UTF-16 code units: “String Literals”

  • Strings – how various escape sequences are translated to UTF-16 code units: “Static Semantics: SV”

  • Template literals – how various escape sequences are translated to UTF-16 code units: “Static Semantics: TV and TRV”

Regular expressions  

The spec distinguishes between BMP patterns (flag /u not set) and Unicode patterns (flag /u set). Sect. “Pattern Semantics” explains that they are handled differently and how.

As a reminder, here is how grammar rules are parameterized in the spec:

  • If a grammar rule R has the subscript [U] then that means there are two versions of it: R and R_U.
  • Parts of the rule can pass on the subscript via [?U].
  • If a part of a rule has the prefix [+U] it only exists if the subscript [U] is present.
  • If a part of a rule has the prefix [~U] it only exists if the subscript [U] is not present.

You can see this parameterization in action in Sect. “Patterns”, where the subscript [U] creates separate grammars for BMP patterns and Unicode patterns:

  • IdentityEscape: In BMP patterns, many characters can be prefixed with a backslash and are interpreted as themselves (for example: if \u is not followed by four hexadecimal digits, it is interpreted as u). In Unicode patterns that only works for the following characters (which frees up \u for Unicode code point escapes): ^ $ \ . * + ? ( ) [ ] { } | /

  • RegExpUnicodeEscapeSequence: "\u{" HexDigits "}" is only allowed in Unicode patterns. In those patterns, lead and trail surrogates are also grouped to help with UTF-16 decoding.

Sect. “CharacterEscape” explains how various escape sequences are translated to characters (roughly: either code units or code points).

String methods using regular expressions delegate to regular expression methods  

The following string methods now delegate their work to regular expression methods:

  • String.prototype.match calls RegExp.prototype[Symbol.match].
  • String.prototype.replace calls RegExp.prototype[Symbol.replace].
  • String.prototype.search calls RegExp.prototype[Symbol.search].
  • String.prototype.split calls RegExp.prototype[Symbol.split].
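
One consequence, as far as I read the ES6 algorithms: the string methods only look up the method with the well-known symbol as key, so the argument does not have to be a regular expression. The object firstDigit below is made up for this sketch:

```javascript
// Hypothetical searcher object: finds the first decimal digit
const firstDigit = {
    [Symbol.search](str) {
        for (let i = 0; i < str.length; i++) {
            if (str[i] >= '0' && str[i] <= '9') {
                return i;
            }
        }
        return -1;
    }
};

const viaObject = 'abc7'.search(firstDigit); // calls firstDigit[Symbol.search]('abc7')
const viaRegExp = 'abc7'.search(/[0-9]/);
const direct = RegExp.prototype[Symbol.search].call(/[0-9]/, 'abc7');

console.log(viaObject, viaRegExp, direct); // 3 3 3
```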

Support in engines and transpilers  

As usual, consult the compatibility table by kangax to find out what is supported where.

Further reading  

If you want to know in more detail how the regular expression flag /u works, I recommend the article “Unicode-aware regular expressions in ECMAScript 6” by Mathias Bynens.