2011-04-01

JavaScript: an overview of the regular expression API

Updates: This post gives an overview of the JavaScript API for regular expressions. It does not, however, go into details about regular expression syntax, so you should already be familiar with it.

Regular expression syntax

Listed below are constructs that are hard to remember (not listed are things like * for repetition, capturing groups, etc.).
  • Escaping: the backslash escapes special characters, including the slash in regular expression literals (see below) and the backslash itself.
    • If you specify a regular expression in a string you must escape twice: once for the string literal, once for the regular expression. For example, to just match a backslash, the string literal becomes "\\\\".
    • The backslash is also used for some special matching operators (see below).
  • Non-capturing group: (?:x) works like a capturing group for delineating the subexpression x, but does not return matches and thus does not have a group number.
  • Positive look-ahead: x(?=y) means that x matches only if it is followed by y. y itself is not counted as part of the regular expression.
  • Negative look-ahead: x(?!y) the negated version of the previous construct: x must not be followed by y.
  • Repetitions: {n} matches exactly n times, {n,} matches at least n times, {n,m} matches at least n, at most m times.
  • Control characters: \cX matches Ctrl-X (for any control character X), \n matches a linefeed, \r matches a carriage return.
  • Back reference: \n refers back to group n and matches its contents again.
Examples:
    > /(a+)b\1/.test("aaba")
    true
    > /^(a+)b\1/.test("aaba")
    false
    > var tagName = /<([^>]+)>[^<]*<\/\1>/;
    > tagName.exec("<b>bold</b>")[1]
    'b'
    > tagName.exec("<strong>text</strong>")[1]
    'strong'
    > tagName.exec("<strong>text</stron>")
    null

Creating a regular expression

There are two ways to create a regular expression.
Regular expression literal: var regex = /xyz/; (compiled at load time)
Regular expression object: var regex = new RegExp("xzy"); (compiled at runtime)
Flags modify matching behavior.
g global The given regular expression is matched multiple times.
i ignoreCase Case is ignored when trying to match the given regular expression.
m multiline In multiline mode, the begin and end operators ^ and $ work for each line, instead of for the complete input string.
Examples:
    > /abc/.test("ABC")
    false
    > /abc/i.test("ABC")
    true
Regular expressions have the following properties.
  • Flags: boolean values indicating what flags are set.
    • global: is flag g set?
    • ignoreCase: is flag i set?
    • multiline: is flag m set?
  • If flag g is set:
    • lastIndex: the index where to continue matching next time.

RegExp.prototype.test(): determining whether there is a match

The following method returns a boolean indicating whether the match succeeded.
    regex.test(str)
Examples:
    > var regex = /^(a+)b\1$/;
    > regex.test("aabaa")
    true
    > regex.test("aaba")
    false
If the flag g is set then test() returns true as often as there are matches in the string.
    > var regex = /b/g;
    > var str = 'abba';

    > regex.test(str)
    true
    > regex.test(str)
    true
    > regex.test(str)
    false

String.prototype.search(): finding the index of a match

The following method returns the index where a match was found and -1 otherwise.
    str.search(regex)
search() completely ignores the flag g. Examples:
    > 'abba'.search(/b/)
    1
    > 'abba'.search(/x/)
    -1

RegExp.prototype.exec(): capture groups, optionally repeatedly

    var matchData = regex.exec(str);
matchData is null if there wasn’t a match. Otherwise, it is an array with two additional properties.
  • Properties:
    • input: The complete input string.
    • index: The index where the match was found.
  • Array: whose length is the number of capturing groups plus one.
    • 0: The match for the complete regular expression (group 0, if you will).
    • n ≥ 1: The capture of group n.
Invoke once: Flag global is not set.
    > var regex = /a(b+)a/;
    > regex.exec("_abbba_aba_")
    [ 'abbba'
    , 'bbb'
    , index: 1
    , input: '_abbba_aba_'
    ]
    > regex.lastIndex
    0
Invoke repeatedly: Flag global is set.
    > var regex = /a(b+)a/g;
    > regex.exec("_abbba_aba_")
    [ 'abbba'
    , 'bbb'
    , index: 1
    , input: '_abbba_aba_'
    ]
    > regex.lastIndex
    6
    > regex.exec()
    [ 'aba'
    , 'b'
    , index: 7
    , input: '_abbba_aba_'
    ]
    > regex.exec()
    null
Loop over matches.
    var regex = /a(b+)a/g;
    var str = "_abbba_aba_";
    while(true) {
        var match = regex.exec(str);
        if (!match) break;
        console.log(match[1]);
    }
Output:
    bbb
    b

String.prototype.match(): capture groups or all matches

    var matchData = str.match(regex);
If the flag g of regex is not set, this method works like RegExp.prototype.exec(). If the flag is set then it returns an array with all matching substrings in str (i.e., group 0 of every match) or null if there is no match.
    > 'abba'.match(/a/)
    [ 'a', index: 0, input: 'abba' ]
    > 'abba'.match(/a/g)
    [ 'a', 'a' ]
    > 'abba'.match(/x/g)
    null

String.prototype.replace(): search and replace

Invocation:
    str.replace(search, replacement)
Parameters:
  • search:
    • either a string (to be found literally, has no groups)
    • or a regular expression.
  • replacement:
    • either a string describing how to replace what has been found
    • or a function that computes a replacement, given matching information.
Replacement is a string. The dollar sign $ is used to indicate special replacement directives:
  • $$ inserts a dollar sign $.
  • $& inserts the complete match.
  • $` inserts the text before the match.
  • $' inserts the text after the match.
  • $n inserts group n from the match. n must be at least 1, $0 has no special meaning.
Examples:
    > "a1b_c1d".replace("1", "[$`-$&-$']")
    'a[a-1-b_c1d]b_c1d'
    > "a1b_c1d".replace(/1/, "[$`-$&-$']")
    'a[a-1-b_c1d]b_c1d'
    > "a1b_c1d".replace(/1/g, "[$`-$&-$']")
    'a[a-1-b_c1d]b_c[a1b_c-1-d]d'
Replacement is a function. The replacement function has the following signature.
function(completeMatch, group_1, ..., group_n, offset, inputStr) { ... }
completeMatch is the same as $& above, offset indicates where the match was found, and inputStr is what is being matched against. Thus, the special variable arguments inside the function starts with the same data as the result of the exec() method.

Example:

    > "I bought 3 apples and 5 oranges".replace(
          /[0-9]+/g,
          function(match) { return 2 * match; })
    'I bought 6 apples and 10 oranges'

String.prototype.split(): splitting strings

In a string, find the substrings between separators and return them in an array. Signature:
    str.split(separator, limit?)
Parameters:
  • separator can be
    • a string: separators are matched verbatim
    • a regular expression: for more flexible separator matching. Many JavaScript implementations include the first capturing group in the result array, if there is one.
  • limit optionally specifies a maximum length for the returned array. A value less than 0 allows arbitrary lengths.
Examples:
    > "aaa*a*".split("a*")
    [ 'aa', '', '' ]
    > "aaa*a*".split(/a*/)
    [ '', '*', '*' ]
    > "aaa*a*".split(/(a*)/)
    [ '', 'aaa', '*', 'a', '*' ]

Sources

  • ECMAScript Language Specification, 5th edition.
  • Regular Expressions at the Mozilla Developer Network Doc Center

No comments: