2013-09-24

Unicode and JavaScript

Update 2013-09-29: New sections 4.1 (“Matching any code unit”) and 4.2 (“Libraries”).

This blog post is a brief introduction to Unicode and how it is handled in JavaScript.

Unicode

History

Unicode was started in 1987, by Joe Becker (Xerox), Lee Collins (Apple) and Mark Davis (Apple). The idea was to create a universal character set, as there were many incompatible standards for encoding plain text at that time: numerous variations of 8 bit ASCII, Big Five (Traditional Chinese), GB 2312 (Simplified Chinese), etc. Before Unicode, no standard for multi-lingual plain text existed, but there were rich text systems (such as Apple’s WorldScript) that allowed one to combine multiple encodings.

The first Unicode draft proposal was published in 1988. Work continued afterwards and the working group expanded. The Unicode Consortium was incorporated on January 3, 1991:

The Unicode Consortium is a non-profit corporation devoted to developing, maintaining, and promoting software internationalization standards and data, particularly the Unicode Standard [...]
The first volume of the Unicode 1.0 standard was published in October 1991, the second one in June 1992.

Important Unicode concepts

The idea of a character may seem a simple one, but there are many aspects to it. That’s why Unicode is such a complex standard. The following are important basic concepts:
  • Characters and graphemes: Both terms mean something quite similar. Characters are digital entities, graphemes are atomic units of written languages (alphabetic letters, typographic ligatures, etc.). Sometimes, several characters are used to display a single grapheme.
  • Glyph: A concrete way of writing a grapheme. Sometimes the same grapheme is written differently, depending on its context or other factors. For example, the graphemes f and i can be displayed as a glyph f and a glyph i, connected by a ligature glyph. Or without a ligature.
  • Code points: Unicode maps the characters it supports to numbers called code points.
  • Code units: To store or transmit code points, they are encoded as code units, pieces of data with a fixed length. The length is measured in bits and determined by an encoding scheme, of which Unicode has several: UTF-8, UTF-16, etc. The number in the name indicates the length of the code units, in bits. If a code point is too large to fit into a single code unit, it must be broken up into multiple units. That is, the number of code units needed to represent a single code point can vary.
  • BOM (byte order mark): If a code unit is larger than a single byte, byte ordering matters. The BOM is a single pseudo-character (possibly encoded as multiple code units) at the beginning of a text that indicates whether the code units are big endian (most significant bytes come first) or little endian (least significant bytes come first). The default, for texts without a BOM, is big endian. The BOM also indicates the encoding that is used; it is different for UTF-8, UTF-16, etc. It additionally serves as a marker for Unicode if web browsers have no other information about the encoding of a text. However, the BOM is not used very often, for several reasons:
    • UTF-8 is by far the most popular Unicode encoding and does not need a BOM, because there is only one way of ordering bytes.
    • Several character encodings include byte ordering. Then a BOM must not be used. Examples: UTF-16BE (UTF-16 big endian), UTF-16LE, UTF-32BE, UTF-32LE. This is a safer way of handling byte ordering, because there is no danger of mixing up meta-data and data.
  • Normalization: Sometimes the same grapheme can be represented in several ways. For example, the grapheme “ö” can be represented as a single code point or as an “o” followed by a combining character “¨” (diaeresis, double dot). Normalization is about translating a text to a canonical representation; equivalent code points and sequences of code points are all translated to the same code point (or sequence of code points). That is useful for text processing, e.g. to search for text. Unicode specifies several normalizations.
  • Character properties: Each Unicode character is assigned several properties by the specification:
    • Name: an English name, composed of uppercase letters A-Z, digits 0-9, hyphen - and <space>. Two examples:
      • “λ” has the name “GREEK SMALL LETTER LAMBDA”
      • “!” has the name “EXCLAMATION MARK”
    • General category: Partitions characters into categories such as letter, uppercase letter, number, punctuation, etc.
    • Age: With what version of Unicode was the character introduced (1.0, 1.1, 2.0, etc.)?
    • Deprecated: Is the use of the character discouraged?
    • And many more.
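
To make the BOM item above more concrete, here is a minimal sketch of BOM detection. The function name detectBOM and the idea of passing the text as an array of byte values are assumptions made for illustration. The byte sequences are the encodings of the BOM code point U+FEFF in the respective formats.

```javascript
// Hypothetical helper: guess the encoding from a leading BOM.
// `bytes` is assumed to be an array (or typed array) of byte values.
function detectBOM(bytes) {
    function startsWith(prefix) {
        if (bytes.length < prefix.length) return false;
        for (var i = 0; i < prefix.length; i++) {
            if (bytes[i] !== prefix[i]) return false;
        }
        return true;
    }
    // UTF-32 is checked before UTF-16, because FF FE 00 00 also
    // starts with the UTF-16LE BOM FF FE. The sketch resolves that
    // ambiguity in favor of UTF-32LE.
    if (startsWith([0x00, 0x00, 0xFE, 0xFF])) return 'UTF-32BE';
    if (startsWith([0xFF, 0xFE, 0x00, 0x00])) return 'UTF-32LE';
    if (startsWith([0xEF, 0xBB, 0xBF])) return 'UTF-8';
    if (startsWith([0xFE, 0xFF])) return 'UTF-16BE';
    if (startsWith([0xFF, 0xFE])) return 'UTF-16LE';
    return null; // no BOM found
}
```

For example, detectBOM([0xEF, 0xBB, 0xBF, 0x41]) returns 'UTF-8', while a text without a BOM yields null.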

Code points

The range of the code points was initially 16 bits. With Unicode version 2.0 (July 1996), it was expanded: it is now divided into 17 planes, numbered from 0 to 16. Each plane comprises 16 bits (in hexadecimal notation: 0x0000–0xFFFF). Thus, in the hexadecimal ranges shown below, digits beyond the four bottom ones contain the number of the plane.
  • Plane 0: Basic Multilingual Plane (BMP): 0x0000–0xFFFF
  • Plane 1: Supplementary Multilingual Plane (SMP): 0x10000–0x1FFFF
  • Plane 2: Supplementary Ideographic Plane (SIP): 0x20000–0x2FFFF
  • Planes 3–13: Unassigned
  • Plane 14: Supplementary Special-Purpose Plane (SSP): 0xE0000–0xEFFFF
  • Planes 15–16: Supplementary Private Use Area (S PUA A/B): 0xF0000–0x10FFFF
Planes 1–16 are called supplementary planes or astral planes.
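
Because each plane spans 16 bits, the plane number is simply what remains after shifting away the bottom 16 bits. The following is a small sketch of that arithmetic; the names planeOf and isAstral are made up for this example.

```javascript
// Hypothetical helpers illustrating the plane layout described above.
function planeOf(codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF) {
        throw new RangeError('Not a Unicode code point: ' + codePoint);
    }
    // The bits above the bottom 16 hold the plane number (0 to 16)
    return codePoint >> 16;
}

function isAstral(codePoint) {
    // Every plane other than the BMP (plane 0) is an astral plane
    return planeOf(codePoint) > 0;
}
```

For example, planeOf(0x03BB) is 0 (the BMP) and planeOf(0x1F404) is 1 (the SMP).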

Unicode encodings

UTF-32 (Unicode Transformation Format 32) is a format with 32 bit code units. Any code point can be encoded by a single code unit, making this the only fixed-length encoding. For other encodings, the number of units needed to encode a point varies.

UTF-16 is a format with 16 bit code units that needs one to two units to represent a code point. BMP code points can be represented by single code units. Higher code points are 20 bit, after subtracting 0x10000 (the range of the BMP). These bits are encoded as two code units:

  • Lead surrogate – most significant 10 bits: stored in the range 0xD800–0xDBFF (4 × 256 = 1024 values, enough to hold 10 bits).
  • Tail surrogate – least significant 10 bits: stored in the range 0xDC00–0xDFFF (again 4 × 256 = 1024 values, enough to hold 10 bits).
To enable this encoding scheme, the BMP has a hole with unused code points whose range is 0xD800–0xDFFF. Therefore the ranges of lead surrogates, tail surrogates and BMP code points are disjoint, making decoding robust in the face of errors. The following function encodes a code point as UTF-16. An example of using it is given later.
    function toUTF16(codePoint) {
        var TEN_BITS = parseInt('1111111111', 2);
        function u(codeUnit) {
            // \uHHHH needs exactly four hex digits, so pad with zeros
            var hex = codeUnit.toString(16).toUpperCase();
            return '\\u' + ('0000' + hex).slice(-4);
        }

        if (codePoint <= 0xFFFF) {
            return u(codePoint);
        }
        codePoint -= 0x10000;
        
        // Shift right to get to most significant 10 bits
        var leadSurrogate = 0xD800 + (codePoint >> 10);

        // Mask to get least significant 10 bits
        var tailSurrogate = 0xDC00 + (codePoint & TEN_BITS);

        return u(leadSurrogate) + u(tailSurrogate);
    }
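
Decoding reverses these steps. The following sketch is not part of any library, and the name fromSurrogatePair is hypothetical. It assumes it is handed a lead and a tail surrogate as numbers and reassembles the code point.

```javascript
// Hypothetical counterpart to the encoding step: surrogate pair -> code point
function fromSurrogatePair(lead, tail) {
    if (lead < 0xD800 || lead > 0xDBFF) {
        throw new RangeError('Not a lead surrogate: ' + lead);
    }
    if (tail < 0xDC00 || tail > 0xDFFF) {
        throw new RangeError('Not a tail surrogate: ' + tail);
    }
    // Most significant 10 bits come from the lead surrogate,
    // least significant 10 bits from the tail surrogate
    var twentyBits = ((lead - 0xD800) << 10) + (tail - 0xDC00);

    // Undo the subtraction of 0x10000 that was performed while encoding
    return twentyBits + 0x10000;
}
```

For example, fromSurrogatePair(0xD83D, 0xDC04) returns 0x1F404.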

UCS-2, a deprecated format, uses 16 bit code units to represent (only!) the code points of the BMP. When the range of Unicode code points expanded beyond 16 bits, UTF-16 replaced UCS-2.

UTF-8. UTF-8 has 8 bit code units. It builds a bridge between the legacy ASCII encoding and Unicode. ASCII only has 128 characters, whose numbers are the same as the first 128 Unicode code points. UTF-8 is backwards compatible, because all ASCII characters are valid code units. In other words, a single code unit in the range 0–127 encodes a single code point in the same range. Such code units are marked by their highest bit being zero. If, on the other hand, the highest bit is one then more units will follow, to provide the additional bits for the higher code points. That leads to the following encoding scheme:

  • 0000–007F: 0xxxxxxx (7 bits, stored in 1 byte)
  • 0080–07FF: 110xxxxx, 10xxxxxx (5+6 bits = 11 bits, stored in 2 bytes)
  • 0800–FFFF: 1110xxxx, 10xxxxxx, 10xxxxxx (4+6+6 bits = 16 bits, stored in 3 bytes)
  • 10000–1FFFFF: 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx (3+6+6+6 bits = 21 bits, stored in 4 bytes)
    (The highest code point is 10FFFF, so UTF-8 has some extra room.)
If the highest bit is not 0 then the number of ones before the zero indicates how many code units there are in a sequence. All code units after the initial one have the bit prefix 10. Therefore, the ranges of initial code units and subsequent code units are disjoint, which helps with recovering from encoding errors.
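
The scheme above can be sketched in code as follows. The function name toUTF8 is made up for this example, and it returns the code units as an array of numbers. This is an illustration of the bit layout, not a production encoder.

```javascript
// Hypothetical helper: encode a single code point as UTF-8 code units
function toUTF8(codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF ||
        (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
        // Surrogates are reserved for UTF-16 and must not be encoded
        throw new RangeError('Cannot encode: ' + codePoint);
    }
    var SIX_BITS = 0x3F; // binary 111111

    if (codePoint <= 0x7F) {
        return [codePoint]; // 0xxxxxxx
    }
    if (codePoint <= 0x7FF) {
        return [
            0xC0 | (codePoint >> 6),         // 110xxxxx
            0x80 | (codePoint & SIX_BITS)    // 10xxxxxx
        ];
    }
    if (codePoint <= 0xFFFF) {
        return [
            0xE0 | (codePoint >> 12),               // 1110xxxx
            0x80 | ((codePoint >> 6) & SIX_BITS),   // 10xxxxxx
            0x80 | (codePoint & SIX_BITS)           // 10xxxxxx
        ];
    }
    return [
        0xF0 | (codePoint >> 18),               // 11110xxx
        0x80 | ((codePoint >> 12) & SIX_BITS),  // 10xxxxxx
        0x80 | ((codePoint >> 6) & SIX_BITS),   // 10xxxxxx
        0x80 | (codePoint & SIX_BITS)           // 10xxxxxx
    ];
}
```

For example, toUTF8(0xF6) (the character “ö”) returns the two code units 0xC3 and 0xB6.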

UTF-8 has become the most popular Unicode format: initially due to its backwards compatibility with ASCII, later due to its broad support across operating systems, programming environments and applications.

JavaScript source code and Unicode

Source code internally

Internally, JavaScript source code is treated as a sequence of UTF-16 code units. Quoting Sect. 6 of the ECMAScript specification:
ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.
In identifiers, string literals and regular expression literals, any code unit can also be expressed via a Unicode escape sequence \uHHHH, where HHHH are four hexadecimal digits. For example:
    > var f\u006F\u006F = 'abc';
    > foo
    'abc'

    > var λ = 123;
    > \u03BB
    123
That means that you can use Unicode characters in literals and variable names, without leaving the ASCII range in the source code.

In string literals, an additional kind of escape is available: hex escape sequences with two-digit hexadecimal numbers that represent code units in the range 0x00–0xFF. For example:

    > '\xF6' === 'ö'
    true
    > '\xF6' === '\u00F6'
    true

Source code externally

While that format is used internally, JavaScript source code is usually not stored as UTF-16. When a web browser loads a source file via a script tag, it determines the encoding as follows:
  • If there is a BOM, the encoding is a UTF variant, depending on what BOM is used.
  • Otherwise, if the source code is loaded via HTTP(S) then the Content-Type header can specify an encoding, via the charset parameter. For example:
        Content-Type: application/javascript; charset=utf-8
    
    Note: the correct media type (formerly known as MIME type) for JavaScript files is application/javascript. However, older browsers (e.g. Internet Explorer 8 and earlier) work most reliably with text/javascript. Unfortunately, the default value for the attribute type of <script> tags is text/javascript. At least, you can omit that attribute for JavaScript; there is no benefit in including it.
  • Otherwise, if the script tag has the attribute charset then that encoding is used. Even though the attribute type holds a valid media type, that type must not have the parameter charset (like in the Content-Type header, above).
  • Otherwise, the encoding of the document is used, in which the script tag resides. For example, this is the beginning of an HTML5 document, where a meta tag declares that the document is encoded as UTF-8.
        <!doctype html>
        <html>
        <head>
            <meta charset="UTF-8">
        ...
    
    It is highly recommended to always specify an encoding. If you don’t, a locale-specific default encoding is used. That is, people will see the file differently in different countries. Only the lowest 7 bits are relatively stable across locales.
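
The lookup order described above can be summarized as a small function. This is only a sketch of the precedence, not how any browser is implemented. The function name and all property names (bomEncoding, httpCharset, scriptCharsetAttr, documentEncoding) are hypothetical.

```javascript
// Hypothetical sketch of the precedence described above.
// `info` is an object whose properties are made up for illustration.
function resolveScriptEncoding(info) {
    // 1. A BOM wins over everything else
    if (info.bomEncoding) return info.bomEncoding;
    // 2. charset parameter of the HTTP Content-Type header
    if (info.httpCharset) return info.httpCharset;
    // 3. charset attribute of the script tag
    if (info.scriptCharsetAttr) return info.scriptCharsetAttr;
    // 4. Encoding of the document that contains the script tag
    if (info.documentEncoding) return info.documentEncoding;
    // Nothing specified: a locale-specific default would be used
    return null;
}
```

For example, resolveScriptEncoding({ scriptCharsetAttr: 'utf-8', documentEncoding: 'iso-8859-1' }) returns 'utf-8'.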
Recommendations:
  • For your own application, you can use Unicode. But you must specify the encoding of the app’s HTML page as UTF-8.
  • For libraries, it’s safest to release code that is ASCII (7 bit).
Some minification tools can translate source with Unicode code points beyond 7 bit to source that is “7 bit clean”. They do so by replacing non-ASCII characters with Unicode escapes. For example, the following invocation of UglifyJS translates the file test.js:
    uglifyjs -b beautify=false,ascii-only=true test.js
The file test.js looks like this:
    var σ = 'Köln';
The output of UglifyJS looks like this:
    var \u03c3="K\xf6ln";
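
The core idea behind such a translation can be sketched in a few lines. This is not how UglifyJS is implemented; the function name escapeNonAscii is made up, and the sketch always emits \uHHHH escapes (never the shorter \xHH form). It is only safe for text where Unicode escapes are permitted, such as identifiers and string literals.

```javascript
// Hypothetical helper: replace every non-ASCII code unit with \uHHHH
function escapeNonAscii(source) {
    return source.replace(/[^\x00-\x7F]/g, function (codeUnit) {
        var hex = codeUnit.charCodeAt(0).toString(16);
        // Pad to four hex digits, as required by \uHHHH
        return '\\u' + ('0000' + hex).slice(-4);
    });
}
```

Applied to the contents of test.js, escapeNonAscii("var σ = 'Köln';") produces the text var \u03c3 = 'K\u00f6ln'; which is 7 bit clean. Astral characters are handled, too, because each half of a surrogate pair is escaped separately.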
Negative example: For a while, the library D3.js was published in UTF-8. That caused an error when it was loaded from a page whose encoding was not UTF-8, because the code contained statements such as
    var π = Math.PI, ε = 1e-6;
The identifiers π and ε were not decoded correctly and not recognized as valid variable names. Additionally, some string literals with code points beyond 7 bit weren’t decoded correctly, either. As a work-around, the code could be loaded by adding the appropriate charset attribute to the script tag:
    <script charset="utf-8" src="d3.js"></script>

JavaScript strings and Unicode

A JavaScript string is a sequence of UTF-16 code units. Quoting the ECMAScript specification, Sect. 8.4:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
Escape sequences. As mentioned before, you can use Unicode escape sequences and hex escape sequences in string literals. For example, you can produce the character “ö” by combining an “o” with a diaeresis (code point 0x0308):
    > console.log('o\u0308')
    ö
This works in command lines, such as web browser consoles and the Node.js REPL in a terminal. You can also insert this kind of string into the DOM of a web page.

Referring to astral plane characters via escapes. There are many nice Unicode symbol tables on the web. Take a look at Tim Whitlock’s “Emoji Unicode Tables” and be amazed by how many symbols there are in modern Unicode fonts. None of the symbols in the table are images, they are all font glyphs. Let’s assume you want to display a character via JavaScript that is in an astral plane. For example, a cow (code point 0x1F404):

🐄
You can either copy the character and paste it directly into your Unicode-encoded JavaScript source:
    var str = '🐄';
JavaScript engines will decode the source (which is most often in UTF-8) and create a string with two UTF-16 code units. Alternatively, you can compute the two code units yourself and use Unicode escape sequences. There are web apps that perform this computation. The previously defined function toUTF16 performs it, too:
    > toUTF16(0x1F404)
    '\\uD83D\\uDC04'
The UTF-16 surrogate pair (0xD83D, 0xDC04) does indeed encode the cow:
    > console.log('\uD83D\uDC04')
    🐄

Counting characters. If a string contains a surrogate pair (two code units encoding a single code point) then the length property no longer counts characters. It counts code units:

    > var str = '🐄';
    > str === '\uD83D\uDC04'
    true
    > str.length
    2
This can be fixed via libraries, such as Mathias Bynens’ Punycode.js, which is bundled with Node.js:
    > var puny = require('punycode');
    > puny.ucs2.decode(str).length
    1
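
If you don’t want to depend on a library, counting code points by hand is not difficult. The following is a minimal sketch (the name countCodePoints is made up); it treats a lead surrogate followed by a tail surrogate as one code point and counts every other code unit, including unpaired surrogates, individually.

```javascript
// Hypothetical helper: count code points instead of code units
function countCodePoints(str) {
    var count = 0;
    for (var i = 0; i < str.length; i++) {
        var unit = str.charCodeAt(i);
        var isLead = unit >= 0xD800 && unit <= 0xDBFF;
        if (isLead && i + 1 < str.length) {
            var next = str.charCodeAt(i + 1);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++; // skip the tail surrogate, the pair counts once
            }
        }
        count++;
    }
    return count;
}
```

For example, countCodePoints('\uD83D\uDC04') is 1. Note that this counts code points, not graphemes: 'o\u0308' (an “o” followed by a combining diaeresis) still has 2 code points.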

Unicode normalization. If you want to search in strings or compare them then you need to normalize, e.g. via the library unorm (by Bjarke Walling).
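
The following sketch shows why normalization matters for comparisons. It assumes an engine that provides the built-in method String.prototype.normalize, which was standardized in ECMAScript 6 and so is newer than this post; in older engines, a library such as unorm would take its place. The function name equalsNormalized is made up.

```javascript
var composed = '\u00F6';    // “ö” as a single code point
var decomposed = 'o\u0308'; // “o” followed by a combining diaeresis

// Hypothetical helper: compare two strings after normalizing both to NFC
function equalsNormalized(a, b) {
    return a.normalize('NFC') === b.normalize('NFC');
}

var plainEqual = (composed === decomposed);                  // false
var normalizedEqual = equalsNormalized(composed, decomposed); // true
```

Both strings are displayed the same way, but only the normalized comparison treats them as equal.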

JavaScript regular expressions and Unicode

Support for Unicode in JavaScript’s regular expressions [1] is very limited. For example, there is no way to match Unicode categories such as “uppercase letter”.

Line terminators influence matching and do have a Unicode definition. A line terminator is one of four characters:

Code unit   Name                  Character escape sequence
\u000A      Line feed             \n
\u000D      Carriage return       \r
\u2028      Line separator
\u2029      Paragraph separator
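
As a small illustration, the following sketch splits a text at all four line terminators. The names are made up, and treating the sequence \r\n as a single line break is an assumption of this example, not something the definition above prescribes.

```javascript
// \r\n is listed first so that it is consumed as one line break
var LINE_BREAK = /\r\n|[\n\r\u2028\u2029]/;

// Hypothetical helper: split a text into lines
function splitLines(text) {
    return text.split(LINE_BREAK);
}

// The dot does not match any of the four line terminators
var dotMatchesLineSeparator = /^.$/.test('\u2028'); // false
```

For example, splitLines('a\nb\r\nc\u2028d\u2029e') returns the five lines 'a', 'b', 'c', 'd' and 'e'.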

The following regular expression constructs support Unicode:

  • Whitespace (\s) and non-whitespace (\S) have Unicode-based definitions:
        > /^\s$/.test('\uFEFF')
        true
    
  • The dot (.) matches all code units (not code points!) except line terminators. See below how to match any code point.
  • In multiline mode, the assertion ^ matches at the beginning of the input and after line terminators, the assertion $ matches before line terminators and at the end of the input. Otherwise, they only match at the beginning or end of the input, respectively.
Other important character classes have definitions that are based on ASCII, not on Unicode:
  • \d matches digits, \D matches non-digits, where a digit is equivalent to [0-9]
  • \w matches word characters, \W matches non-word characters, where a word character is equivalent to [A-Za-z0-9_]
  • \b matches at word breaks, \B matches inside words, where words are sequences of word characters ([A-Za-z0-9_]). Example: In the string 'über', the assertion \b sees the character “b” as starting a word.
        > /\bb/.test('über')
        true
    

Matching any code unit

To match any code unit, you can use [\s\S], see [1]. To match any code point, you need to use:
    ([\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
The above pattern works like this:
    ([BMP code point]|[lead surrogate][tail surrogate])
As all of these ranges are disjoint, the pattern will correctly match code points in well-formed UTF-16 strings.
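
With the flag /g, the same pattern can be used to break a string into its code points. This is a minimal sketch; the names CODE_POINT and toCodePoints are made up. In strings that are not well-formed, unpaired surrogates match neither alternative and are silently skipped.

```javascript
// The pattern from above, with the flag /g to find all matches
var CODE_POINT = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical helper: one array element per code point
function toCodePoints(str) {
    // match() returns null if there are no matches at all
    return str.match(CODE_POINT) || [];
}
```

For example, toCodePoints('a\uD83D\uDC04b') returns an array with three elements, whose second element is the complete surrogate pair.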

Libraries

Regenerate helps with generating ranges like the one above, for matching any code point. It is meant to be used as part of a build tool, but also works dynamically, for trying out things.

XRegExp is a regular expression library that has an official addon for matching Unicode categories, scripts, blocks and properties via one of the following three constructs:

    \p{...} \P{...} \p{^...}
For example, \p{Letter} matches letters in various alphabets.

The future of handling Unicode in JavaScript

Two new standards, one that is in the process of being implemented and another one that is in the process of being designed, will bring better support for Unicode to JavaScript:
  • The ECMAScript Internationalization API [2]: offers Unicode-based collation (sorting and searching) and more.
  • ECMAScript 6: The next version of JavaScript will have several Unicode-related features, such as escapes for arbitrary code points and a method for accessing code points in a string (as opposed to code units). The blog post “Supplementary Characters for ECMAScript” by Norbert Lindenberg explains the plans for Unicode support in ECMAScript 6.
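
To give an impression of the two ECMAScript 6 features mentioned above, here is a short sketch. It uses the syntax and method name that ended up in the final ECMAScript 6 standard (the \u{...} escape and codePointAt), so it only runs in engines that support that version of the language.

```javascript
// Escape for an arbitrary code point: no need to compute surrogates
var cow = '\u{1F404}';
var sameAsPair = (cow === '\uD83D\uDC04'); // true

// codePointAt() reads a complete code point,
// charCodeAt() only reads a single code unit
var codePoint = cow.codePointAt(0); // 0x1F404
var codeUnit = cow.charCodeAt(0);   // 0xD83D
```

The string itself still consists of two code units, so cow.length remains 2.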

Recommended reading and sources of this post

Information on Unicode: Information on Unicode support in JavaScript:

Acknowledgements

The following people helped with this blog post: Mathias Bynens (@mathias), Anne van Kesteren (@annevk), Calvin Metcalf (@CWMma).

References

  1. JavaScript: an overview of the regular expression API
  2. The ECMAScript Internationalization API
