The proposal “RegExp Unicode Property Escapes” by Mathias Bynens is at stage 4. This blog post explains how it works.
JavaScript lets you match characters by mentioning the “names” of sets of characters. For example, \s stands for “whitespace”:
> /^\s+$/u.test('\t \n\r')
true
The proposal lets you additionally match characters by mentioning their Unicode character properties (what those are is explained next) inside the curly braces of \p{}. Two examples:
> /^\p{White_Space}+$/u.test('\t \n\r')
true
> /^\p{Script=Greek}+$/u.test('μετά')
true
As you can see, one of the benefits of property escapes is that they make regular expressions more self-descriptive. Additional benefits will become clear later.
Before we delve into how property escapes work, let’s examine what Unicode character properties are.
In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character. Quoting the Unicode Standard, Sect. 3.3, D3:
The semantics of a character are determined by its identity, normative properties, and behavior.
These are a few examples of properties:
- Name: a unique name, composed of uppercase letters, digits, hyphens and spaces. For example:
  - Name = LATIN CAPITAL LETTER A
  - Name = GRINNING FACE
- General_Category: categorizes characters. For example:
  - General_Category = Lowercase_Letter
  - General_Category = Currency_Symbol
- White_Space: marks invisible spacing characters, such as spaces, tabs and newlines. For example:
  - White_Space = True
  - White_Space = False
- Age: the version of the Unicode Standard in which a character was introduced. For example, the Euro sign € was added in version 2.1 of the Unicode Standard:
  - Age = 2.1
- Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
  - Block = Basic_Latin (range U+0000..U+007F)
  - Block = Cyrillic (range U+0400..U+04FF)
- Script: a collection of characters used by one or more writing systems. For example:
  - Script = Greek
  - Script = Hebrew
The following types of properties exist:
- Enumerated property: General_Category is an enumerated property.
- Boolean property: a property whose values are True and False. Boolean properties are also called binary, because they are like markers that characters either have or not. White_Space is a binary property.
- Catalog property: Age and Script are catalog properties.
- Miscellaneous property: Name is a miscellaneous property.
Properties and property values are matched as follows:
- Loose matching: differences in case, whitespace, underscores and hyphens are ignored. "General_Category", "general category", "-general-category-", "GeneralCategory" are all considered to be the same property.
- Aliases: PropertyAliases.txt and PropertyValueAliases.txt define alternative ways of referring to properties and property values. For example:
  - Property names: General_Category, gc
  - Property values: Lowercase_Letter, Ll
  - Property values: Currency_Symbol, Sc
  - Property values: True, T, Yes, Y
  - Property values: False, F, No, N
Unicode property escapes look like this:
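1. Match all characters whose property prop has the value value: \p{prop=value}
2. Match all characters that do not have a property prop whose value is value: \P{prop=value}
3. Match all characters whose binary property bin_prop is True: \p{bin_prop}
4. Match all characters whose binary property bin_prop is False: \P{bin_prop}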
Forms (3) and (4) can also be used as abbreviations for General_Category values. For example, \p{Lowercase_Letter} is an abbreviation for \p{General_Category=Lowercase_Letter}.
Important: To use property escapes, a regular expression must have the flag /u. Without /u, \p is the same as p.
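The four forms, the General_Category shorthand and the role of the /u flag can be tried out together. The following is a minimal sketch that assumes an engine with property escapes enabled; the variable names are only illustrative, and the test strings deliberately use single-code-point characters.

```javascript
// Form 1: property=value
const greek = /^\p{Script=Greek}+$/u;
// Form 2: negated property=value
const notGreek = /^\P{Script=Greek}+$/u;
// Form 3: binary property is True
const upper = /^\p{Uppercase}+$/u;
// Form 4: binary property is False
const notUpper = /^\P{Uppercase}+$/u;

// General_Category shorthand: long name, short alias, and the abbreviated form
const lowerLong = /^\p{General_Category=Lowercase_Letter}+$/u;
const lowerAlias = /^\p{gc=Ll}+$/u;
const lowerShort = /^\p{Ll}+$/u;

console.log(greek.test('αβγ'));     // true
console.log(notGreek.test('abc'));  // true
console.log(upper.test('ABC'));     // true
console.log(notUpper.test('abc'));  // true
console.log(lowerLong.test('abc'), lowerAlias.test('abc'), lowerShort.test('abc')); // true true true

// Without /u, \p is just the letter p and the braces are literal characters
const noUnicodeFlag = /^\p{Ll}$/;
console.log(noUnicodeFlag.test('p{Ll}')); // true
console.log(noUnicodeFlag.test('a'));     // false
```

The last two lines show why forgetting /u is easy to miss: the pattern still compiles, it just matches something entirely different.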
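Things to note:
- Property names and values must be written exactly as they appear in PropertyAliases.txt and PropertyValueAliases.txt; property escapes do not support loose matching.
- The supported properties include:
  - General_Category
  - Script
  - Script_Extensions
  - Binary properties such as Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, Assigned, ID_Start, ID_Continue, Join_Control, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base.
Matching whitespace: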
> /^\p{White_Space}+$/u.test('\t \n\r')
true
Matching letters:
> /^\p{Letter}+$/u.test('πüé')
true
Matching Greek letters:
> /^\p{Script=Greek}+$/u.test('μετά')
true
Matching Latin letters:
> /^\p{Script=Latin}+$/u.test('Grüße')
true
> /^\p{Script=Latin}+$/u.test('façon')
true
> /^\p{Script=Latin}+$/u.test('mañana')
true
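Script-based escapes also work with the /g flag, which makes it possible to pull all runs of a given script out of mixed text. A minimal sketch, assuming property escapes are available; runsOfScript is a hypothetical helper introduced only for this illustration:

```javascript
// Hypothetical helper (not part of the proposal): collects all runs of
// characters whose Script property equals the given script name.
function runsOfScript(script, text) {
  // The script name must be spelled exactly as in PropertyValueAliases.txt;
  // an unknown name makes the RegExp constructor throw a SyntaxError.
  const regex = new RegExp(`\\p{Script=${script}}+`, 'gu');
  return text.match(regex) || [];
}

const sample = 'abc αβγ def δεζ';
console.log(runsOfScript('Latin', sample)); // [ 'abc', 'def' ]
console.log(runsOfScript('Greek', sample)); // [ 'αβγ', 'δεζ' ]
```

Spaces are not part of either script, so they act as separators here.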
Matching lone surrogate characters:
> /^\p{Surrogate}+$/u.test('\u{D83D}')
true
> /^\p{Surrogate}+$/u.test('\u{DE00}')
true
Note that Unicode code points in astral planes (such as emojis) are composed of two JavaScript characters (a leading surrogate and a trailing surrogate). Therefore, you’d expect the previous regular expression to match the emoji 😀, which is all surrogates:
> '😀'.length
2
> '😀'.charCodeAt(0).toString(16)
'd83d'
> '😀'.charCodeAt(1).toString(16)
'de00'
However, with the /u flag, property escapes match code points, not JavaScript characters:
> /^\p{Surrogate}+$/u.test('😀')
false
In other words, 😀 is considered to be a single character:
> /^.$/u.test('😀')
true
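The difference between code points and JavaScript characters (code units) can be made explicit by comparing the same string with and without /u. This is a small sketch under the assumption that property escapes are enabled; the variable names are illustrative.

```javascript
const emoji = '\u{1F600}'; // 😀, a single astral code point

// Two JavaScript characters (code units), but one code point
console.log(emoji.length);      // 2
console.log([...emoji].length); // 1

// Without /u, the dot matches a single code unit; with /u, a whole code point
console.log(/^.$/.test(emoji));  // false
console.log(/^.$/u.test(emoji)); // true

// Taken apart, each half is a lone surrogate and matches \p{Surrogate}
const halves = emoji.split('');
console.log(halves.every(half => /^\p{Surrogate}$/u.test(half))); // true

// Taken together, the pair is one code point that is not a surrogate
console.log(/^\p{Surrogate}+$/u.test(emoji));        // false
console.log(/^\p{Emoji_Presentation}$/u.test(emoji)); // true
```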
V8 5.8+ implements this proposal. It is switched on via the flag --harmony_regexp_property:
node --harmony_regexp_property
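- Node.js: check the version of V8 via npm version
- Chrome:
  - Go to chrome://version/ to check the version of V8 and the executable path, for example: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
  - Start Chrome like this: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' --js-flags="--harmony_regexp_property"
JavaScript:
- The section on the flag /u (unicode) in “Exploring ES6”
The Unicode standard:
- Unicode Character Database files: PropList.txt, PropertyAliases.txt, PropertyValueAliases.txt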