523. regex case-insensitive character ranges are unimplementable as specified

Section: 31 [re] Status: LEWG Submitter: Eric Niebler Opened: 2005-07-01 Last modified: 2016-02-10

Priority: Not Prioritized

View other active issues in [re].

View all other issues in [re].

View all issues with LEWG status.

Discussion:

A problem with TR1 regex is currently being discussed on the Boost developers list. It involves the handling of case-insensitive matching of character ranges such as [Z-a]. The proper behavior (according to the ECMAScript standard) is unimplementable given the current specification of the TR1 regex_traits<> class template. John Maddock, the author of the TR1 regex proposal, agrees there is a problem. The full discussion can be found at http://lists.boost.org/boost/2005/06/28850.php (first message copied below). We don't have any recommendations as yet.

-- Begin original message --

The situation of interest is described in the ECMAScript specification (ECMA-262), section 15.10.2.15:

"Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. Thus, for example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well as the symbols [, \, ], ^, _, and `."

A more interesting case is what should happen when doing a case-insentitive match on a range such as [Z-a]. It should match z, Z, a, A and the symbols [, \, ], ^, _, and `. This is not what happens with Boost.Regex (it throws an exception from the regex constructor).

The tough pill to swallow is that, given the specification in TR1, I don't think there is any effective way to handle this situation. According to the spec, case-insensitivity is handled with regex_traits<>::translate_nocase(CharT) -- two characters are equivalent if they compare equal after both are sent through the translate_nocase function. But I don't see any way of using this translation function to make character ranges case-insensitive. Consider the difficulty of detecting whether "z" is in the range [Z-a]. Applying the transformation to "z" has no effect (it is essentially std::tolower). And we're not allowed to apply the transformation to the ends of the range, because as ECMA-262 says, "the case of the two ends of a range is significant."

So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix is to redefine translate_nocase to return a string_type containing all the characters that should compare equal to the specified character. But this function is hard to implement for Unicode, and it doesn't play nice with the existing ctype facet. What a mess!

-- End original message --

[ John Maddock adds: ]

One small correction, I have since found that ICU's regex package does implement this correctly, using a similar mechanism to the current TR1.Regex.

Given an expression [c1-c2] that is compiled as case insensitive it:

Enumerates every character in the range c1 to c2 and converts it to it's case folded equivalent. That case folded character is then used a key to a table of equivalence classes, and each member of the class is added to the list of possible matches supported by the character-class. This second step isn't possible with our current traits class design, but isn't necessary if the input text is also converted to a case-folded equivalent on the fly.

ICU applies similar brute force mechanisms to character classes such as [[:lower:]] and [[:word:]], however these are at least cached, so the impact is less noticeable in this case.

Quick and dirty performance comparisons show that expressions such as "[X-\\x{fff0}]+" are indeed very slow to compile with ICU (about 200 times slower than a "normal" expression). For an application that uses a lot of regexes this could have a noticeable performance impact. ICU also has an advantage in that it knows the range of valid characters codes: code points outside that range are assumed not to require enumeration, as they can not be part of any equivalence class. I presume that if we want the TR1.Regex to work with arbitrarily large character sets enumeration really does become impractical.

Finally note that Unicode has:

Three cases (upper, lower and title). One to many, and many to one case transformations. Character that have context sensitive case translations - for example an uppercase sigma has two different lowercase forms - the form chosen depends on context(is it end of a word or not), a caseless match for an upper case sigma should match either of the lower case forms, which is why case folding is often approximated by tolower(toupper(c)).

Probably we need some way to enumerate character equivalence classes, including digraphs (either as a result or an input), and some way to tell whether the next character pair is a valid digraph in the current locale.

Hoping this doesn't make this even more complex that it was already,

[ Portland: Alisdair: Detect as invalid, throw an exception. Pete: Possible general problem with case insensitive ranges. ]

[ 2009-07 Frankfurt ]

We agree that this is a problem, but we do not know the answer.

We are going to declare this NAD until existing practice leads us in some direction.

No objection to NAD Future.

Move to NAD Future.

Proposed resolution: