2331. regex_constants::collate's effects are inaccurately summarized

Section: 31.5.1 [re.synopt] Status: Open Submitter: Stephan T. Lavavej Opened: 2013-09-21 Last modified: 2016-02-10

Priority: 3

View all other issues in [re.synopt].

View all issues with Open status.

Discussion:

The table in 31.5.1 [re.synopt]/1 says that regex_constants::collate "Specifies that character ranges of the form "[a-b]" shall be locale sensitive.", but 31.13 [re.grammar]/14 says that it affects individual character comparisons too.

[2012-02-12 Issaquah : recategorize as P3]

Marshall Clow: 28.13/14 only applies to ECMAScript

All: we're unsure

Jonathan Wakely: we should ask John Maddock

Move to P3

[2014-5-14, John Maddock response]

The original intention was the original wording: namely that collate only made character ranges locale sensitive. To be frank it's a feature that's probably hardly ever used (though I have no real hard data on that), and is a leftover from early POSIX standards which required locale sensitive collation for character ranges, and then later changed to implementation defined if I remember correctly (basically nobody implemented locale-dependent collation).

So I guess the question is do we gain anything by requiring all character-comparisons to go through the locale when this bit is set? Certainly it adds a great deal to the implementation effort (it's not what Boost.Regex has ever done). I guess the question is are differing code-points that collate identically an important use case? I guess there might be a few Unicode code points that do that, but I don't know how to go about verifying that.

STL:

If this was unintentional, then 31.5.1 [re.synopt]/1's table should be left alone, while 31.13 [re.grammar]/14 should be changed instead.

Jeffrey Yasskin:

This page mentions that [V] in Swedish should match "W" in a perfect world.

However, the most recent version of TR18 retracts both language-specific loose matches and language-specific ranges because "for most full-featured regular expression engines, it is quite difficult to match under code point equivalences that are not 1:1" and "tailored ranges can be quite difficult to implement properly, and can have very unexpected results in practice. For example, languages may also vary whether they consider lowercase below uppercase or the reverse. This can have some surprising results: [a-Z] may not match anything if Z < a in that locale."

ECMAScript doesn't include collation at all.

IMO, +1 to changing 28.13 instead of 28.5.1. It seems like we'd be on fairly solid ground if we wanted to remove regex_constants::collate entirely, in favor of named character classes, but of course that's not for this issue.

Proposed resolution:

This wording is relative to N3691.

  1. In 31.5.1 [re.synopt]/1, Table 138 — "syntax_option_type effects", change as indicated:

    Table 138 — syntax_option_type effects
    Element Effect(s) if set
    collate Specifies that character ranges of the form "[a-b]"comparisons and character range comparisons shall be locale sensitive.