This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 116a. See http://www.open-std.org/jtc1/sc22/wg21/ for the official list.
2024-12-19
[Moved to DR at the November, 2014 meeting.]
The intent of char16_t string literals, as evident from 5.13.5 [lex.string] paragraph 9, is that they be encoded in UTF-16, that is, including surrogate pairs to represent code points outside the basic multi-lingual plane:
A single c-char may produce more than one char16_t character in the form of surrogate pairs.
Paragraph 15, however, is inconsistent with this approach, stating:
Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (5.13.3 [lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \.
The inconsistency arises because a code point outside the basic multi-lingual plane makes a char16_t character literal ill-formed:
A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed.
It should be clarified that this restriction does not apply to char16_t string literals.
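The difference between the two kinds of literal can be illustrated with a minimal sketch. It assumes an implementation that encodes char16_t string literals as UTF-16, as the issue describes; the helper names code_units, is_high_surrogate, and is_low_surrogate are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Hypothetical helper: number of char16_t code units in a string literal,
// excluding the terminating null character.
template <std::size_t N>
constexpr std::size_t code_units(const char16_t (&)[N]) { return N - 1; }

// Hypothetical helpers: classify a UTF-16 code unit.
constexpr bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// U+00E9 is a basic multi-lingual plane code point: one code unit.
static_assert(code_units(u"\u00E9") == 1, "BMP code point is one code unit");

// U+1F600 lies outside the BMP: the single universal-character-name
// yields two code units, a surrogate pair.
static_assert(code_units(u"\U0001F600") == 2, "non-BMP code point is two code units");
static_assert(is_high_surrogate(u"\U0001F600"[0]), "first unit is a high surrogate");
static_assert(is_low_surrogate(u"\U0001F600"[1]), "second unit is a low surrogate");

// A char16_t character literal accepts the BMP code point...
constexpr char16_t bmp = u'\u00E9';

// ...but the non-BMP code point is not representable in a single 16-bit
// code unit, so uncommenting the next line makes the program ill-formed.
// constexpr char16_t bad = u'\U0001F600';
```

If universal-character-names in string literals had exactly the same meaning as in character literals, the second string literal above would inherit the restriction shown in the commented-out line.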
Proposed resolution (February, 2014):
Change 5.13.5 [lex.string] paragraph 16 as follows:
Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (5.13.3 [lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a char16_t string literal may yield a surrogate pair. In a narrow string literal...
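As a sketch of what "may yield a surrogate pair" means in terms of values, the following computes the two UTF-16 code units for a code point outside the basic multi-lingual plane, using the standard UTF-16 arithmetic. The function names are hypothetical and not part of any library; the final checks assume an implementation that encodes char16_t string literals as UTF-16.

```cpp
// Hypothetical helpers illustrating UTF-16 surrogate-pair arithmetic.

// True when the code point does not fit in a single 16-bit code unit.
constexpr bool needs_surrogate_pair(char32_t cp) { return cp > 0xFFFF; }

// High (leading) surrogate: 0xD800 plus the top 10 bits of (cp - 0x10000).
constexpr char16_t high_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

// Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of (cp - 0x10000).
constexpr char16_t low_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

static_assert(!needs_surrogate_pair(U'\u00E9'), "BMP code point needs one unit");
static_assert(needs_surrogate_pair(U'\U0001F600'), "non-BMP code point needs two units");

// The computed pair matches what the string literal contains.
static_assert(high_surrogate(U'\U0001F600') == u"\U0001F600"[0], "high surrogate matches");
static_assert(low_surrogate(U'\U0001F600') == u"\U0001F600"[1], "low surrogate matches");
```

The two surrogate helpers are only meaningful for code points in the range U+10000 to U+10FFFF; the sketch does not validate its input.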