This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 115e. See http://www.open-std.org/jtc1/sc22/wg21/ for the official list.

2024-11-11


1802. char16_t string literals and surrogate pairs

Section: 5.13.5  [lex.string]     Status: CD4     Submitter: Jeffrey Yasskin     Date: 2013-10-30

[Moved to DR at the November, 2014 meeting.]

The intent of char16_t string literals, as evident from 5.13.5 [lex.string] paragraph 9, is that they be encoded in UTF-16, using surrogate pairs to represent code points outside the basic multi-lingual plane:

A single c-char may produce more than one char16_t character in the form of surrogate pairs.
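As an illustration of that sentence, the following is a minimal sketch of compile-time checks, assuming an implementation that encodes char16_t string literals as UTF-16. The names kGrin and kGrinUnits are hypothetical, and U+1F600 is chosen only as a sample code point outside the basic multi-lingual plane.

```cpp
#include <cstddef>

// One universal-character-name naming a code point outside the BMP.
// (kGrin is a hypothetical name used only for this illustration.)
constexpr char16_t kGrin[] = u"\U0001F600";

// Number of char16_t code units, excluding the terminating u'\0'.
constexpr std::size_t kGrinUnits = sizeof(kGrin) / sizeof(kGrin[0]) - 1;

// Assuming UTF-16 encoding, the single c-char yields a surrogate pair.
static_assert(kGrinUnits == 2, "one c-char produced two code units");
static_assert(kGrin[0] == 0xD83D, "leading (high) surrogate");
static_assert(kGrin[1] == 0xDE00, "trailing (low) surrogate");
```

The array therefore holds three char16_t elements: the two surrogates and the terminator.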

Paragraph 15, however, is inconsistent with this approach, saying,

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (5.13.3 [lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \.

The inconsistency arises because code points outside the basic multi-lingual plane are ill-formed in char16_t character literals (5.13.3 [lex.ccon]):

A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed.
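The restriction quoted above can be sketched as follows. The helper fits_in_one_code_unit and the constant kEAcute are hypothetical names introduced for this illustration; the helper only models the "representable within 16 bits" condition and is not part of any library. The ill-formed case is left commented out, since how a given compiler diagnoses it may vary.

```cpp
// Hypothetical helper: models the condition in the quoted wording, namely
// that the code point is representable with a single 16-bit code unit.
constexpr bool fits_in_one_code_unit(char32_t code_point) {
    return code_point <= 0xFFFF;
}

// A basic multi-lingual plane code point: well-formed, and the literal's
// value equals the ISO 10646 code point value.
constexpr char16_t kEAcute = u'\u00E9';
static_assert(kEAcute == 0x00E9, "value equals the code point");
static_assert(fits_in_one_code_unit(0x00E9), "U+00E9 is in the BMP");

// A code point outside the BMP does not fit, so under the quoted wording
// the corresponding character literal would be ill-formed:
static_assert(!fits_in_one_code_unit(0x1F600), "U+1F600 needs two units");
// constexpr char16_t kBad = u'\U0001F600';  // ill-formed per the text above
```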

It should be clarified that this restriction does not apply to char16_t string literals.

Proposed resolution (February, 2014):

Change 5.13.5 [lex.string] paragraph 16 as follows:

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (5.13.3 [lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a char16_t string literal may yield a surrogate pair. In a narrow string literal...
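To make "may yield a surrogate pair" concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic, compared against what a char16_t string literal contains. It assumes the implementation encodes such literals as UTF-16; the helper names (needs_surrogate_pair, high_surrogate, low_surrogate) and kClef are hypothetical and do not come from the issue or from the standard library.

```cpp
// Hypothetical helpers implementing UTF-16 surrogate-pair arithmetic.

// True for code points beyond the BMP (U+10000 through U+10FFFF).
constexpr bool needs_surrogate_pair(char32_t cp) {
    return cp > 0xFFFF && cp <= 0x10FFFF;
}

// Leading surrogate: 0xD800 plus the top 10 bits of (cp - 0x10000).
constexpr char16_t high_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

// Trailing surrogate: 0xDC00 plus the low 10 bits of (cp - 0x10000).
constexpr char16_t low_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// U+1D11E is used only as a sample code point outside the BMP.
constexpr char16_t kClef[] = u"\U0001D11E";

static_assert(needs_surrogate_pair(0x1D11E), "outside the BMP");
static_assert(kClef[0] == high_surrogate(0x1D11E), "first unit: 0xD834");
static_assert(kClef[1] == low_surrogate(0x1D11E), "second unit: 0xDD1E");
```

The same universal-character-name in a char16_t character literal would remain ill-formed; only the string-literal case produces the pair.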