This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 115e. See http://www.open-std.org/jtc1/sc22/wg21/ for the official list.

2024-11-11


1332. Handling of invalid universal-character-names

Section: 5.3  [lex.charset]     Status: CD5     Submitter: Mike Miller     Date: 2011-06-20

According to 5.3 [lex.charset] paragraph 2,

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800-0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00-0x1F or 0x7F-0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.

It is not specified what should happen if the hexadecimal value does not designate a Unicode code point: is that undefined behavior or does it make the program ill-formed?

As an aside, a note should be added explaining why these requirements apply to to an r-char-sequence when, as the footnote at the end of the paragraph explains,

A sequence of characters resembling a universal-character-name in an r-char-sequence (5.13.5 [lex.string]) does not form a universal-character-name.

Additional note, February, 2021:

This issue was resolved editorially in N4842.