This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 110c. See for the official list.


1335. Stringizing, extended characters, and universal-character-names

Section: 15.6.3  [cpp.stringize]     Status: CD6     Submitter: Johannes Schaub     Date: 2011-07-03     Liaison: WG14

[Resolved at the October, 2021 meeting by paper P2314R4.]

When a string literal containing an extended character is stringized (15.6.3 [cpp.stringize]), the result contains a universal-character-name instead of the original extended character. The reason is that the extended character is translated to a universal-character-name in translation phase 1 (5.2 [lex.phases]), so that the string literal "@" (where @ represents an extended character) becomes "\uXXXX". Because the preprocessing token is a string literal, when the stringizing occurs in translation phase 4, the \ is doubled, and the resulting string literal is "\"\\uXXXX\"". As a result, the universal-character-name is not recognized as such when the translation to the execution character set occurs in translation phase 5. (Note that phase 5 translation does occur if the stringized extended character does not appear in a string literal.) Existing practice appears to ignore these rules and preserve extended characters in stringized string literals, however.

See also issue 578.

Additional note (August, 2013):

Implementations are granted substantial latitude in their handling of extended characters and universal-character-names in 5.2 [lex.phases] paragraph 1 phase 1, i.e.,

(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

However, this freedom is mostly nullified by the requirements of stringizing in 15.6.3 [cpp.stringize] paragraph 2:

If, in the replacement list, a parameter is immediately preceded by a # preprocessing token, both are replaced by a single character string literal preprocessing token that contains the spelling of the preprocessing token sequence for the corresponding argument.

This means that, in order to handle a construct like

  #define STRINGIZE_LITERAL( X ) # X
  #define STRINGIZE( X ) STRINGIZE_LITERAL( X )

  STRINGIZE( STRINGIZE( identifier_\u00fC\U000000Fc ) )

an implementation must recall the original spelling, including the form of UCN and the capitalization of any non-numeric hexadecimal digits, rather than simply translating the characters into a convenient internal representation.

To effect the freedom asserted in 5.2 [lex.phases], the description of stringizing should make the spelling of a universal-character-name implementation-defined.

Additional note (February, 2022):

P2314R4 Character sets and encodings (approved in October, 2021) effected changes so that extended characters are no longer translated to UCNs in phase 1.