This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 113d. See http://www.open-std.org/jtc1/sc22/wg21/ for the official list.

2024-03-20


2455. Concatenation of string literals vs translation phases 5 and 6

Section: 5.2  [lex.phases]     Status: CD6     Submitter: Tom Honermann     Date: 2020-07-02

[Addressed by paper P2314R4, adopted at the October, 2021 plenary.]

According to 5.2 [lex.phases] paragraph 1, concatenation of adjacent string literals is performed in translation phase 6, after conversion of the literal values to the execution character set. However, 5.13.5 [lex.string] paragraph 11 indicates that the interpretation of the string contents is dependent on the encoding-prefixes specified for the literals being concatenated:

In translation phase 6 (5.2 [lex.phases]), adjacent string-literals are concatenated. If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix. If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. [Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. —end note]

This seems to indicate that string-literals with different encoding-prefixes are separately converted and then joined, potentially resulting in strings containing code unit sequences corresponding to different character encodings. This reading would contradict the intent, expressed in the adjacent table, that, e.g., u"a" "b" means the same as u"ab".
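The intended reading can be checked at compile time. The following is a minimal sketch, not part of the issue text; the names mixed and single are illustrative only, and it deliberately uses ASCII contents so that the result does not depend on how an implementation resolves the ambiguity described above.

```cpp
#include <type_traits>

// Illustrative names (not from the issue): an unprefixed literal next to a
// u-prefixed one is treated as having the u prefix, so both arrays below
// are expected to hold the same char16_t code units.
constexpr char16_t mixed[]  = u"a" "b";
constexpr char16_t single[] = u"ab";

// The concatenated literal has the type of a char16_t string literal:
// two characters plus the terminating null.
static_assert(std::is_same<decltype(u"a" "b"), const char16_t (&)[3]>::value,
              "concatenation yields a char16_t string literal");

static_assert(sizeof(mixed) == sizeof(single), "same length");
static_assert(mixed[0] == single[0] && mixed[1] == single[1] &&
              mixed[2] == single[2],
              "same code units");
```

With non-ASCII contents in the unprefixed literal, the two readings could produce different code units, which is where implementations may differ.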

There is implementation divergence in the handling of this specification.

Phases 5 and 6 cannot simply be reversed, because interpretation of escape sequences must precede concatenation, as specified later in the same paragraph:

Characters in concatenated strings are kept distinct.

[Example:

"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB'). —end example]
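The example above can be verified with a short compile-time sketch. The names separate and fused are illustrative and do not appear in the issue.

```cpp
// Illustrative names (not from the issue).
// Two adjacent literals: the escape sequence \xA ends at the literal
// boundary, so 'B' remains its own character.
constexpr char separate[] = "\xA" "B";

// One literal: \xAB is a single hexadecimal escape sequence.
constexpr char fused[] = "\xAB";

static_assert(sizeof(separate) == 3, "two characters plus terminator");
static_assert(separate[0] == '\xA' && separate[1] == 'B',
              "characters are kept distinct");
static_assert(sizeof(fused) == 2, "one character plus terminator");
```

If concatenation were performed before escape sequences were interpreted, separate would instead have the same contents as fused.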

Richard Smith suggested that "we should remove phases 5 and 6 entirely, parse one or more string-literal tokens as a string literal expression, and only perform the translation from the contents of the string literal tokens into characters in the execution character set as part of specifying the semantics of a string literal expression."