CWG Issue 872

This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 117a. See http://www.open-std.org/jtc1/sc22/wg21/ for the official list.

2025-04-13

872. Lexical issues with raw strings

Section: 5.13.5 [lex.string]    Status: CD2    Submitter: Joseph Myers    Date: 16 April, 2009

[Voted into WP at March, 2010 meeting as document N3077.]

The specification of raw string literals interacts poorly with the specification of preprocessing tokens. The grammar in 5.5 [lex.pptoken] has a production reading

each non-white-space character that cannot be one of the above

This is echoed in the max-munch rule in paragraph 3:

If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.
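As a minimal illustration of the max-munch rule quoted above, independent of raw strings, the following sketch uses the familiar `+++` case. The function name `post_increment_then_add` is hypothetical and chosen only for this example.

```cpp
#include <cassert>

// The characters "+++" are split as "++" followed by "+", because "++" is
// the longest preprocessing token that can start at the first '+'.
// The expression is therefore (a++) + b, not a + (++b).
int post_increment_then_add(int a, int b) {
    return a+++b;
}

// The line below is deliberately commented out. Maximal munch would split
// "+++++" as "++ ++ +", giving (a++)++ + b, which does not compile because
// a++ is not an lvalue, even though "a++ + ++b" would have been valid.
// int rejected(int a, int b) { return a+++++b; }
```

The commented-out function shows the "even if that would cause further lexical analysis to fail" part of the rule: the lexer does not back up to find a tokenization that would parse.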

This raises questions about the handling of raw string literals. Consider, for instance,

    #define R "x"
    const char* s = R"y";

The character sequence R"y" does not satisfy the syntactic requirements for a raw string. Should it be diagnosed as an ill-formed attempt at a raw string, or should it be well-formed, interpreting R as a preprocessing token that is a macro name and thus initializing s with a pointer to the string "xy"?

For another example, consider:

    #define R "]"
    const char* x = R"foo[";

Presumably this means that the entire rest of the file must be scanned for the characters ]foo" and that, if they are not found, R is macro-expanded and x is initialized with a pointer to the string "]foo[". Is this the intended result?

Finally, does the requirement in 5.13.5 [lex.string] that

A d-char-sequence shall consist of at most 16 characters.

mean that

    #define R "x"
    const char* y = R"12345678901234567[y]12345678901234567";

is ill-formed, or a valid initialization of y with a pointer to the string "x12345678901234567[y]12345678901234567"?
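The three cases above all turn on whether the characters following R" can be matched as a complete raw string. The following is a rough sketch of that matching step, assuming the square-bracket delimiter syntax of the drafts discussed here (the published C++11 standard uses parentheses instead). The function name `match_raw_string_body` and its interface are hypothetical; this is not how any particular compiler is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper. `src` is the input immediately after the characters R".
// Returns the number of characters consumed through the closing double quote,
// or std::nullopt if the input cannot be matched as a raw string.
std::optional<std::size_t> match_raw_string_body(std::string_view src) {
    constexpr std::size_t max_delimiter_length = 16;

    // The d-char-sequence runs up to the first '['.
    const std::size_t open = src.find('[');
    if (open == std::string_view::npos || open > max_delimiter_length) {
        return std::nullopt;
    }

    // Reject delimiter characters excluded from d-char: space, ']',
    // backslash, and the listed control characters.
    const std::string_view delimiter = src.substr(0, open);
    if (delimiter.find_first_of(" ]\\\t\v\f\n") != std::string_view::npos) {
        return std::nullopt;
    }

    // The literal ends at the first occurrence of ]delimiter" and that
    // search may have to run to the end of the input.
    const std::string closing = "]" + std::string(delimiter) + "\"";
    const std::size_t close = src.find(std::string_view(closing), open + 1);
    if (close == std::string_view::npos) {
        return std::nullopt;
    }
    return close + closing.size();
}
```

In this sketch, a `std::nullopt` result corresponds to the situations the issue asks about: the question is whether the translation unit is then ill-formed or whether R is reconsidered as an identifier and macro-expanded.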

Additional note, June, 2009:

The translation of characters that are not in the basic source character set into universal-character-names in translation phase 1 raises an additional problem: each such character will occupy at least six of the 16 d-chars that are permitted. Thus, for example, R"@@@[]@@@" is ill-formed because the delimiter @@@ becomes \u0040\u0040\u0040, which is 18 characters.
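The arithmetic in this note can be sketched as follows. The helper names `in_basic_source_set` and `length_after_phase1` are hypothetical, and the sketch assumes that every character outside the basic source character set is replaced by a six-character \uXXXX form, which is the minimum mentioned above; characters needing the longer \UXXXXXXXX form are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if c belongs to the basic source character set: letters, digits,
// the listed punctuation, space, and the four control characters.
bool in_basic_source_set(char c) {
    static constexpr std::string_view basic =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"' \t\v\f\n";
    return basic.find(c) != std::string_view::npos;
}

// Length of a delimiter once each character outside the basic source
// character set has been rewritten as a six-character \uXXXX sequence
// (an assumption of this sketch, see the text above).
std::size_t length_after_phase1(std::string_view delimiter) {
    std::size_t length = 0;
    for (char c : delimiter) {
        length += in_basic_source_set(c) ? 1 : 6;
    }
    return length;
}
```

Under these assumptions, a three-character delimiter such as @@@ already exceeds the 16-character limit, while a delimiter drawn from the basic source character set counts one per character.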

One possibility for addressing this might be to disallow the \ character completely as a d-char, which would have the effect of restricting d-chars to the basic source character set.

Proposed resolution (October, 2009):

Change the grammar in 5.13.5 [lex.string] as follows:

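d-char:
    any member of the basic source character set except: space, the left square bracket [, the right square bracket ], the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.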

Change 5.13.5 [lex.string] paragraph 2 as follows:

A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters. If the input stream contains a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", those characters are considered to begin a raw string literal even if that literal is not well-formed. [Example:
    #define R "x"
    const char* s = R"y"; // ill-formed raw string, not "x" "y"
—end example]
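For contrast with the ill-formed example, the following sketch shows two well-formed spellings. It uses the parenthesis delimiters of the published C++11 standard rather than the square brackets of the drafts discussed here; the variable names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

#define R "x"

// Whitespace after R keeps it a separate identifier token, so it is
// macro-expanded and the adjacent literals "x" "y" are concatenated.
const char* concatenated = R "y";

// R immediately followed by a double quote begins a raw string literal,
// so the macro named R is never considered here.
const char* raw = R"(y)";

#undef R
```

Here `concatenated` points to the string "xy" and `raw` points to the string "y", so the presence or absence of whitespace after R decides which reading applies.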