This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 115e. See http://www.open-std.org/jtc1/sc22/wg21/ for the official list.

2024-11-11


872. Lexical issues with raw strings

Section: 5.13.5  [lex.string]     Status: CD2     Submitter: Joseph Myers     Date: 16 April, 2009

[Voted into WP at March, 2010 meeting as document N3077.]

The specification of raw string literals interacts poorly with the specification of preprocessing tokens. The grammar in 5.4 [lex.pptoken] has a production reading

This is echoed in the max-munch rule in paragraph 3:

If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.

This raises questions about the handling of raw string literals. Consider, for instance,

    #define R "x"
    const char* s = R"y";

The character sequence R"y" does not satisfy the syntactic requirements for a raw string. Should it be diagnosed as an ill-formed attempt at a raw string, or should it be well-formed, interpreting R as a preprocessor token that is a macro name and thus initializing s with a pointer to the string "xy"?

For another example, consider:

    #define R "]"
    const char* x = R"foo[";

Presumably this means that the entire rest of the file must be scanned for the characters ]foo" and, if they are not found, macro-expand R and initialize x with a pointer to the string "]foo[". Is this the intended result?

Finally, does the requirement in 5.13.5 [lex.string] that

A d-char-sequence shall consist of at most 16 characters.

mean that

    #define R "x"
    const char* y = R"12345678901234567[y]12345678901234567";

is ill-formed, or a valid initialization of y with a pointer to the string "x12345678901234567[y]12345678901234567"?

Additional note, June, 2009:

The translation of characters that are not in the basic source character set into universal-character-names in translation phase 1 raises an additional problem: each such character will occupy at least six of the 16 r-chars that are permitted. Thus, for example, R"@@@[]@@@" is ill-formed because @@@ becomes \u0040\u0040\u0040, which is 18 characters.

One possibility for addressing this might be to disallow the \ character completely as an d-char, which would have the effect of restricting r-chars to the basic source character set.

Proposed resolution (October, 2009):

  1. Change the grammar in 5.13.5 [lex.string] as follows:

  2. Change 5.13.5 [lex.string] paragraph 2 as follows:

  3. A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters. If the input stream contains a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", those characters are considered to begin a raw string literal even if that literal is not well-formed. [Example:

      #define R "x"
      const char* s = R"y"; // ill-formed raw string, not "x" "y"
    
    

    end example]