This page is a snapshot from the LWG issues list; see the Library Active Issues List for more information and the meaning of WP status.
Section: 28.5.6.5 [format.string.escaped] Status: WP Submitter: Tom Honermann Opened: 2023-07-31 Last modified: 2023-11-22
Priority: Not Prioritized
Discussion:
The C++23 DIS contains the following example in 28.5.6.5 [format.string.escaped] p3. (This example does not appear in the most recent N4950 WP or on https://eel.is/c++draft because the project editor has not yet merged the changes needed to support rendering of some of the characters involved.)
string s6 = format("[{:?}]", "🤷♂️"); // s6 has value: ["🤷\u{200d}♂\u{fe0f}"]
The character to be formatted (🤷♂️) consists of the following sequence of code points in the order presented:
U+1F937 (SHRUG)
U+200D (ZERO WIDTH JOINER)
U+2642 (MALE SIGN)
U+FE0F (VARIATION SELECTOR-16)
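As an illustration of the code point sequence listed above, the following minimal sketch decodes the UTF-8 bytes of the example string. The names `decode_utf8` and `kShrugBytes` are hypothetical and not part of the standard or of this issue. The decoder assumes well-formed input and performs no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// UTF-8 bytes of the example string, written as hex escapes so that the
// invisible code points (U+200D, U+FE0F) cannot be lost in transit.
const std::string kShrugBytes =
    "\xF0\x9F\xA4\xB7"   // U+1F937 SHRUG
    "\xE2\x80\x8D"       // U+200D  ZERO WIDTH JOINER
    "\xE2\x99\x82"       // U+2642  MALE SIGN
    "\xEF\xB8\x8F";      // U+FE0F  VARIATION SELECTOR-16

// Hypothetical helper: decodes well-formed UTF-8 into code points.
// Sketch only: overlong forms, surrogates, and truncated input are not rejected.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // Number of continuation bytes implied by the lead byte.
        const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte: 7, 5, 4, or 3 bits.
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int k = 1; k <= extra && i + k < s.size(); ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            cp = (cp << 6) | (cont & 0x3F);  // each continuation byte adds 6 bits
        }
        out.push_back(cp);
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}
```

Calling `decode_utf8(kShrugBytes)` yields four code points, in the order given in the list above.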
28.5.6.5 [format.string.escaped] bullet 2.2.1 specifies which code points are to be formatted as a \u{hex-digit-sequence} escape sequence:
(2.2.1) — If X encodes a single character C, then:
(2.2.1.1) — If C is one of the characters in Table 75 [tab:format.escape.sequences], then the two characters shown as the corresponding escape sequence are appended to E.
(2.2.1.2) — Otherwise, if C is not U+0020 SPACE and
(2.2.1.2.1) — CE is UTF-8, UTF-16, or UTF-32 and C corresponds to a Unicode scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by UAX #44 of the Unicode Standard, or
(2.2.1.2.2) — CE is UTF-8, UTF-16, or UTF-32 and C corresponds to a Unicode scalar value with the Unicode property Grapheme_Extend=Yes as described by UAX #44 of the Unicode Standard and C is not immediately preceded in S by a character P appended to E without translation to an escape sequence, or
(2.2.1.2.3) — CE is neither UTF-8, UTF-16, nor UTF-32 and C is one of an implementation-defined set of separator or non-printable characters
then the sequence \u{hex-digit-sequence} is appended to E, where hex-digit-sequence is the shortest hexadecimal representation of C using lower-case hexadecimal digits.
(2.2.1.3) — Otherwise, C is appended to E.
The example is not consistent with the above specification for the final code point. U+FE0F is a single character, is not one of the characters in Table 75, is not U+0020, has a General_Category of Nonspacing Mark (Mn), which is neither Z nor C, has Grapheme_Extend=Yes but the prior character (U+2642) is not formatted as an escape sequence, and is not one of an implementation-defined set of separator or non-printable characters (for the purposes of this example; the example assumes a UTF-8 encoding). Thus, formatting for this character falls to the last bullet point and the character should be appended as is (without translation to an escape sequence). Since this character is a combining character, it should combine with the previous character and thus alter the appearance of U+2642 (thus producing "♂️" instead of "♂\u{fe0f}").
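The reasoning above can be checked with a small sketch of bullets 2.2.1.1 through 2.2.1.3 for the Unicode-encoding case. Everything here is an illustrative assumption, not library code: the function names are hypothetical, the input is taken as already-decoded code points, and the two property lookups cover only the code points of this example (a real implementation would consult the full Unicode Character Database). The sketch produces only the escaped contents, without the surrounding quotation marks.

```cpp
#include <cstdint>
#include <string>

// Hypothetical lookup: General_Category in group Separator (Z) or Other (C).
// Covers only this example: U+200D ZERO WIDTH JOINER is Cf.
bool in_separator_or_other_group(char32_t c) { return c == 0x200D; }

// Hypothetical lookup: Grapheme_Extend=Yes.
// Covers only this example: U+FE0F VARIATION SELECTOR-16.
bool has_grapheme_extend(char32_t c) { return c == 0xFE0F; }

// Two-character escapes from Table 75, or nullptr if C is not in the table.
const char32_t* table75_escape(char32_t c) {
    switch (c) {
        case U'\t': return U"\\t";
        case U'\n': return U"\\n";
        case U'\r': return U"\\r";
        case U'"':  return U"\\\"";
        case U'\\': return U"\\\\";
        default:    return nullptr;
    }
}

// \u{hex-digit-sequence} using the shortest lower-case hexadecimal form.
std::u32string hex_escape(char32_t c) {
    static const char digits[] = "0123456789abcdef";
    std::u32string hex;
    std::uint32_t v = c;
    do {
        hex.insert(hex.begin(), static_cast<char32_t>(digits[v & 0xF]));
        v >>= 4;
    } while (v != 0);
    return U"\\u{" + hex + U"}";
}

std::u32string escape_sketch(const std::u32string& s) {
    std::u32string e;
    bool prev_unescaped = false;  // was the preceding character appended as is?
    for (char32_t c : s) {
        bool appended_as_is = false;
        if (const char32_t* two = table75_escape(c)) {
            e += two;                                   // 2.2.1.1
        } else if (c != U' ' &&
                   (in_separator_or_other_group(c) ||   // 2.2.1.2.1
                    (has_grapheme_extend(c) && !prev_unescaped))) {  // 2.2.1.2.2
            e += hex_escape(c);
        } else {
            e.push_back(c);                             // 2.2.1.3
            appended_as_is = true;
        }
        prev_unescaped = appended_as_is;
    }
    return e;
}
```

Under these assumptions, the sequence U+1F937 U+200D U+2642 U+FE0F leaves U+FE0F unescaped because U+2642 was appended as is, whereas U+FE0F directly after an escaped U+200D would itself be escaped.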
[2023-10-27; Reflector poll]
Set status to Tentatively Ready after six votes in favour during reflector poll.
[2023-11-11 Approved at November 2023 meeting in Kona. Status changed: Voting → WP.]
Proposed resolution:
This wording is relative to N4950 plus missing editorial pieces from P2286R8.
Modify the example following 28.5.6.5 [format.string.escaped] p3 as indicated:
[Drafting note: The presented example was voted in as part of P2286R8 during the July 2022 Virtual Meeting but is not yet accessible in the most recent working draft N4950.
Note that the final character (♂️) is composed from the two code points U+2642 and U+FE0F. ]
- string s6 = format("[{:?}]", "🤷♂️"); // s6 has value: ["🤷\u{200d}♂\u{fe0f}"]
+ string s6 = format("[{:?}]", "🤷♂️"); // s6 has value: ["🤷\u{200d}♂️"]