This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of Open status.

4070. Transcoding by std::formatter<std::filesystem::path>

Section: 31.12.6.9.2 [fs.path.fmtr.funcs] Status: Open Submitter: Jonathan Wakely Opened: 2024-04-19 Last modified: 2025-09-12

Priority: 2

Discussion:

31.12.6.9.2 [fs.path.fmtr.funcs] says:

If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.

This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the ? option is used. Otherwise, the form of transcoding is completely implementation-defined. However, this makes no sense. An escaped string will have no ill-formed subsequences, because they will already have been replaced as per 28.5.6.5 [format.string.escaped]:

Otherwise (X is a sequence of ill-formed code units), each code unit U is appended to E in order as the sequence \x{hex-digit-sequence}, where hex-digit-sequence is the shortest hexadecimal representation of U using lower-case hexadecimal digits.

So only unescaped strings can have ill-formed sequences by the time we do transcoding to char, but whether or not any U+FFFD substitution occurs is just implementation-defined.

I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).

It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as \x{hex-digit-sequence}. So an ill-formed sequence of two wchar_t values will be escaped as two \x{...} strings, which are then transcoded to UTF-8. If we transcode (with substitutions) first then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as \x{fffd}. SG16 should be asked to confirm that escaping first is intended, so that an escaped string shows the original invalid code units. For a non-escaped string, we want the ill-formed sequence to be formatted as �, which the proposed resolution tries to ensure.
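The difference between the two orderings can be illustrated with a small self-contained sketch. It is not the library implementation: `append_utf8`, `to_utf8_lossy` and `escape_ill_formed` are hypothetical helpers, `char16_t` stands in for a 16-bit `wchar_t`, and the escaping step is simplified to handle only ill-formed code units (no quotes, control characters or other escapes). The sketch also assumes that each unpaired surrogate is substituted by its own U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append a Unicode scalar value to out as UTF-8.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

inline bool is_high(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: transcode UTF-16 to UTF-8. Each unpaired surrogate
// is replaced by U+FFFD (an assumption of this sketch).
inline std::string to_utf8_lossy(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t u = in[i];
        if (is_high(u) && i + 1 < in.size() && is_low(in[i + 1])) {
            // Well-formed surrogate pair: combine into one scalar value.
            char32_t cp = 0x10000 + ((char32_t(u) - 0xD800) << 10)
                                  + (char32_t(in[i + 1]) - 0xDC00);
            append_utf8(out, cp);
            ++i;
        } else if (is_high(u) || is_low(u)) {
            append_utf8(out, 0xFFFD);  // ill-formed: substitute
        } else {
            append_utf8(out, u);
        }
    }
    return out;
}

// Hypothetical helper: escape only ill-formed code units as \x{hex},
// using lower-case hexadecimal digits. Everything else is copied as-is.
inline std::u16string escape_ill_formed(const std::u16string& in) {
    static const char16_t digits[] = u"0123456789abcdef";
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t u = in[i];
        if (is_high(u) && i + 1 < in.size() && is_low(in[i + 1])) {
            out += u;
            out += in[++i];
        } else if (is_high(u) || is_low(u)) {
            out += u"\\x{";
            // Surrogates are 0xd800..0xdfff, so four digits is the shortest form.
            for (int shift = 12; shift >= 0; shift -= 4)
                out += digits[(u >> shift) & 0xF];
            out += u'}';
        } else {
            out += u;
        }
    }
    return out;
}
```

For an input consisting of `a`, the lone code unit 0xD800, and `b`, escaping first and then transcoding yields `a\x{d800}b`, so the original invalid code unit stays visible. Transcoding the unescaped input instead yields `a`, U+FFFD, `b`.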

[2024-05-08; Reflector poll]

Set priority to 2 after reflector poll.

Previous resolution [SUPERSEDED]:

This wording is relative to N4981.

  1. Modify 31.12.6.9.2 [fs.path.fmtr.funcs] as indicated:

    
    template<class FormatContext>
      typename FormatContext::iterator
        format(const filesystem::path& p, FormatContext& ctx) const;
    
    -5- Effects: Let s be p.generic_string<filesystem::path::value_type>() if the g option is used, otherwise p.native(). Writes s into ctx.out(), adjusted according to the path-format-spec. If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If charT and path::value_type are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
  2. Modify the entry in the index of implementation-defined behavior as indicated:
    transcoding of a formatted path when charT and path::value_type differ and not converting from wchar_t to UTF-8

[2025-06-11; SG16 comments and improves wording]

The "and not converting from wchar_t to UTF-8" wording added in the index of implementation-defined behavior by the current proposed resolution should be changed to "and the literal encoding is not UTF-8".

It was noted that "the literal encoding" is ambiguous in both the normative wording in 31.12.6.9.2 [fs.path.fmtr.funcs] p5 and in the new wording quoted above. In both cases, the intent is to refer to the "ordinary literal encoding". However, some SG16 participants were reluctant to include a drive-by fix with the proposed resolution for this issue since the ambiguous literal encoding reference is a pre-existing and separable issue. Those same SG16 participants were more concerned that the same wording was used in both 31.12.6.9.2 [fs.path.fmtr.funcs] p5 and in the corresponding entry of the implementation-defined behavior index. I would defer to the LWG chair to decide whether to address this as an additional related clarification with this change or as a separate editorial or LWG issue.

The minimal change is to replace "and not converting from wchar_t to UTF-8" with "and the literal encoding is not UTF-8". The optional change is to insert "ordinary" before "literal encoding" as well. Once that is done, I'll have SG16 confirm they are content with the new proposed resolution.

Previous resolution [SUPERSEDED]:

This wording is relative to N5008.

  1. Modify 31.12.6.9.2 [fs.path.fmtr.funcs] as indicated:

    
    template<class FormatContext>
      typename FormatContext::iterator
        format(const filesystem::path& p, FormatContext& ctx) const;
    

    -5- Effects: Let s be p.generic_string<filesystem::path::value_type>() if the g option is used, otherwise p.native(). Writes s into ctx.out(), adjusted according to the path-format-spec. If charT is char, path::value_type is wchar_t, and the ordinary literal encoding is UTF-8, then the (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If charT and path::value_type are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.

  2. Modify the entry in the index of implementation-defined behavior as indicated:
    transcoding of a formatted path when charT and path::value_type differ and the ordinary literal encoding is not UTF-8

[2025-07-30; SG16 meeting]

SG16 unanimously approved new wording produced during the discussion. The group concluded that the intended behavior would be best specified by introducing additional names to denote the sequence of transformations that produce the intended effect. Status → Open.

Proposed resolution:

This wording is relative to N5014.

  1. Modify 31.12.6.9.2 [fs.path.fmtr.funcs] as indicated:

    
    template<class FormatContext>
      typename FormatContext::iterator
        format(const filesystem::path& p, FormatContext& ctx) const;
    

    -5- Effects: Let s be p.generic_string<filesystem::path::value_type>() if the g option is used, otherwise p.native(). Let s2 be s adjusted according to the path-format-spec. Let s3 be defined as follows:

    1. (5.1) — If charT is char, path::value_type is wchar_t, and the ordinary literal encoding is UTF-8, s3 is the result of transcoding s2 from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion.
    2. (5.2) — If charT and path::value_type are the same, then s3 is the same as s2.
    3. (5.3) — Otherwise, s3 is the result of an implementation-defined transcoding of s2.
    Writes s3 into ctx.out().

    [Struck from the existing wording:] Writes s into ctx.out(), adjusted according to the path-format-spec. If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If charT and path::value_type are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.

  2. Modify the entry in the index of implementation-defined behavior as indicated:
    transcoding of a formatted path when charT and path::value_type differ and the ordinary literal encoding is not UTF-8
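
The case analysis in (5.1) to (5.3) of the proposed Effects can be sketched as a small compile-time dispatch. This is only an illustration of which branch applies for a given combination of character types. The `Transcoding` enum and `select_transcoding` function are hypothetical names, and whether the ordinary literal encoding is UTF-8 is passed in as a plain `bool` rather than detected.

```cpp
#include <type_traits>

// Hypothetical labels for the three ways s3 is derived from s2.
enum class Transcoding {
    unicode_substitution,   // (5.1): wide native encoding to UTF-8 with U+FFFD
    none,                   // (5.2): s3 is the same as s2
    implementation_defined  // (5.3): everything else
};

// Hypothetical helper: CharT is the formatter's charT, ValueT is
// path::value_type. The flag is an assumption supplied by the caller.
template <class CharT, class ValueT>
constexpr Transcoding select_transcoding(bool ordinary_literal_encoding_is_utf8) {
    if (std::is_same_v<CharT, char> && std::is_same_v<ValueT, wchar_t>
        && ordinary_literal_encoding_is_utf8) {
        return Transcoding::unicode_substitution;
    }
    if (std::is_same_v<CharT, ValueT>) {
        return Transcoding::none;
    }
    return Transcoding::implementation_defined;
}
```

Under this reading, formatting to char from a wchar_t path is only required to perform U+FFFD substitution when the ordinary literal encoding is UTF-8. The same type combination with a different literal encoding falls through to the implementation-defined case, which matches the revised index entry.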