Issue 4070: Transcoding by std::formatter<std::filesystem::path>

This page is a snapshot from the LWG issues list; see the Library Active Issues List for more information and the meaning of SG16 status.

4070. Transcoding by `std::formatter<std::filesystem::path>`

Section: 31.12.6.9.2 [fs.path.fmtr.funcs] Status: SG16 Submitter: Jonathan Wakely Opened: 2024-04-19 Last modified: 2025-06-16

Priority: 2

Discussion:

31.12.6.9.2 [fs.path.fmtr.funcs] says:

If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.

This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the ? option is used. Otherwise, the form of transcoding is completely implementation-defined. However, this makes no sense. An escaped string will have no ill-formed subsequences, because they will already have been replaced as per 28.5.6.5 [format.string.escaped]:

Otherwise (X is a sequence of ill-formed code units), each code unit U is appended to E in order as the sequence \x{hex-digit-sequence}, where hex-digit-sequence is the shortest hexadecimal representation of U using lower-case hexadecimal digits.

So only unescaped strings can have ill-formed sequences by the time we do transcoding to char, but whether or not any U+FFFD substitution occurs is just implementation-defined.

I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).
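To make the intended behavior for an unescaped path concrete, here is a minimal sketch of wide-to-UTF-8 transcoding with U+FFFD substitution. It assumes the native wide encoding is UTF-16 and uses char16_t in place of wchar_t so that it behaves the same on every platform. The names append_utf8, utf16_to_utf8_lossy and check_utf16_to_utf8_lossy are hypothetical and not part of any library, and the sketch assumes one replacement character per unpaired surrogate code unit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: append the UTF-8 encoding of a Unicode scalar value.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: transcode UTF-16 to UTF-8, replacing every unpaired
// surrogate with U+FFFD (assumption: one replacement per unpaired code unit).
inline std::string utf16_to_utf8_lossy(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (is_high_surrogate(u) && i + 1 < in.size() && is_low_surrogate(in[i + 1])) {
            // Well-formed surrogate pair: combine into one scalar value.
            const char32_t cp = 0x10000 + ((char32_t(u) - 0xD800) << 10)
                                        + (char32_t(in[i + 1]) - 0xDC00);
            append_utf8(out, cp);
            ++i;  // the low surrogate has been consumed as well
        } else if (is_high_surrogate(u) || is_low_surrogate(u)) {
            append_utf8(out, 0xFFFD);  // ill-formed: substitute
        } else {
            append_utf8(out, u);  // BMP code point, encoded directly
        }
    }
    return out;
}

// Usage: a lone high surrogate between 'a' and 'b' becomes U+FFFD (EF BF BD).
inline void check_utf16_to_utf8_lossy() {
    const std::u16string bad = {u'a', char16_t(0xD800), u'b'};
    assert(utf16_to_utf8_lossy(bad) == "a\xEF\xBF\xBD" "b");
}
```

Under these assumptions the result is always well-formed UTF-8, which is what the non-escaped case needs.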

It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as \x{hex-digit-sequence}. So an ill-formed sequence of two wchar_t values will be escaped as two \x{...} strings, which are then transcoded to UTF-8. If we transcode (with substitutions) first then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as \x{fffd}. SG16 should be asked to confirm that escaping first is intended, so that an escaped string shows the original invalid code units. For a non-escaped string, we want the ill-formed sequence to be formatted as �, which the proposed resolution tries to ensure.
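The effect of the ordering on what the reader can recover from the output can be sketched as follows. This is a simplified model, not the standard's algorithm: it assumes UTF-16 in char16_t, fuses escaping and transcoding into one pass, handles only ill-formed code units (no quoting or control-character escapes), and treats each unpaired surrogate separately. The names IllFormedPolicy, put_utf8 and format_units are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical switch between the two orderings discussed above.
enum class IllFormedPolicy { escape_first, substitute_first };

// Hypothetical helper: append the UTF-8 encoding of a Unicode scalar value.
inline void put_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: produce UTF-8 from UTF-16 code units, treating each
// unpaired surrogate according to the chosen policy.
inline std::string format_units(std::u16string_view in, IllFormedPolicy policy) {
    auto is_high = [](char16_t u) { return u >= 0xD800 && u <= 0xDBFF; };
    auto is_low = [](char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; };
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (is_high(u) && i + 1 < in.size() && is_low(in[i + 1])) {
            const char32_t cp = 0x10000 + ((char32_t(u) - 0xD800) << 10)
                                        + (char32_t(in[i + 1]) - 0xDC00);
            put_utf8(out, cp);
            ++i;  // skip the low surrogate that completed the pair
        } else if (is_high(u) || is_low(u)) {
            if (policy == IllFormedPolicy::escape_first) {
                // Escaping sees the raw code unit and prints it as \x{...}
                // using the shortest lower-case hexadecimal form.
                char buf[16];
                std::snprintf(buf, sizeof buf, "\\x{%x}", static_cast<unsigned>(u));
                out += buf;
            } else {
                // Substitution happens first, so the original value is lost.
                put_utf8(out, 0xFFFD);
            }
        } else {
            put_utf8(out, u);
        }
    }
    return out;
}

// Usage: the same input yields different output depending on the order.
inline void check_format_units() {
    const std::u16string in = {u'a', char16_t(0xDC00), u'b'};
    assert(format_units(in, IllFormedPolicy::escape_first) == "a\\x{dc00}b");
    assert(format_units(in, IllFormedPolicy::substitute_first) == "a\xEF\xBF\xBD" "b");
}
```

With escape_first the invalid code unit value remains visible in the output; with substitute_first only the replacement character survives.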

[2024-05-08; Reflector poll]

Set priority to 2 after reflector poll.

Previous resolution [SUPERSEDED]:

This wording is relative to N4981.
Modify 31.12.6.9.2 [fs.path.fmtr.funcs] as indicated:
```
template<class FormatContext>
  typename FormatContext::iterator
    format(const filesystem::path& p, FormatContext& ctx) const;
```
-5- Effects: Let s be p.generic_string<filesystem::path::value_type>() if the g option is used, otherwise p.native(). Writes s into ctx.out(), adjusted according to the path-format-spec. If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the ~~escaped path~~ (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If charT and path::value_type are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
Modify the entry in the index of implementation-defined behavior as indicated:
transcoding of a formatted path when charT and path::value_type differ and not converting from wchar_t to UTF-8

[2025-06-11; SG16 comments and improves wording]

The "and not converting from wchar_t to UTF-8" wording added in the index of implementation-defined behavior by the current proposed resolution should be changed to "and the literal encoding is not UTF-8".

It was noted that "the literal encoding" is ambiguous in both the normative wording in 31.12.6.9.2 [fs.path.fmtr.funcs] p5 and in the new wording quoted above. In both cases, the intent is to refer to the "ordinary literal encoding". However, some SG16 participants were reluctant to include a drive-by fix with the proposed resolution for this issue since the ambiguous literal encoding reference is a pre-existing and separable issue. Those same SG16 participants were more concerned that the same wording was used in both 31.12.6.9.2 [fs.path.fmtr.funcs] p5 and in the corresponding entry of the implementation-defined behavior index. I would defer to the LWG chair to decide whether to address this as an additional related clarification with this change or as a separate editorial or LWG issue.

The minimal change is to replace "and not converting from wchar_t to UTF-8" with "and the literal encoding is not UTF-8". The optional change is to insert "ordinary" before "literal encoding" as well. Once that is done, I'll have SG16 confirm they are content with the new proposed resolution.

Proposed resolution:

This wording is relative to N5008.

Modify 31.12.6.9.2 [fs.path.fmtr.funcs] as indicated:
```
template<class FormatContext>
  typename FormatContext::iterator
    format(const filesystem::path& p, FormatContext& ctx) const;
```
-5- Effects: Let s be p.generic_string<filesystem::path::value_type>() if the g option is used, otherwise p.native(). Writes s into ctx.out(), adjusted according to the path-format-spec. If charT is char, path::value_type is wchar_t, and the ordinary literal encoding is UTF-8, then the ~~escaped path~~ (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If charT and path::value_type are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
Modify the entry in the index of implementation-defined behavior as indicated:
transcoding of a formatted path when charT and path::value_type differ and the ordinary literal encoding is not UTF-8
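
The three cases distinguished by the proposed wording can be summarized as a small compile-time classification. This is only an illustrative sketch: the Transcoding enumeration and classify_transcoding are hypothetical names, and whether the ordinary literal encoding is UTF-8 is passed in as a plain bool rather than detected.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical labels for the three outcomes in the proposed wording.
enum class Transcoding { none, unicode_substitution, implementation_defined };

// Hypothetical helper: classify the transcoding for a formatter character
// type CharT and a path::value_type PathValueT.
template<class CharT, class PathValueT>
constexpr Transcoding classify_transcoding(bool ordinary_literal_encoding_is_utf8) {
    if constexpr (std::is_same_v<CharT, PathValueT>) {
        // Same character type: no transcoding is performed.
        return Transcoding::none;
    } else if constexpr (std::is_same_v<CharT, char> &&
                         std::is_same_v<PathValueT, wchar_t>) {
        // wchar_t to char: specified only when the target is UTF-8.
        return ordinary_literal_encoding_is_utf8
                   ? Transcoding::unicode_substitution
                   : Transcoding::implementation_defined;
    } else {
        // Every other combination is left to the implementation.
        return Transcoding::implementation_defined;
    }
}

// Usage: the Windows-like case (char output, wchar_t paths, UTF-8 literals).
static_assert(classify_transcoding<char, wchar_t>(true) ==
              Transcoding::unicode_substitution);
static_assert(classify_transcoding<char, char>(false) == Transcoding::none);
```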
