Issue 76: Can a codecvt facet always convert one internal character at a time?

This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of CD1 status.

76. Can a `codecvt` facet always convert one internal character at a time?

Section: 28.3.4.2.5 [locale.codecvt] Status: CD1 Submitter: Matt Austern Opened: 1998-09-25 Last modified: 2016-01-28

Priority: Not Prioritized

View all other issues in [locale.codecvt].

View all issues with CD1 status.

Discussion:

This issue concerns the requirements on classes derived from codecvt, including user-defined classes. What are the restrictions on the conversion from external characters (e.g. char) to internal characters (e.g. wchar_t)? Or, alternatively, what assumptions about codecvt facets can the I/O library make?

The question is whether it's possible to convert from internal characters to external characters one internal character at a time, and whether, given a valid sequence of external characters, it's possible to pick off internal characters one at a time. Or, to put it differently: given a sequence of external characters and the corresponding sequence of internal characters, does a position in the internal sequence correspond to some position in the external sequence?

To make this concrete, suppose that [first, last) is a sequence of M external characters and that [ifirst, ilast) is the corresponding sequence of N internal characters, where N > 1. That is, my_encoding.in(), applied to [first, last), yields [ifirst, ilast). Now the question: does there necessarily exist a subsequence of external characters, [first, last_1), such that the corresponding sequence of internal characters is the single character *ifirst?

(What a "no" answer would mean is that my_encoding translates sequences only as blocks. There's a sequence of M external characters that maps to a sequence of N internal characters, but that external sequence has no subsequence that maps to N-1 internal characters.)

Some of the wording in the standard, such as the description of codecvt::do_max_length (28.3.4.2.5.3 [locale.codecvt.virtuals], paragraph 11) and basic_filebuf::underflow (31.10.3.5 [filebuf.virtuals], paragraph 3) suggests that it must always be possible to pick off internal characters one at a time from a sequence of external characters. However, this is never explicitly stated one way or the other.

This issue seems (and is) quite technical, but it is important if we expect users to provide their own encoding facets. This is an area where the standard library calls user-supplied code, so a well-defined set of requirements for the user-supplied code is crucial. Users must be aware of the assumptions that the library makes. This issue affects positioning operations on basic_filebuf, unbuffered input, and several of codecvt's member functions.

Proposed resolution:

Add the following text as a new paragraph, following 28.3.4.2.5.3 [locale.codecvt.virtuals] paragraph 2:

A codecvt facet that is used by basic_filebuf (31.10 [file.streams]) must have the property that if
    do_out(state, from, from_end, from_next, to, to_lim, to_next)
would return ok, where from != from_end, then
    do_out(state, from, from + 1, from_next, to, to_end, to_next)
must also return ok, and that if
    do_in(state, from, from_end, from_next, to, to_lim, to_next)
would return ok, where to != to_lim, then
    do_in(state, from, from_end, from_next, to, to + 1, to_next)
must also return ok. [Footnote: Informally, this means that basic_filebuf assumes that the mapping from internal to external characters is 1 to N: a codecvt that is used by basic_filebuf must be able to translate characters one internal character at a time. --End Footnote]

[Redmond: Minor change in proposed resolution. Original proposed resolution talked about "success", with a parenthetical comment that success meant returning ok. New wording removes all talk about "success", and just talks about the return value.]

Rationale:

The proposed resoluion says that conversions can be performed one internal character at a time. This rules out some encodings that would otherwise be legal. The alternative answer would mean there would be some internal positions that do not correspond to any external file position.

An example of an encoding that this rules out is one where the internT and externT are of the same type, and where the internal sequence c1 c2 corresponds to the external sequence c2 c1.

It was generally agreed that basic_filebuf relies on this property: it was designed under the assumption that the external-to-internal mapping is N-to-1, and it is not clear that basic_filebuf is implementable without that restriction.

The proposed resolution is expressed as a restriction on codecvt when used by basic_filebuf, rather than a blanket restriction on all codecvt facets, because basic_filebuf is the only other part of the library that uses codecvt. If a user wants to define a codecvt facet that implements a more general N-to-M mapping, there is no reason to prohibit it, so long as the user does not expect basic_filebuf to be able to use it.

76. Can a codecvt facet always convert one internal character at a time?

76. Can a `codecvt` facet always convert one internal character at a time?