Issue 305: Default behavior of codecvt<wchar_t, char, mbstate_t>::length()

This page is a snapshot from the LWG issues list; see the Library Active Issues List for more information and the meaning of CD1 status.

305. Default behavior of codecvt<wchar_t, char, mbstate_t>::length()

Section: 28.3.4.2.6 [locale.codecvt.byname] Status: CD1 Submitter: Howard Hinnant Opened: 2001-01-24 Last modified: 2016-01-28

Priority: Not Prioritized

Discussion:

22.2.1.5/3 introduces codecvt in part with:

codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters. Instantiations on mbstate_t perform conversion between encodings known to the library implementor.

But 22.2.1.5.2/10 describes do_length in part with:

... codecvt<wchar_t, char, mbstate_t> ... return(s) the lesser of max and (from_end-from).

The semantics of do_in and do_length are linked: what one does must be consistent with what the other does. 22.2.1.5/3 leads me to believe that the vendor is allowed to choose the algorithm that codecvt<wchar_t,char,mbstate_t>::do_in performs so that it makes his customers happy on a given platform. But 22.2.1.5.2/10 explicitly says what codecvt<wchar_t,char,mbstate_t>::do_length must return, and thus indirectly specifies the algorithm that codecvt<wchar_t,char,mbstate_t>::do_in must perform. I believe that this is not what was intended and is a defect.
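The link between do_in and do_length can be illustrated with a minimal sketch. It assumes the codecvt<wchar_t,char,mbstate_t> facet of the classic "C" locale and plain ASCII input; the helper names (bytes_consumed_by_in, bytes_reported_by_length, demo_linked_semantics) are hypothetical and not part of the standard. The sketch only checks that the two functions agree with each other, not which encoding the implementation chose.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_facet_t;

// Hypothetical helper: number of external bytes that in() consumes when the
// destination can hold at most max_chars wide characters.
std::size_t bytes_consumed_by_in(const std::string& src, std::size_t max_chars) {
    const wide_facet_t& cvt = std::use_facet<wide_facet_t>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    std::vector<wchar_t> out(max_chars + 1);  // +1 so data() is never a zero-sized buffer
    const char* from = src.data();
    const char* from_next = from;
    wchar_t* to_next = out.data();
    cvt.in(state, from, from + src.size(), from_next,
           out.data(), out.data() + max_chars, to_next);
    return static_cast<std::size_t>(from_next - from);
}

// Hypothetical helper: number of external bytes that length() reports for
// the same input and the same limit of max_chars internal characters.
std::size_t bytes_reported_by_length(const std::string& src, std::size_t max_chars) {
    const wide_facet_t& cvt = std::use_facet<wide_facet_t>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    const char* from = src.data();
    return static_cast<std::size_t>(
        cvt.length(state, from, from + src.size(), max_chars));
}

// Usage: whatever encoding the vendor picked, the two answers should match.
void demo_linked_semantics() {
    assert(bytes_consumed_by_in("hello", 3) == bytes_reported_by_length("hello", 3));
}
```

If do_length were pinned to the lesser of max and (from_end-from) while do_in performed a genuine multibyte conversion, the two helpers could disagree on multibyte input, which is the inconsistency the issue describes.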

Discussion from the -lib reflector:
This proposal would have the effect of making the semantics of all of the virtual functions in codecvt<wchar_t, char, mbstate_t> implementation specified. Is that what we want, or do we want to mandate specific behavior for the base class virtuals and leave the implementation specified behavior for the codecvt_byname derived class? The tradeoff is that the former allows implementors to write a base class that actually does something useful, while the latter gives users a way to get known and specified (albeit useless) behavior, and is consistent with the way the standard handles other facets. It is not clear what the original intention was.

Nathan has suggested a compromise: a character that is a widened version of the characters in the basic execution character set must be converted to a one-byte sequence, but there is no such requirement for characters that are not part of the basic execution character set.

Proposed resolution:

Change 22.2.1.5.2/5 from:

The instantiations required in Table 51 (lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and codecvt<char,char,mbstate_t>, store no characters. Stores no more than (to_limit-to) destination elements. It always leaves the to_next pointer pointing one beyond the last element successfully stored.

to:

Stores no more than (to_limit-to) destination elements, and leaves the to_next pointer pointing one beyond the last element successfully stored. codecvt<char,char,mbstate_t> stores no characters.
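As an illustration of the "stores no characters" wording for codecvt<char,char,mbstate_t>, the following sketch calls in() on the classic-locale facet and inspects the result. The struct InResult and the function run_char_in are hypothetical names introduced here; the '?' fill is only a marker to show that the destination is left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical record of what a single in() call did.
struct InResult {
    std::codecvt_base::result status;  // result code returned by in()
    std::size_t stored;                // to_next - to after the call
    std::string buffer;                // destination contents after the call
};

// Hypothetical helper: run codecvt<char,char,mbstate_t>::in on src.
InResult run_char_in(const std::string& src) {
    typedef std::codecvt<char, char, std::mbstate_t> facet_t;
    const facet_t& cvt = std::use_facet<facet_t>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    std::string buffer(src.size() + 1, '?');  // marker fill, never empty
    const char* from_next = 0;
    char* to = &buffer[0];
    char* to_next = 0;

    InResult r;
    r.status = cvt.in(state, src.data(), src.data() + src.size(), from_next,
                      to, to + buffer.size(), to_next);
    r.stored = static_cast<std::size_t>(to_next - to);
    r.buffer = buffer;
    return r;
}

// Usage: the identity facet reports noconv and writes nothing.
void demo_char_in() {
    InResult r = run_char_in("abc");
    assert(r.status == std::codecvt_base::noconv);
    assert(r.stored == 0);
    assert(r.buffer == "????");
}
```

Under the proposed wording only this char-to-char instantiation keeps the guarantee; the wchar_t instantiation is free to store converted characters.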

Change 22.2.1.5.2/10 from:

-10- Returns: (from_next-from) where from_next is the largest value in the range [from,from_end] such that the sequence of values in the range [from,from_next) represents max or fewer valid complete characters of type internT. The instantiations required in Table 51 (21.1.1.1.1), namely codecvt<wchar_t, char, mbstate_t> and codecvt<char, char, mbstate_t>, return the lesser of max and (from_end-from).

to:

-10- Returns: (from_next-from) where from_next is the largest value in the range [from,from_end] such that the sequence of values in the range [from,from_next) represents max or fewer valid complete characters of type internT. The instantiation codecvt<char, char, mbstate_t> returns the lesser of max and (from_end-from).
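The guarantee that remains after this change, namely that codecvt<char, char, mbstate_t> returns the lesser of max and (from_end-from), can be checked with a short sketch. The helper name char_codecvt_length is hypothetical; the facet is taken from the classic locale.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: call codecvt<char,char,mbstate_t>::length on src with
// a limit of max internal characters.
int char_codecvt_length(const std::string& src, std::size_t max) {
    typedef std::codecvt<char, char, std::mbstate_t> facet_t;
    const facet_t& cvt = std::use_facet<facet_t>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    return cvt.length(state, src.data(), src.data() + src.size(), max);
}

// Usage: the result is min(max, from_end - from).
void demo_char_length() {
    assert(char_codecvt_length("hello", 3) == 3);   // limited by max
    assert(char_codecvt_length("hello", 10) == 5);  // limited by input size
}
```

No comparable fixed formula applies to codecvt<wchar_t, char, mbstate_t> once the resolution is adopted; its result depends on the encoding the implementation selects.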

[Redmond: Nathan suggested an alternative resolution: same as above, but require that, in the default encoding, a character from the basic execution character set would map to a single external character. The straw poll was 8-1 in favor of the proposed resolution.]

Rationale:

The default encoding should be whatever users of a given platform would expect to be the most natural. This varies from platform to platform. In many cases there is a preexisting C library, and users would expect the default encoding to be whatever C uses in the default "C" locale. We could impose a guarantee like the one Nathan suggested (a character from the basic execution character set must map to a single external character), but this would rule out important encodings that are in common use: it would rule out JIS, for example, and it would rule out a fixed-width encoding of UCS-4.

[Curaçao: fixed rationale typo at the request of Ichiro Koshida; "shift-JIS" changed to "JIS".]