This page is a snapshot from the LWG issues list. See the Library Active Issues List for more information and the meaning of CD1 status.
Section: 30.4.2.6 [locale.codecvt.byname] Status: CD1 Submitter: Howard Hinnant Opened: 2001-01-24 Last modified: 2016-01-28
Priority: Not Prioritized
Discussion:
22.2.1.5/3 introduces codecvt in part with:
codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters. Instantiations on mbstate_t perform conversion between encodings known to the library implementor.
But 22.2.1.5.2/10 describes do_length in part with:
... codecvt<wchar_t, char, mbstate_t> ... return(s) the lesser of max and (from_end-from).
The semantics of do_in and do_length are linked. What one does must be consistent with what the other does. 22.2.1.5/3 leads me to believe that the vendor is allowed to choose the algorithm that codecvt<wchar_t,char,mbstate_t>::do_in performs so that it makes his customers happy on a given platform. But 22.2.1.5.2/10 explicitly says what codecvt<wchar_t,char,mbstate_t>::do_length must return. And thus indirectly specifies the algorithm that codecvt<wchar_t,char,mbstate_t>::do_in must perform. I believe that this is not what was intended and is a defect.
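The link between do_in and do_length can be sketched as follows. This is a minimal illustration, not normative text: the helper names consumed_by_in and reported_by_length are hypothetical, and the sketch assumes the classic locale and plain ASCII input, for which common implementations convert one external char per wide character. Whatever encoding the vendor chooses, the count reported by length should match the number of external chars that in consumes when given room for at most max wide characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_facet;

// Hypothetical helper: how many external chars in() consumes when the
// destination has room for at most `max` wide characters.
std::size_t consumed_by_in(const std::string& ext, std::size_t max) {
    const wide_facet& cvt = std::use_facet<wide_facet>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    std::vector<wchar_t> out(max + 1);  // +1 keeps the buffer non-empty
    const char* from = ext.data();
    const char* from_next = from;
    wchar_t* to_next = out.data();
    cvt.in(state, from, from + ext.size(), from_next,
           out.data(), out.data() + max, to_next);
    return static_cast<std::size_t>(from_next - from);
}

// Hypothetical helper: what length() reports for the same input and limit.
std::size_t reported_by_length(const std::string& ext, std::size_t max) {
    const wide_facet& cvt = std::use_facet<wide_facet>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    const char* from = ext.data();
    return static_cast<std::size_t>(
        cvt.length(state, from, from + ext.size(), max));
}
```

If do_length were fixed by the standard while do_in were left to the vendor, the two helpers could disagree for any encoding that is not one char per wide character.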
Discussion from the -lib reflector:
This proposal would have the effect of making the semantics of all of the virtual functions in codecvt<wchar_t, char, mbstate_t> implementation specified. Is that what we want, or do we want to mandate specific behavior for the base class virtuals and leave the implementation specified behavior for the codecvt_byname derived class? The tradeoff is that the former allows implementors to write a base class that actually does something useful, while the latter gives users a way to get known and specified, albeit useless, behavior, and is consistent with the way the standard handles other facets. It is not clear what the original intention was.
Nathan has suggested a compromise: a character that is a widened version of the characters in the basic execution character set must be converted to a one-byte sequence, but there is no such requirement for characters that are not part of the basic execution character set.
Proposed resolution:
Change 22.2.1.5.2/5 from:
The instantiations required in Table 51 (lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and codecvt<char,char,mbstate_t>, store no characters. Stores no more than (to_limit-to) destination elements. It always leaves the to_next pointer pointing one beyond the last element successfully stored.
to:
Stores no more than (to_limit-to) destination elements, and leaves the to_next pointer pointing one beyond the last element successfully stored. codecvt<char,char,mbstate_t> stores no characters.
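The guarantee that remains for codecvt<char,char,mbstate_t> can be checked with a short sketch. It assumes the paragraph being amended is the Effects clause of do_unshift (the names to, to_limit, and to_next in the wording match that function's parameters); the helper name chars_stored_by_unshift is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical helper: number of chars unshift() stores for the
// codecvt<char, char, mbstate_t> facet of the classic locale.
std::size_t chars_stored_by_unshift() {
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    char buf[8] = {'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x'};
    char* to_next = buf;
    cvt.unshift(state, buf, buf + 8, to_next);
    // to_next points one past the last element stored, so the
    // difference is the number of elements written.
    return static_cast<std::size_t>(to_next - buf);
}
```

After the change, no such check is guaranteed to pass for codecvt<wchar_t,char,mbstate_t>, since a stateful default encoding may need to emit a terminating shift sequence.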
Change 22.2.1.5.2/10 from:
-10- Returns: (from_next-from) where from_next is the largest value in the range [from,from_end] such that the sequence of values in the range [from,from_next) represents max or fewer valid complete characters of type internT. The instantiations required in Table 51 (21.1.1.1.1), namely codecvt<wchar_t, char, mbstate_t> and codecvt<char, char, mbstate_t>, return the lesser of max and (from_end-from).
to:
-10- Returns: (from_next-from) where from_next is the largest value in the range [from,from_end] such that the sequence of values in the range [from,from_next) represents max or fewer valid complete characters of type internT. The instantiation codecvt<char, char, mbstate_t> returns the lesser of max and (from_end-from).
[Redmond: Nathan suggested an alternative resolution: same as above, but require that, in the default encoding, a character from the basic execution character set would map to a single external character. The straw poll was 8-1 in favor of the proposed resolution.]
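The revised Returns clause still pins down length for codecvt<char, char, mbstate_t>. A minimal sketch of that remaining guarantee, using a hypothetical helper named narrow_length and the classic locale:

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: calls length() on the codecvt<char, char, mbstate_t>
// facet, which is specified to return the lesser of max and (from_end - from).
int narrow_length(const std::string& ext, std::size_t max) {
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    const char* from = ext.data();
    return cvt.length(state, from, from + ext.size(), max);
}
```

For a five-char input, narrow_length returns 3 when max is 3 and 5 when max is 10. The same call on codecvt<wchar_t, char, mbstate_t> no longer has a mandated result and depends on the default encoding the implementation chooses.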
Rationale:
The default encoding should be whatever users of a given platform would expect to be the most natural. This varies from platform to platform. In many cases there is a preexisting C library, and users would expect the default encoding to be whatever C uses in the default "C" locale. We could impose a guarantee like the one Nathan suggested (a character from the basic execution character set must map to a single external character), but this would rule out important encodings that are in common use: it would rule out JIS, for example, and it would rule out a fixed-width encoding of UCS-4.
[Curaçao: fixed rationale typo at the request of Ichiro Koshida; "shift-JIS" changed to "JIS".]
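The UCS-4 point in the rationale can be illustrated with a small sketch. The function encode_ucs4_be is hypothetical, and big-endian byte order is an assumption made only for the example. In a fixed-width UCS-4 external encoding every character, including one from the basic execution character set such as 'A', occupies four external chars, so a one-char guarantee could not be met.

```cpp
#include <array>
#include <cassert>

// Hypothetical encoder: writes one code point as four bytes,
// most significant byte first (byte order is an assumption).
std::array<unsigned char, 4> encode_ucs4_be(char32_t c) {
    std::array<unsigned char, 4> bytes;
    bytes[0] = static_cast<unsigned char>((c >> 24) & 0xFF);
    bytes[1] = static_cast<unsigned char>((c >> 16) & 0xFF);
    bytes[2] = static_cast<unsigned char>((c >> 8) & 0xFF);
    bytes[3] = static_cast<unsigned char>(c & 0xFF);
    return bytes;
}
```

Encoding U'A' this way yields the bytes 00 00 00 41: four external chars for a single basic character.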