This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of CD1 status.
Can a codecvt facet always convert one internal character at a time?
Section: 28.3.4.2.5 [locale.codecvt] Status: CD1 Submitter: Matt Austern Opened: 1998-09-25 Last modified: 2016-01-28
Priority: Not Prioritized
Discussion:
This issue concerns the requirements on classes derived from codecvt, including user-defined classes. What are the restrictions on the conversion from external characters (e.g. char) to internal characters (e.g. wchar_t)? Or, alternatively, what assumptions about codecvt facets can the I/O library make?
The question is whether it's possible to convert from internal characters to external characters one internal character at a time, and whether, given a valid sequence of external characters, it's possible to pick off internal characters one at a time. Or, to put it differently: given a sequence of external characters and the corresponding sequence of internal characters, does a position in the internal sequence correspond to some position in the external sequence?
To make this concrete, suppose that [first, last) is a sequence of M external characters and that [ifirst, ilast) is the corresponding sequence of N internal characters, where N > 1. That is, my_encoding.in(), applied to [first, last), yields [ifirst, ilast). Now the question: does there necessarily exist a subsequence of external characters, [first, last_1), such that the corresponding sequence of internal characters is the single character *ifirst?
(What a "no" answer would mean is that my_encoding translates sequences only as blocks. There's a sequence of M external characters that maps to a sequence of N internal characters, but that external sequence has no subsequence that maps to N-1 internal characters.)
Some of the wording in the standard, such as the description of codecvt::do_max_length (28.3.4.2.5.3 [locale.codecvt.virtuals], paragraph 11) and basic_filebuf::underflow (31.10.3.5 [filebuf.virtuals], paragraph 3) suggests that it must always be possible to pick off internal characters one at a time from a sequence of external characters. However, this is never explicitly stated one way or the other.
This issue seems (and is) quite technical, but it is important if we expect users to provide their own encoding facets. This is an area where the standard library calls user-supplied code, so a well-defined set of requirements for the user-supplied code is crucial. Users must be aware of the assumptions that the library makes. This issue affects positioning operations on basic_filebuf, unbuffered input, and several of codecvt's member functions.
Proposed resolution:
Add the following text as a new paragraph, following 28.3.4.2.5.3 [locale.codecvt.virtuals] paragraph 2:
A codecvt facet that is used by basic_filebuf (31.10 [file.streams]) must have the property that if do_out(state, from, from_end, from_next, to, to_lim, to_next) would return ok, where from != from_end, then do_out(state, from, from + 1, from_next, to, to_end, to_next) must also return ok, and that if do_in(state, from, from_end, from_next, to, to_lim, to_next) would return ok, where to != to_lim, then do_in(state, from, from_end, from_next, to, to + 1, to_next) must also return ok. [Footnote: Informally, this means that basic_filebuf assumes that the mapping from internal to external characters is 1 to N: a codecvt that is used by basic_filebuf must be able to translate characters one internal character at a time. --End Footnote]
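The do_out half of this property can be spot-checked for a particular facet and input. The sketch below is only a test aid under stated assumptions: the helper name out_converts_first_char_alone is hypothetical, the 64-element scratch buffer is assumed to be large enough for the inputs tried, and a passing check on one input says nothing about other inputs.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

// Hypothetical helper (not a standard function): if out() returns ok for the
// whole internal range [from, from_end), check that it also returns ok for
// the one-character range [from, from + 1).
template <class Facet>
bool out_converts_first_char_alone(const Facet& cvt,
                                   const typename Facet::intern_type* from,
                                   const typename Facet::intern_type* from_end) {
    typedef typename Facet::intern_type intern_type;
    typedef typename Facet::extern_type extern_type;
    typedef typename Facet::state_type state_type;

    assert(from <= from_end);

    extern_type buf[64];  // assumption: large enough for the inputs tried
    const intern_type* from_next = 0;
    extern_type* to_next = 0;

    state_type whole_state = state_type();
    std::codecvt_base::result whole =
        cvt.out(whole_state, from, from_end, from_next, buf, buf + 64, to_next);

    // The property only constrains the facet when the full conversion
    // returns ok and the input is non-empty.
    if (whole != std::codecvt_base::ok || from == from_end) return true;

    state_type one_state = state_type();
    std::codecvt_base::result one =
        cvt.out(one_state, from, from + 1, from_next, buf, buf + 64, to_next);
    return one == std::codecvt_base::ok;
}

// Example use with the classic locale's wide-character facet.
bool classic_wide_facet_passes() {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());
    const wchar_t text[] = L"abc";
    return out_converts_first_char_alone(cvt, text, text + 3);
}
```

A user-defined facet could be passed to the same helper before it is imbued into a stream that uses basic_filebuf.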
[Redmond: Minor change in proposed resolution. Original proposed resolution talked about "success", with a parenthetical comment that success meant returning ok. New wording removes all talk about "success", and just talks about the return value.]
Rationale:
The proposed resolution says that conversions can be performed one internal character at a time. This rules out some encodings that would otherwise be legal. The alternative answer would mean there would be some internal positions that do not correspond to any external file position.
An example of an encoding that this rules out is one where the internT and externT are of the same type, and where the internal sequence c1 c2 corresponds to the external sequence c2 c1.
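A minimal sketch of such an encoding, assuming both character types are char, is given here. The class pair_swap_codecvt and the helper convert_in_with_room are hypothetical names introduced for illustration. The facet converts only in pairs, so a request for a single internal character cannot return ok.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical facet: internal "c1 c2" corresponds to external "c2 c1".
struct pair_swap_codecvt : std::codecvt<char, char, std::mbstate_t> {
protected:
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, char* to, char* to_lim,
                 char*& to_next) const override {
        return swap_pairs(from, from_end, from_next, to, to_lim, to_next);
    }
    result do_out(std::mbstate_t&, const char* from, const char* from_end,
                  const char*& from_next, char* to, char* to_lim,
                  char*& to_next) const override {
        return swap_pairs(from, from_end, from_next, to, to_lim, to_next);
    }
    bool do_always_noconv() const noexcept override { return false; }

private:
    // Converts whole pairs only; a lone character or a one-slot output
    // range leaves the conversion incomplete.
    static result swap_pairs(const char* from, const char* from_end,
                             const char*& from_next, char* to, char* to_lim,
                             char*& to_next) {
        from_next = from;
        to_next = to;
        while (from_end - from_next >= 2) {
            if (to_lim - to_next < 2) return partial;  // no room for a pair
            to_next[0] = from_next[1];
            to_next[1] = from_next[0];
            from_next += 2;
            to_next += 2;
        }
        return from_next == from_end ? ok : partial;
    }
};

// Converts the external pair "ba" with room for `room` internal characters
// (room must be 0, 1, or 2) and returns what in() reports.
std::codecvt_base::result convert_in_with_room(std::size_t room) {
    assert(room <= 2);
    pair_swap_codecvt cvt;
    std::mbstate_t state = std::mbstate_t();
    const char ext[] = {'b', 'a'};
    char in[2] = {};
    const char* from_next = 0;
    char* to_next = 0;
    return cvt.in(state, ext, ext + 2, from_next, in, in + room, to_next);
}
```

With room for two internal characters the conversion returns ok, but with room for one it returns partial, which is exactly the situation the proposed resolution forbids for facets used by basic_filebuf.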
It was generally agreed that basic_filebuf relies on this property: it was designed under the assumption that the external-to-internal mapping is N-to-1, and it is not clear that basic_filebuf is implementable without that restriction.
The proposed resolution is expressed as a restriction on codecvt when used by basic_filebuf, rather than a blanket restriction on all codecvt facets, because basic_filebuf is the only other part of the library that uses codecvt. If a user wants to define a codecvt facet that implements a more general N-to-M mapping, there is no reason to prohibit it, so long as the user does not expect basic_filebuf to be able to use it.