Issue 303: Bitset input operator underspecified

This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of CD1 status.

303. Bitset input operator underspecified

Section: 22.9.4 [bitset.operators]

Status: CD1

Submitter: Matt Austern

Opened: 2001-02-05

Last modified: 2016-01-28

Priority: Not Prioritized

Discussion:

In 23.3.5.3, we are told that bitset's input operator "Extracts up to N (single-byte) characters from is.", where is is a stream of type basic_istream<charT, traits>.

The standard does not say what it means to extract single-byte characters from a stream whose character type, charT, is in general not a single-byte character type. Existing implementations differ.

A reasonable solution will probably involve widen() and/or narrow(), since they are the supplied mechanism for converting a single character between char and arbitrary charT.

Narrowing the input characters is not the same as widening the literals '0' and '1', because there may be some locales in which more than one wide character maps to the narrow character '0'. Narrowing means that alternate representations may be used for bitset input, widening means that they may not be.
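The difference between the two approaches can be sketched as a pair of predicates over a ctype facet. This is only an illustration: the function names is_bit_by_widen and is_bit_by_narrow are hypothetical and do not appear in the standard or in any implementation discussed here.

```cpp
#include <locale>

// Widening approach: compare the input character against the widened
// literals '0' and '1'. Only the charT values that widen() produces match.
template <class charT>
bool is_bit_by_widen(const std::ctype<charT>& ct, charT c) {
    return c == ct.widen('0') || c == ct.widen('1');
}

// Narrowing approach: map the input character to char first. Every charT
// value that the facet narrows to '0' or '1' is accepted, so alternate
// representations pass if the locale maps them that way.
template <class charT>
bool is_bit_by_narrow(const std::ctype<charT>& ct, charT c) {
    const char n = ct.narrow(c, '\0');  // '\0' is the fallback for unmappable characters
    return n == '0' || n == '1';
}
```

With the classic "C" locale the two predicates should agree on L'0', L'1' and L'2'. They can only diverge under a facet that narrows several wide characters to the same digit.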

Note that for numeric input, num_get<> (22.2.2.1.2/8) compares input characters to widened versions of narrow character literals.

From Pete Becker, in c++std-lib-8224:

Different writing systems can have different representations for the digits that represent 0 and 1. For example, in the Unicode representation of the Devanagari script (used in many of the Indic languages) the digit 0 is 0x0966, and the digit 1 is 0x0967. Calling narrow would translate those into '0' and '1'. But Unicode also provides the ASCII values 0x0030 and 0x0031 for the Latin representations of '0' and '1', as well as code points for the same numeric values in several other scripts (Tamil has no character for 0, but does have the digits 1-9), and any of these values would also be narrowed to '0' and '1'.

...

It's fairly common to intermix both native and Latin representations of numbers in a document. So I think the rule has to be that if a wide character represents a digit whose value is 0 then the bit should be cleared; if it represents a digit whose value is 1 then the bit should be set; otherwise throw an exception. So in a Devanagari locale, both 0x0966 and 0x0030 would clear the bit, and both 0x0967 and 0x0031 would set it. Widen can't do that. It would pick one of those two values, and exclude the other one.
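The rule described in this message could be sketched as follows. The helper name bit_from_digit is hypothetical, and the sketch hard-codes only the four code points named above instead of consulting a locale, so it is not a proposal for how an implementation would do this.

```cpp
#include <stdexcept>

// Hypothetical helper: returns the bit for a code point that represents a
// digit with value 0 or 1. Only the Latin and Devanagari digits mentioned
// in the message are covered; everything else is rejected.
bool bit_from_digit(char32_t c) {
    switch (c) {
        case 0x0030:  // Latin digit zero
        case 0x0966:  // Devanagari digit zero
            return false;  // clear the bit
        case 0x0031:  // Latin digit one
        case 0x0967:  // Devanagari digit one
            return true;   // set the bit
        default:
            throw std::invalid_argument("not a digit with value 0 or 1");
    }
}
```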

From Jens Maurer, in c++std-lib-8233:

Whatever we decide, I would find it most surprising if bitset conversion worked differently from int conversion with regard to alternate local representations of numbers.

Thus, I think the options are:

Have a new defect issue for 22.2.2.1.2/8 so that it will require the use of narrow().

Have a defect issue for bitset() which describes clearly that widen() is to be used.

Proposed resolution:

Replace the first two sentences of paragraph 5 with:

Extracts up to N characters from is. Stores these characters in a temporary object str of type basic_string<charT, traits>, then evaluates the expression x = bitset<N>(str).

Replace the third bullet item in paragraph 5 with:

the next input character is neither is.widen(0) nor is.widen(1) (in which case the input character is not extracted).

Rationale:

Input for bitset should work the same way as numeric input. Using widen does mean that alternative digit representations will not be recognized, but this was a known consequence of the design choice.