Issue 291: Underspecification of set algorithms

This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of CD1 status.

291. Underspecification of set algorithms

Section: 26.8.7 [alg.set.operations] Status: CD1 Submitter: Matt Austern Opened: 2001-01-03 Last modified: 2016-01-28

Priority: Not Prioritized

View all other issues in [alg.set.operations].

View all issues with CD1 status.

Discussion:

The standard library contains four algorithms that compute set operations on sorted ranges: set_union, set_intersection, set_difference, and set_symmetric_difference. Each of these algorithms takes two sorted ranges as inputs, and writes the output of the appropriate set operation to an output range. The elements in the output range are sorted.

The ordinary mathematical definitions are generalized so that they apply to ranges containing multiple copies of a given element. Two elements are considered to be "the same" if, according to an ordering relation provided by the user, neither one is less than the other. So, for example, if one input range contains five copies of an element and another contains three, the output range of set_union will contain five copies, the output range of set_intersection will contain three, the output range of set_difference will contain two, and the output range of set_symmetric_difference will contain two.

Because two elements can be "the same" for the purposes of these set algorithms, without being identical in other respects (consider, for example, strings under case-insensitive comparison), this raises a number of unanswered questions:

If we're copying an element that's present in both of the input ranges, which one do we copy it from?
If there are n copies of an element in the relevant input range, and the output range will contain fewer copies (say m) which ones do we choose? The first m, or the last m, or something else?
Are these operations stable? That is, does a run of equivalent elements appear in the output range in the same order as as it appeared in the input range(s)?

The standard should either answer these questions, or explicitly say that the answers are unspecified. I prefer the former option, since, as far as I know, all existing implementations behave the same way.

Proposed resolution:

Add the following to the end of 26.8.7.3 [set.union] paragraph 5:

If [first1, last1) contains m elements that are equivalent to each other and [first2, last2) contains n elements that are equivalent to them, then max(m, n) of these elements will be copied to the output range: all m of these elements from [first1, last1), and the last max(n-m, 0) of them from [first2, last2), in that order.

Add the following to the end of 26.8.7.4 [set.intersection] paragraph 5:

If [first1, last1) contains m elements that are equivalent to each other and [first2, last2) contains n elements that are equivalent to them, the first min(m, n) of those elements from [first1, last1) are copied to the output range.

Add a new paragraph, Notes, after 26.8.7.5 [set.difference] paragraph 4:

If [first1, last1) contains m elements that are equivalent to each other and [first2, last2) contains n elements that are equivalent to them, the last max(m-n, 0) elements from [first1, last1) are copied to the output range.

Add a new paragraph, Notes, after 26.8.7.6 [set.symmetric.difference] paragraph 4:

If [first1, last1) contains m elements that are equivalent to each other and [first2, last2) contains n elements that are equivalent to them, then |m - n| of those elements will be copied to the output range: the last m - n of these elements from [first1, last1) if m > n, and the last n - m of these elements from [first2, last2) if m < n.

[Santa Cruz: it's believed that this language is clearer than what's in the Standard. However, it's also believed that the Standard may already make these guarantees (although not quite in these words). Bill and Howard will check and see whether they think that some or all of these changes may be redundant. If so, we may close this issue as NAD.]

Rationale:

For simple cases, these descriptions are equivalent to what's already in the Standard. For more complicated cases, they describe the behavior of existing implementations.