Issue 3941: §[atomics.order] inadvertently prohibits widespread implementation techniques

This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of Open status.

3941. §[atomics.order] inadvertently prohibits widespread implementation techniques

Section: 32.5.4 [atomics.order] Status: Open Submitter: Hans Boehm Opened: 2023-05-27 Last modified: 2025-02-10

Priority: 3

View other active issues in [atomics.order].

View all other issues in [atomics.order].

View all issues with Open status.

Discussion:

[Discovered by Ori Lahav, lead author of Repairing Sequential Consistency in C/C++11, which P0668 more or less tried to implement. This description also includes observations from Olivier Giroux and David Goldblatt.]

Currently, we require that the sequential consistency order S is consistent with the extended coherence order, which includes "reads before" and "reads from". This may require sc reads to be consistently ordered even if they see non-sc writes. This is not enforced by some "fence-after-store"/"trailing fence" sc implementations, notably on X86. It may also be violated by compiler transformations that eliminate a sequentially consistent load after a store into the same location, e.g. on sequentially consistent hardware.

As Ori pointed out, this is exhibited by their (SB + rfis) example on x86 reproduced below. I would expect this also happens on other platforms. The (SB + rfis) example:

Thread 1: x.store(1, mo_release); r1 = x.load() /* sees 1 */; r2 = y.load() /* sees 0 */
Thread 2: y.store(1, mo_release); r1 = y.load() /* sees 1 */; r2 = x.load() /* sees 0 */

This is unintentionally and unexpectedly disallowed by the current standard. We can see this as follows:

We have the following coherence-ordered-before relationships:

By 3.1 in 32.5.4 [atomics.order] (reads-from): The store in each thread is coherence-ordered-before the immediately following load.
By 3.3 (reads-before): The final load in each thread is coherence-ordered before the store in the other thread.
By 3.4 (transitivity): The final load in each thread is coherence-ordered-before the initial load in the other thread.

On the other hand, the initial load in each thread is sequenced-before and thus strongly-happens-before 6.10.2.2 [intro.races] p12.1 the final load in that thread. Thus coherence-ordered-before union strongly-happens-before is cyclic. But the S relation 32.5.4 [atomics.order] p4 must order all the loads, consistent with both strongly-happens-before (p4) and coherence-ordered-before (p4.1), which is clearly impossible.

With the standard x86 implementation, no fences are inserted. Thread 1 and Thread 2 see the stores to x and y in inconsistent orders, which is allowed by TSO, since each thread can see its own write early. Thus the implementation allows the outcome in question, as expected.

The problem here is an error during the attempt to simplify and translate plv.mpi-sws.org/scfix/paper.pdf to standardese. There is no known issue with the underlying paper.

I have not yet wrapped my head around this sufficiently to be able to suggest a solution, or to understand how difficult that will be. This happened as a consequence of attempting to strengthen SC fences, and not fixing the Power compilation problem (P0668). So we could probably improve matters by reverting the strengthening of sc fences. Clearly, we'd prefer to avoid that. We could just adopt the formulation in the paper, though that seems even harder to motivate informally. Fundamentally, it feels like we inadvertently strengthened (and broke) the semantics of ordinary sc operations, when we intended to only affect fences. The original paper makes more of a distinction there. The question is how we can reformulate that suitably here.

[ Meta-observation from a small off-line discussion that may be relevant to how we phrase the resolution here: The fact that nobody noticed this for a very long time, and implementers were not bothered by it, suggests that the audience for this part of the standard is nearly empty. We conjecture that implementers actually rely on atomics mappings generated by memory model experts, who are more interested in formal models than standardese. A more formal description is likely to increase the size of the audience, and would definitely ease verification and reduce the probability of mistakes like this.]

[2023-06-01; Reflector poll]

Set priority to 3 after reflector poll. Send to SG1.

[2025-02-10; SG1 reviewed the issue in March 2024 ]

"we want to address LWG3941 in a way that preserves existing implementations like X86" - unanimous consent.

Proposed resolution: