This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of Resolved status.

3698. regex_iterator and join_view don't work together very well

Section: 32.11 [re.iter], 26.7.14 [range.join] Status: Resolved Submitter: Barry Revzin Opened: 2022-05-12 Last modified: 2023-03-23

Priority: 2

Discussion:

Consider this example (from StackOverflow):

#include <ranges>
#include <regex>
#include <iostream>
#include <string_view>

int main() {
  char const text[] = "Hello";
  std::regex regex{"[a-z]"};

  auto lower = std::ranges::subrange(
        std::cregex_iterator(
            std::ranges::begin(text),
            std::ranges::end(text),
            regex),
        std::cregex_iterator{}
    )
    | std::views::join
    | std::views::transform([](auto const& sm) {
        return std::string_view(sm.first, sm.second);
    });

  for (auto const& sv : lower) {
    std::cout << sv << '\n';
  }
}

This example seems sound, having lower be a range of string_view that should refer back into text, which is in scope for all this time. The std::regex object is also in scope for all this time.

Yet, if you run this under AddressSanitizer, it fails in the first call to the dereference operator of the underlying transform_view's iterator, reporting heap-use-after-free.

The problem here is ultimately that regex_iterator is a stashing iterator (it has a member match_results) yet advertises itself as a forward_iterator (despite violating 25.3.5.5 [forward.iterators] p6 and 25.3.4.11 [iterator.concept.forward] p3).
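
The stashing behavior can be observed directly. What follows is a minimal sketch and not part of the original report; the function name copies_refer_to_distinct_objects is hypothetical. It relies on the forward iterator rule that two equal, dereferenceable iterators must refer to the same object, and shows that equal copies of a regex_iterator refer to different match_results objects while the type still advertises forward_iterator_tag.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <type_traits>

// regex_iterator names forward_iterator_tag as its iterator_category.
static_assert(std::is_same_v<
    std::iterator_traits<std::cregex_iterator>::iterator_category,
    std::forward_iterator_tag>);

// Hypothetical helper: returns true when two equal copies of a
// regex_iterator dereference to distinct match_results objects,
// i.e. when each iterator stashes its own value.
bool copies_refer_to_distinct_objects() {
  static const char text[] = "Hello";
  static const std::regex re{"[a-z]"};

  std::cregex_iterator a(std::begin(text), std::end(text), re);
  assert(a != std::cregex_iterator{});  // there is at least one match

  std::cregex_iterator b = a;  // copy, compares equal to a

  // A forward iterator would require &*a == &*b whenever a == b.
  return a == b && &*a != &*b;
}
```

Calling copies_refer_to_distinct_objects() is expected to return true, because each copy owns its own match_results member.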

Then, join_view's iterator stores an outer iterator (the regex_iterator) and an inner iterator (an iterator into the container that the regex_iterator stashes). Copying that iterator effectively invalidates it, since the new iterator's inner iterator will refer to the old iterator's outer iterator's container. These aren't (and can't be) independent copies. In this particular example, join_view's begin iterator is copied into the transform_view's iterator, and then the original is destroyed. The original owns the container that the new inner iterator still points to, so the new iterator is left dangling.
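
The copy problem can be sketched with simplified stand-ins. StashingOuter and JoinLikeIterator below are hypothetical types written for illustration only; they are not the library's actual join_view implementation. The sketch inspects the copy while the original is still alive, so it does not itself touch freed memory.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a stashing iterator: it owns the container
// that it hands out on dereference (as regex_iterator owns match_results).
struct StashingOuter {
  std::vector<int> stash{1, 2, 3};
  const std::vector<int>& operator*() const { return stash; }
};

// Hypothetical stand-in for join_view's iterator: an outer iterator plus
// an inner iterator into whatever the outer iterator refers to.
struct JoinLikeIterator {
  StashingOuter outer;
  std::vector<int>::const_iterator inner;

  JoinLikeIterator() : inner((*outer).begin()) {}
  // The implicit copy constructor copies outer (and with it the stash)
  // and copies inner verbatim, without re-seating it.
};

// Returns true when the copy's inner iterator still points into the
// ORIGINAL's stash rather than into the copy's own stash.
bool copy_points_into_original() {
  JoinLikeIterator original;
  assert(original.inner != original.outer.stash.end());  // dereferenceable

  JoinLikeIterator copy = original;

  const int* target = &*copy.inner;
  return target == original.outer.stash.data() &&
         target != copy.outer.stash.data();
}
```

Once the original is destroyed, the copy's inner iterator refers to a freed buffer, which corresponds to the heap-use-after-free described above.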

Note that the example runs without error on libc++ because libc++ moves the iterator instead of copying it, which happens to work. But I can produce other examples, unrelated to transform_view, that fail.

This is actually two different problems:

  1. regex_iterator is really an input iterator, not a forward iterator. It does not meet either the C++17 or the C++20 forward iterator requirements.

  2. join_view can't handle stashing iterators, and would need to additionally store the outer iterator in a non-propagating-cache for input ranges (similar to how it already potentially stores the inner iterator in a non-propagating-cache).
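
For readers unfamiliar with non-propagating-cache, here is a rough sketch of its copy and move behavior as specified for the exposition-only type in [range.nonprop.cache]. The class name NonPropagatingCache and the helper copy_and_move_start_empty are hypothetical, and the sketch covers only part of the interface. The key property is that a copied or moved-to cache starts empty, so a cached iterator is never carried over to another object.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Sketch of an optional-like cache whose contents never propagate.
template <class T>
class NonPropagatingCache {
  std::optional<T> value_;

public:
  NonPropagatingCache() = default;

  // Copying yields an empty cache; the source is left untouched.
  NonPropagatingCache(const NonPropagatingCache&) noexcept {}

  // Moving yields an empty cache and also empties the source.
  NonPropagatingCache(NonPropagatingCache&& other) noexcept {
    other.value_.reset();
  }

  NonPropagatingCache& operator=(const NonPropagatingCache& other) noexcept {
    if (this != &other) value_.reset();
    return *this;
  }

  NonPropagatingCache& operator=(NonPropagatingCache&& other) noexcept {
    value_.reset();
    other.value_.reset();
    return *this;
  }

  template <class... Args>
  T& emplace(Args&&... args) {
    return value_.emplace(std::forward<Args>(args)...);
  }

  bool has_value() const noexcept { return value_.has_value(); }
  T& operator*() { return *value_; }
};

// Hypothetical helper that exercises the copy and move rules.
bool copy_and_move_start_empty() {
  NonPropagatingCache<int> a;
  a.emplace(42);

  NonPropagatingCache<int> b = a;             // copy: b empty, a keeps 42
  bool copy_ok = a.has_value() && *a == 42 && !b.has_value();

  NonPropagatingCache<int> c = std::move(a);  // move: both end up empty
  bool move_ok = !a.has_value() && !c.has_value();

  return copy_ok && move_ok;
}
```

Storing the outer iterator in such a cache means that a copied join_view does not inherit a cached outer iterator from its source.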

(So potentially this could be two different LWG issues, but it seems nicer to think of them together.)

[2022-05-17; Reflector poll]

Set priority to 2 after reflector poll.

[Kona 2022-11-08; Move to Open]

Tim to write a paper

[2023-01-16; Tim comments]

The paper P2770R0 is provided with proposed wording.

[2023-03-22 Resolved by the adoption of P2770R0 in Issaquah. Status changed: Open → Resolved.]

Proposed resolution: