@muellerj2
Contributor

This fixes the test case in #997, but I'm not linking the issue for closure yet because regex_error(error_stack) is still thrown in too many cases.

As explained in #5762, a simple loop satisfied all of these requirements before this PR:

  • Simple loops are non-reentrant, i.e., they are entered at most once in each trajectory.
  • The repeated pattern is branchless (no _Do_if node, no character class matching collating elements of various sizes).
  • The repeated pattern always matches strings of the same length (if it matches).
  • The repeated pattern does not introduce new captures.

Specifically, the repeated pattern could consist of the following expressions:

  • Single characters (escaped or not).
  • Character classes always matching strings of the same length, including the dot. (Character classes can potentially match strings of various lengths due to collating elements.)
  • Backreferences.

This PR removes the last of the listed requirements and weakens the second: Captures are now allowed, and some branching is now allowed if it can no longer be observed in the match state after each repetition. Specifically, the following additional expressions can now appear in a simple loop:

  • Sequences of characters (some potentially escaped).
  • Capturing and non-capturing groups.
  • Assertions: carets, anchors, word bounds and negative lookahead assertions.
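To make the list above concrete, here is a minimal sketch of patterns whose loop bodies use only the newly permitted expressions. The names `LoopSample`, `newly_simple_samples`, and `full_match` are hypothetical and exist only for this illustration. Whether the parser marks a loop as simple is an internal detail that cannot be observed through `<regex>`, so the sketch only checks that such patterns still match the way one would expect.

```cpp
#include <regex>
#include <vector>

// Hypothetical illustration type: a pattern, a subject it should fully match,
// and the newly permitted expression that its loop body exercises.
struct LoopSample {
    const char* pattern;
    const char* subject;
    const char* feature;
};

// Loop bodies built only from expressions in the list above.
inline std::vector<LoopSample> newly_simple_samples() {
    return {
        {"(?:ab)*", "ababab", "sequence of characters"},
        {"(?:a\\.)*", "a.a.", "sequence with an escaped character"},
        {"(a(b))*", "abab", "capturing groups"},
        {"(?:(?:a)b)*", "abab", "non-capturing group"},
        {"(?:\\ba)*", "a", "word boundary assertion"},
        {"(?:a(?!b))*", "aaa", "negative lookahead assertion"},
    };
}

// Matches the whole subject against the pattern (default ECMAScript grammar).
inline bool full_match(const LoopSample& s) {
    return std::regex_match(s.subject, std::regex(s.pattern));
}
```

Calling `full_match` on each sample should return `true` on any conforming implementation, independent of how the loop is classified internally.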

The following expressions remain forbidden in simple loops:

  • Disjunctions.
  • Loops.
  • Positive lookahead assertions.

This is because these expressions are "too branchy". This is obvious for disjunctions and loops (as they create the possibility that the pattern can match strings of various lengths), but the reasoning is more subtle for lookahead assertions.

Obviously, positive and negative lookahead assertions can internally branch, but neither of them can make the repeated pattern match strings of various lengths. However, the captures in positive lookahead assertions do not have to appear at the same position relative to the string matched by a repetition, which would take away an important optimization opportunity for simple loops. I consider positive lookahead assertions inside loops too rare to be worth this loss.

The captures inside a negative lookahead assertion, however, are never matched after the assertion has been processed, so while the assertion might internally branch, this doesn't make any difference in the position of the string matched by the repetition and the positions of any captures. This is why negative lookahead assertions may now appear in simple loops but positive lookahead assertions must not.
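The difference between the two kinds of lookahead can be observed with the public API. The following is a small sketch; `group1_state` is a hypothetical helper, and the two patterns in the usage note are illustrative choices rather than cases taken from the PR's tests.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: after a successful full match, reports whether capture
// group 1 is matched (1) or unmatched (0). Returns -1 if the subject does not
// match the pattern at all.
inline int group1_state(const std::string& pattern, const std::string& subject) {
    std::smatch m;
    if (!std::regex_match(subject, m, std::regex(pattern))) {
        return -1;
    }
    return m[1].matched ? 1 : 0;
}
```

With the body `(?!(a)b)a`, the inner group can match an `a` while the assertion is being evaluated, but the group is unmatched again once the assertion has succeeded, so `group1_state("(?:(?!(a)b)a)*", "aaa")` should report 0. With the positive counterpart `(?=(a))a`, the capture survives the assertion, so `group1_state("(?:(?=(a))a)*", "aaa")` should report 1.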

With these choices, simple loops are guaranteed to satisfy the following requirement:

  • After each repetition, every capturing group inside the repeated pattern is matched at the same position relative to the start of the string matched by the repeated pattern in that repetition.
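As an illustration of this requirement, the sketch below computes where capture group 1 sits inside the last repetition. The helper name `capture1_offset_in_last_repetition` is hypothetical, and it assumes a pattern of the form `(?:BODY)*` whose body always matches `body_length` characters, so that with `regex_match` the k-th repetition starts at index `k * body_length`.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the offset of capture group 1 relative to the
// start of the last repetition, or -1 if the subject does not match or the
// group is unmatched (for example, after zero repetitions).
inline long capture1_offset_in_last_repetition(const std::string& pattern,
                                               const std::string& subject,
                                               std::size_t body_length) {
    assert(body_length > 0);
    std::smatch m;
    if (!std::regex_match(subject, m, std::regex(pattern)) || !m[1].matched) {
        return -1;
    }
    // regex_match anchors the loop at index 0, so the remainder is the offset
    // inside the repetition that produced the capture.
    const std::size_t absolute = static_cast<std::size_t>(m.position(1));
    return static_cast<long>(absolute % body_length);
}
```

For the pattern `(?:a(b)c)*` with `body_length` 3, the offset should be 1 regardless of whether the subject is `abc` or `abcabcabc`.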

We can immediately take advantage of these settled properties of simple loops in the matcher. Specifically, these properties mean that each repetition of a simple loop matches the same capturing groups in the same order (outside of a negative lookahead assertion). Any backreference NFA node referencing them can only appear after a capturing group in a regular expression, so the referenced capturing group must either always be matched or always be unmatched. As for capturing groups inside negative lookahead assertions, they are always unmatched at the end of a loop repetition. This means that it cannot be observed via backreferences whether these capturing groups have been reset or not before each repetition in ECMAScript, so we can remove the resets of captures at the start of repetitions for simple loops. (I had added them in #5456 to avoid limiting our options for extending the notion of simple loops to more cases, but we no longer need this now that the requirements are settled.)
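A small sketch of the backreference argument, under the assumption that the loop body `(.)\1` is a representative case: the capturing group is matched anew in every repetition before the backreference reads it, so the backreference never sees a value left over from an earlier repetition. The helper name `last_pair_char` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: matches a subject made of doubled characters
// ("aabbcc") and returns the character captured in the last repetition.
// Returns an empty string if the subject does not match or the group is
// unmatched.
inline std::string last_pair_char(const std::string& subject) {
    static const std::regex doubled_pairs("(?:(.)\\1)*");
    std::smatch m;
    if (!std::regex_match(subject, m, doubled_pairs) || !m[1].matched) {
        return std::string();
    }
    return m[1].str();
}
```

For `aabbcc`, each repetition captures a fresh character and the backreference compares against that same repetition's capture, so the helper should return `c`. A subject such as `abab` should not match at all.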

This is a change with potential ABI impact, because marking more loops as simple changes the way they are processed by the matcher. For this reason, I ran all regex tests with the updated parser and the matcher included in MSVC Build Tools 14.50. All tests also passed in this configuration (minus the tests for matcher bugs that were only fixed after 14.50).

As a side effect, this change buys fewer stack overflows and fewer allocations at the cost of a theoretically higher runtime complexity for greedy loops in old matchers. But the limits on regex_error(error_stack) are so low and the runtime benefit of fewer allocations is so high that I think this change actually improves performance in practice for inputs that didn't trigger a regex_error(error_stack) exception before. (We can actually observe such a practical runtime advantage in the regex_match benchmarks in #5865 for the current matcher: Even though greedy a* should yield the better matching strategy for a long sequence of a's compared to non-greedy a*?, the costs of the additional allocations for the greedy processing are so high that a*? is still noticeably faster.)
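For callers, the exception discussed here surfaces as a `std::regex_error` whose `code()` is `std::regex_constants::error_stack`. The following is a minimal sketch of how a caller might tell it apart from an ordinary mismatch. `MatchOutcome` and `try_full_match` are hypothetical names, and the input length at which an implementation gives up is not specified here.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this illustration.
enum class MatchOutcome { matched, not_matched, stack_exhausted };

// Runs regex_match and maps error_stack to a dedicated outcome instead of
// letting the exception propagate.
inline MatchOutcome try_full_match(const std::string& pattern, const std::string& subject) {
    try {
        const std::regex re(pattern);
        return std::regex_match(subject, re) ? MatchOutcome::matched
                                             : MatchOutcome::not_matched;
    } catch (const std::regex_error& e) {
        if (e.code() == std::regex_constants::error_stack) {
            return MatchOutcome::stack_exhausted;
        }
        throw; // syntax errors and other failures are not handled in this sketch
    }
}
```

Both the greedy `a*` and the non-greedy `a*?` accept a run of `a` characters under `regex_match`, so `try_full_match` should report `matched` for either pattern on a short input such as `std::string(64, 'a')`; only the amount of work done by the matcher differs.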

The runtime impact of this change is currently too small to be visible in the benchmarks. This will change in another PR that implements more optimizations for simple loops.

@muellerj2 muellerj2 requested a review from a team as a code owner November 20, 2025 21:50
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Nov 20, 2025
@StephanTLavavej StephanTLavavej added the enhancement (Something can be improved) and regex (meow is a substring of homeowner) labels Nov 20, 2025
@StephanTLavavej StephanTLavavej self-assigned this Nov 20, 2025
@StephanTLavavej
Member

Thank you as always for the exceptionally clear writeup and careful changes! 😻 💯

@StephanTLavavej StephanTLavavej removed their assignment Nov 23, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Nov 23, 2025
