<regex>: Mark more loops as simple
#5889
Open
+12
−22
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This fixes the test case in #997. But I'm not linking the issue for closure yet because the
regex_error(error_stack)is still thrown in too many cases.As explained in #5762, a simple loop satisfied all of these requirements before this PR:
Specifically, the repeated pattern could consist of the following expressions:
This PR removes the last of the listed requirements and weakens the second: Captures are now allowed, and some branching is now allowed if it can no longer be observed in the match state after each repetition. Specifically, the following additional expressions can now appear in a simple loop:
The following expressions remain forbidden in simple loops:
This is because these expressions are "too branchy". This is obvious for disjunctions and loops (as they create the possibility that the pattern can match strings of various lengths), but the reasoning is more subtle for lookahead assertions.
Obviously, positive and negative lookahead assertions can internally branch, but neither of them can make the repeated pattern match strings of various lengths. However, the captures in positive lookahead assertions do not have to appear in the same relative position to the string matched by the repetitions, which would take away an important optimization opportunity for simple loops. I consider positive lookahead assertions inside loops too rare to be worth this loss.
The captures inside a negative lookahead assertion, however, are never matched after the assertion has been processed, so while the assertion might internally branch, this doesn't make any difference in the position of the string matched by the repetition and the positions of any captures. This is why negative lookahead assertions may now appear in simple loops but positive lookahead assertions must not.
These choices settle that simple loops satisfy the following requirement:
We can immediately take advantage of these settled properties of simple loops in the matcher. Specifically, these properties mean that each repetition of a simple loop matches the same capturing groups in the same order (outside of a negative lookahead assertion). Any backreference NFA node referencing them can only appear after a capturing group in a regular expression, so the referenced capturing group must either always be matched or always be unmatched. As for capturing groups inside negative lookahead assertions, they are always unmatched at the end of a loop repetition. This means that it cannot be observed via backreferences whether these capturing groups have been reset or not before each repetition in ECMAScript, so we can remove the resets of captures at the start of repetitions for simple loops. (I had added them in #5456 to avoid limiting our options for extending the notion of simple loops to more cases, but we no longer need this now that the requirements are settled.)
This is a change with potential ABI impact, because marking more loops as simple changes the way they are processed by the matcher. For this reason, I ran all regex tests with the updated parser and the matcher included in MSVC Build Tools 14.50. All tests also passed in this configuration (minus the tests for matcher bugs that were only fixed after 14.50).
As a side effect, this change trades fewer stack overflows and fewer allocations against a theoretically larger runtime complexity for greedy loops in old matchers. But the limits on
regex_error(error_stack)are so low and the runtime benefit of fewer allocations is so high that I think this change actually improves performance in practice for inputs that didn't trigger aregex_error(error_stack)exception before. (We can actually observe such a practical runtime advantage in the regex_match benchmarks in #5865 for the current matcher: Even though greedya*should yield the more optimal matching strategy for a long sequence of a's compared to non-greedya*?, the costs of the additional allocations for the greedy processing are so high thata*?is still noticeably faster.)The runtime impact of this change is currently too small in the benchmarks to be visible. This will change in another PR that implements more optimizations for simple loops.