Return JS::Substring objects from the builtin regexp exec and split
paths instead of eagerly copying UTF-16 slices into new strings.
Matches, captures, and split pieces can now point back at the original
input until someone asks for the string contents.
Add focused runtime coverage for UTF-16 captures and regex split
captures so these lazy slices stay exercised.
Detect literal tails that lie on the linear success path and reject
matches early when that literal never appears in the remaining input.
This lets /(a+)+b/ fail quickly on long runs of a instead of spending
its backtracking budget proving that the missing b can never match.
Keep the tail analysis cheap while doing this. The new required-literal
hint reuses trailing-literal extraction, so rewrite the linear-tail
check to compute predecessor counts in one pass instead of rescanning
the whole program for every instruction. That keeps large regex parses,
including the large bytestring constructor test, fast.
Add a regression test for ("a".repeat(25)).match(/(a+)+b/), which
should return null without throwing a backtrack-limit error.
Patterns like `^(a|a?)+$` build a split tree where each `a` can be
matched by either alternative. On short non-matching inputs that still
blows through the backtrack limit even though the two alternatives are
semantically equivalent to a single greedy optional matcher.
Detect the narrow case of two single-term alternatives that compile to
the same simple matcher, where one is required and the other is greedy
`?`, and compile them as that single optional term instead of emitting
a disjunction.
Add a String.prototype.match regression for
("a".repeat(25) + "b").match(/^(a|a?)+$/), which should return null
instead of throwing an InternalError.
Backward Char instructions read the code point immediately to the left
of the current position, but the greedy loop backtracking optimization
was scanning for the next literal at the current position itself.
That meant a lookbehind like `(?<=h.*)THIS` never reconsidered the
boundary after `h`, so valid matches were missed.
When the quantified part was allowed to shrink to zero, as in the
reported `(?<!.*q.*?)(?<=h.*)THIS(?=.*!)` pattern, the same
backtracking bug could thrash badly enough to appear hung.
Fix the backward greedy-loop scan to test candidate boundaries against
the code units immediately to their left. Do the same for supplementary
characters by checking the surrogate pair ending at that boundary.
Add String.prototype.match regressions for both the simple greedy
lookbehind and the full reported pattern.
RepeatMatcher retries a quantified atom with its own captures cleared,
but if an additional greedy iteration matches the empty string the
engine must fall back to the pre-iteration state. The fast VM path was
clearing capture registers after backtracking from ProgressCheck,
which meant the restored state from the previous successful iteration
was immediately wiped out.
That showed up with nested quantified captures like
"xyz123xyz".match(/((123)|(xyz)*)*/), where the final empty expansion
of the outer `*` discarded the last non-empty captures and returned
undefined for groups 1 and 4.
The same area also needs to track each zero-width-capable iteration's
start position explicitly. Initializing that state with ProgressCheck
stored the end of the previous repetition instead, which regressed
patterns like `/(a*)*/` by letting an empty iteration commit `""`
into the capture instead of falling back to the pre-iteration state
with an undefined capture.
Clear captures before backtracking from a rejected empty iteration,
and save iteration starts before entering quantified bodies so
ProgressCheck only decides whether that iteration made progress.
Add regressions for the reported nested quantified capture case and
for `/(a*)*/.exec("b")`, which should leave the capture undefined.
Switch LibJS `RegExp` over to the Rust-backed `ECMAScriptRegex` APIs.
Route `new RegExp()`, regex literals, and the RegExp builtins through
the new compile and exec APIs, and stop re-validating patterns with the
deleted C++ parser on the way in. Preserve the observable error
behavior by carrying structured compile errors and backtracking-limit
failures across the FFI boundary. Cache compiled regex state and named
capture metadata on `RegExpObject` in the new representation.
Use the new API surface to simplify and speed up the builtin paths too:
share `exec_internal`, cache compiled regex pointers, keep the legacy
RegExp statics lazy, run global replace through batch `find_all`, and
optimize replace, test, split, and String helper paths. Add regression
tests for those JavaScript-visible paths.