Patterns like `(?:^|;)\s*foo=...` can only start matching at input
start or at occurrences of the separator, but the generic
start-position loop still entered the VM at each byte and paid the
leading split/backtrack cost on every miss.
Teach the start-position analysis to recognize this `(^|literal)`
shape and jump straight to those candidate positions. Keep the
optimization narrow: wider literal sets would need a single-pass
scanner, and rescanning once per literal would make miss-heavy
alternations quadratic.
Add a LibRegex test for the cookie-style prefix. TestRegex still
passes, and a release JS benchmark exercising this shape remains
fast.
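A sketch of the shape this targets; the `foo` name and input are made up for illustration:

```javascript
// A cookie-style pattern: a match can only begin at the start of
// input or at a ';' separator, so those are the only candidate
// start positions worth entering the VM at.
const re = /(?:^|;)\s*foo=(\w+)/;

// The match anchors at the ';' preceding "foo=2" (index 5).
const match = re.exec("bar=1; foo=2");
console.log(match[1], match.index);
```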
Intersection and subtraction chains in unicode sets mode were accepting
bare range operands like `[a-z&&b-y]` and `[a-z--[aeiou]]`. V8 rejects
those forms; the range has to stay inside a nested class before it can
participate in a set operation.
Reject `ClassSetOperand::Range` when building `&&` or `--` expressions,
and extend the runtime regexp tests with the reported invalid patterns
plus an escaped-endpoint range case.
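The accepted and rejected shapes, sketched for an engine with `v`-flag (unicode sets) support; the feature check guards older engines:

```javascript
// Unicode sets mode rejects a bare range as a set-operation operand,
// but accepts the same range once nested in its own class.
let vSupported = true;
try { new RegExp("", "v"); } catch { vSupported = false; }

let bareRangeRejected = true;
let nestedRangeWorks = true;
if (vSupported) {
  bareRangeRejected = false;
  try {
    new RegExp("[a-z&&b-y]", "v"); // bare range operand: invalid
  } catch (e) {
    bareRangeRejected = e instanceof SyntaxError;
  }
  // Nested form: intersection of [a-z] and [b-y] excludes 'a'.
  const nested = new RegExp("[[a-z]&&[b-y]]", "v");
  nestedRangeWorks = nested.test("c") && !nested.test("a");
}
```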
Preserve V8's behavior for bare single-astral literals when a unicode
global search starts in the middle of a surrogate pair. We were
snapping that lastIndex back to the pair start unconditionally,
which let /😀/gu and /\u{1F600}/gu match where V8 returns null.
Expose that literal shape from LibRegex to LibJS and add runtime
coverage for the bare literal case alongside a grouped control.
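A sketch of the behavior being preserved:

```javascript
// "😀" is one code point stored as two code units (a surrogate pair).
const s = "\u{1F600}";
const bare = /\u{1F600}/gu;

// Start the search in the middle of the pair: no match exists from
// there, so exec() returns null and resets lastIndex to 0 rather
// than snapping back to the pair start and matching.
bare.lastIndex = 1;
const result = bare.exec(s);
console.log(result, bare.lastIndex);
```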
Track whether a pattern can match empty and only skip interior
surrogate positions when the matcher must consume input. This keeps
the unicode candidate scan fast for consuming searches without
dropping valid zero-width matches such as /\B/ and /(?!...)/ between
a surrogate pair's two code units.
Add runtime coverage for both global lastIndex searches and plain
exec() searches on zero-width unicode patterns.
When a backward greedy loop backtracked toward the right edge of the
input, the optimized scan for a following Char instruction could stop
making progress at end of input and loop forever. This made patterns
like /(?<=a.?)/ hang on non-matching input.
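The fixed behavior, sketched in JavaScript:

```javascript
// A zero-width pattern guarded by a variable-length lookbehind.
const re = /(?<=a.?)/;

// No 'a' anywhere: the search must terminate with null instead of
// looping forever at the end of input.
console.log(re.exec("bbb"));

// With an 'a' present, the first position whose left context
// satisfies the lookbehind is just after the 'a' (index 2).
console.log(re.exec("xab").index);
```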
Keep the greedy built-in class fast path aligned with the regular VM
matcher for non-Unicode regexps. Without this, /\w+/i and /\W+/i
wrongly applied Unicode ignore-case behavior in the optimized loop.
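A sketch of the semantic difference the fast path must respect:

```javascript
// Without the u flag, ignore-case canonicalization never maps a
// non-ASCII character into \w, so U+017F (LATIN SMALL LETTER LONG S)
// does not match.
console.log(/\w/i.test("\u017F"));

// With the u flag, Unicode case folding maps U+017F to 's', so \w
// matches it under /iu.
console.log(/\w/iu.test("\u017F"));
```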
Detect literal tails that lie on the linear success path and reject
matches early when that literal never appears in the remaining input.
This lets /(a+)+b/ fail quickly on long runs of a instead of spending
its backtracking budget proving that the missing b can never match.
Keep the tail analysis cheap while doing this. The new required-literal
hint reuses trailing-literal extraction, so rewrite the linear-tail
check to compute predecessor counts in one pass instead of rescanning
the whole program for every instruction. That keeps large regex parses,
including the large bytestring constructor test, fast.
Add a regression test for ("a".repeat(25)).match(/(a+)+b/), which
should return null without throwing a backtrack-limit error.
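The catastrophic shape, at a size small enough to finish even without the hint:

```javascript
// Every decomposition of the run of a's must otherwise be tried
// before the engine can prove the trailing 'b' never matches; a
// required-literal check rejects immediately once 'b' is absent
// from the remaining input.
const input = "a".repeat(15);
console.log(input.match(/(a+)+b/));
```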
Patterns like `^(a|a?)+$` build a split tree where each `a` can be
matched by either alternative. On short non-matching inputs, that
still blows through the backtrack limit even though the two
alternatives are semantically equivalent to a single greedy optional
matcher.
Detect the narrow case of two single-term alternatives that compile to
the same simple matcher, where one is required and the other is greedy
`?`, and compile them as that single optional term instead of emitting
a disjunction.
Add a String.prototype.match regression for
("a".repeat(25) + "b").match(/^(a|a?)+$/), which should return null
instead of throwing an InternalError.
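The equivalent shape at a smaller size, runnable in any engine:

```javascript
// Both alternatives compile to the same single-character matcher;
// one is required and one is optional, so the disjunction is
// equivalent to a single greedy optional term.
const input = "a".repeat(12) + "b";
console.log(input.match(/^(a|a?)+$/));
```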
Backward Char instructions read the code point immediately to the left
of the current position, but the greedy loop backtracking optimization
was scanning for the next literal at the current position itself.
That meant a lookbehind like `(?<=h.*)THIS` never reconsidered the
boundary after `h`, so valid matches were missed.
When the quantified part was allowed to shrink to zero, as in the
reported `(?<!.*q.*?)(?<=h.*)THIS(?=.*!)` pattern, the same
backtracking bug could thrash badly enough to appear hung.
Fix the backward greedy-loop scan to test candidate boundaries against
the code units immediately to their left. Do the same for supplementary
characters by checking the surrogate pair ending at that boundary.
Add String.prototype.match regressions for both the simple greedy
lookbehind and the full reported pattern.
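The example input here is made up; the patterns are the ones from the report:

```javascript
// The boundary to reconsider is *after* 'h': the lookbehind's .*
// must be allowed to shrink so h.* ends exactly where THIS begins.
console.log("hello THIS!".match(/(?<=h.*)THIS/).index);

// The full reported pattern, including the zero-width-capable
// negative lookbehind, must also terminate and match.
console.log("hello THIS!".match(/(?<!.*q.*?)(?<=h.*)THIS(?=.*!)/)[0]);
```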
RepeatMatcher retries a quantified atom with its own captures cleared,
but if an additional greedy iteration matches the empty string the
engine must fall back to the pre-iteration state. The fast VM path was
clearing capture registers after backtracking from ProgressCheck,
which meant the restored state from the previous successful iteration
was immediately wiped out.
That showed up with nested quantified captures like
"xyz123xyz".match(/((123)|(xyz)*)*/), where the final empty expansion
of the outer `*` discarded the last non-empty captures and returned
undefined for groups 1 and 4.
The same area also needs to track each zero-width-capable iteration's
start position explicitly. Initializing that state with ProgressCheck
stored the end of the previous repetition instead, which regressed
patterns like `/(a*)*/` by letting an empty iteration commit `""`
into the capture instead of falling back to the pre-iteration state
with an undefined capture.
Clear captures before backtracking from a rejected empty iteration,
and save iteration starts before entering quantified bodies so
ProgressCheck only decides whether that iteration made progress.
Add regressions for the reported nested quantified capture case and
for `/(a*)*/.exec("b")`, which should leave the capture undefined.
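The two behaviors pinned down here, per the spec's RepeatMatcher:

```javascript
// A rejected empty expansion of the outer * must fall back to the
// captures from the last non-empty iteration...
console.log("xyz123xyz".match(/((123)|(xyz)*)*/));
// → ["xyz123xyz", "xyz", undefined, "xyz"]

// ...while an iteration that only ever matched empty leaves its
// capture undefined rather than committing "".
console.log(/(a*)*/.exec("b"));
// → ["", undefined]
```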
Drop the outdated note that GreedyLoop backtracking does not need a
register snapshot.
The previous commit started snapshotting registers for the optimized
GreedyLoop and LazyLoop states so capture groups are restored correctly
across backtracking. Keeping the old comment would describe the
opposite of the code we now rely on.
Snapshot registers for GreedyLoop and LazyLoop backtrack states so
failed alternatives cannot leak capture mutations into an older loop
choice point.
Before this change, those optimized states only restored the input
position and active modifiers. If a later branch changed capture
registers before failing, revisiting an earlier loop state reused
the stale captures instead of the state that was current when the
loop state was pushed.
That let /^(b+|a){1,2}?bc/ on "bbc" produce an invalid group 1 range
with start 2 and end 1, which later tripped UBSan while
RegExp.prototype.exec materialized the match result.
Add a RegExp.prototype.exec regression for this pattern so we keep
the expected ["bbc", "b"] result covered.
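The covered case, as it behaves in a conforming engine:

```javascript
// The lazy quantifier first tries one iteration; b+ initially grabs
// "bb", then must give a unit back so "bc" can match. Revisiting
// that loop choice point has to restore group 1 to the state saved
// when the choice point was pushed.
console.log(/^(b+|a){1,2}?bc/.exec("bbc"));
// → ["bbc", "b"]
```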
Teach the Rust matcher to execute directly on ASCII-backed input.
Make the VM and literal fast paths generic over an input trait so we
can monomorphize separate ASCII and WTF-16 execution paths without
duplicating the regex semantics. Add ASCII-specific FFI entry points
and have the C++ bridge dispatch to them whenever Utf16View carries
ASCII storage.
This removes the per-match widening step from the hot path for exec(),
test(), and find_all(), which is exactly where LibJS often hands us
pure ASCII strings in 8-bit form. Keep the compiled representation
and reported capture offsets in UTF-16 code units so the observable
JavaScript behavior stays unchanged.
Delete the old C++ ECMA-262 parser, optimizer, and matcher now that all
in-tree users compile and execute through `ECMAScriptRegex`.
Stop building the legacy engine, remove its source files and the
POSIX-only fuzzers that depended on it, and update the remaining
LibRegex tests to target the Rust-backed facade instead of the deleted
implementation. Clean up the last includes, comments, and helper paths
that only existed to support the old backend.
After this commit LibRegex has a single ECMAScript engine in-tree,
eliminating duplicated maintenance and unifying future regex work.
Add `ECMAScriptRegex`, LibRegex's C++ facade for ECMAScript regexes.
The facade owns compilation, execution, captures, named groups, and
error translation for the Rust backend, which lets callers stop
depending on the legacy parser and matcher types directly. Use it in the
remaining non-LibJS callers: URLPattern, HTML input pattern handling,
and the places in LibHTTP that only needed token validation.
Where a full regex engine was unnecessary, replace those call sites with
direct character checks. Also update focused LibURL, LibHTTP, and WPT
coverage for the migrated callers and corrected surrogate handling.
Add LibRegex's new Rust ECMAScript regular expression engine.
Replace the old parser's direct pattern-to-bytecode pipeline with a
split architecture: parse patterns into a lossless AST first, then
lower that AST into bytecode for a dedicated backtracking VM. Keep the
syntax tree as the place for validation, analysis, and optimization
instead of teaching every transformation to rewrite partially built
bytecode.
Specialize this backend for the job LibJS actually needs. The old C++
engine shared one generic parser and matcher stack across ECMA-262 and
POSIX modes and supported both byte-string and UTF-16 inputs. The new
engine focuses on ECMA-262 semantics on WTF-16 data, which lets it
model lone surrogates and other JavaScript-specific behavior directly
instead of carrying POSIX and multi-encoding constraints through the
whole implementation.
Fill in the ECMAScript features needed to replace the old engine for
real web workloads: Unicode properties and sets, lookahead and
lookbehind, named groups and backreferences, modifier groups, string
properties, large quantifiers, lone surrogates, and the parser and VM
corner cases those features exercise.
Reshape the runtime around compile-time pattern hints and a hotter VM
loop. Pre-resolve Unicode properties, derive first-character,
character-class, and simple-scan filters, extract safe trailing
literals for anchored patterns, add literal and literal-alternation
fast paths, and keep reusable scratch storage for registers,
backtracking state, and modifier stacks. Teach `find_all` to stay
inside one VM so global searches stop paying setup costs on every
match.
Make those shortcuts semantics-aware instead of merely fast. In Unicode
mode, do not use literal fast paths for lone surrogates, since
ECMA-262 must not let `/\ud83d/u` match inside a surrogate pair.
Likewise, only derive end-anchor suffix hints when the suffix lies on
every path to `Match`, so lookarounds and disjunctions cannot skip into
a shared tail and produce false negatives.
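The lone-surrogate rule is observable from JavaScript; a minimal sketch:

```javascript
// In Unicode mode a lone lead surrogate can only match an actual
// lone surrogate, never the first half of a surrogate pair.
console.log(/\ud83d/u.test("\ud83d\ude00")); // paired: no match
console.log(/\ud83d/u.test("\ud83d"));       // lone:   match
```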
This commit lands the Rust crate, the C++ wrapper, the build
integration, and the initial LibJS-side plumbing needed to exercise
the new engine under real RegExp callers before removing the legacy
backend.
Backreferences can match the empty string when the referenced group
didn't participate in the match, so we shouldn't add their length to the
match_length_minimum, as it makes us skip valid matches.
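A minimal sketch of the kind of match this was ruling out (the pattern is illustrative):

```javascript
// Group 1 never participates, so the backreference matches the empty
// string; counting its length toward the minimum match length would
// wrongly reject this match against "".
console.log(/^(ab)?\1$/.test(""));
```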
The optimizer wasn't considering case-insensitive mode when checking
for overlap between the repeated element and what follows, so
patterns like `/a*A\d/i` failed to match because 'a' and 'A' weren't
seen as overlapping. We now compare them case-insensitively when the
`i` flag is set.
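The repaired behavior, with a made-up input:

```javascript
// With the i flag, the repeated 'a' and the following 'A' overlap:
// the greedy loop must be willing to give back a character it
// consumed so the literal 'A' can match it.
console.log(/a*A\d/i.test("aaa1"));
```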
`range_contains()` checked if an lhs_range was contained within the
query range, rather than checking for overlap. This caused patterns
like `/A*[A-Z]/` to fail matching "A" because the optimizer didn't
detect that 'A' overlaps with [A-Z]. And `char_class_contains()` only
checked if two character classes were identical, not if they overlapped.
So patterns like `/\d*\w/` failed to match "1" because \d and \w were
not recognized as overlapping, even though all digits are word
characters.
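Both overlap cases from above, as observable matches:

```javascript
// 'A' overlaps [A-Z]: A* must be able to yield its 'A' so the class
// can consume it.
console.log("A".match(/A*[A-Z]/)[0]); // → "A"

// Every \d is also a \w, so \d* must likewise give "1" back to the
// following \w.
console.log("1".match(/\d*\w/)[0]);   // → "1"
```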
In non-Unicode mode, incomplete escape sequences like `\x0` or `\u00`
should be parsed as literal characters. `read_digits_as_string` consumed
hex digits but did not restore the parser position when fewer digits
than required were found, and `consume_escaped_code_point` did not
update `current_token` after falling back to literal 'u'.
When `\0` is followed by digits, we backtrack to parse it as a legacy
octal escape. We need to backtrack 2 characters, so
`parse_legacy_octal_escape` sees the leading `0` and can parse sequences
correctly.
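Both escape-parsing behaviors, as observable in Annex B (non-Unicode) mode:

```javascript
// Incomplete \x and \u escapes fall back to literal characters.
console.log(/\x0/.test("x0"));   // 'x' followed by '0'
console.log(/\u00/.test("u00")); // 'u' followed by "00"

// \0 followed by another digit re-parses as a legacy octal escape,
// here code point 7 (BEL).
console.log(/\07/.test("\x07"));
```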
These new traits are identical to `Traits<Integral>`, except that
calling `.hash()` will return the value itself instead of hashing it.
This should be used in cases where either the value is already a proper
hash, or using the value as a hash will yield "good enough" performance
in e.g. HashTable.
Types larger than 32 bits are folded in on themselves. Collision tests
on some popular hashing algorithms show that XOR folding slightly
increases the number of collisions, but this allows `IdentityHashTraits`
not to make any assumptions on which bits are the most relevant for the
final hash.
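A sketch of the folding step for a 64-bit value, using BigInt for illustration (the real traits operate on native integers):

```javascript
// XOR-fold the high half into the low half so every input bit can
// influence the resulting 32-bit hash, without assuming which half
// carries the relevant bits.
function foldToU32(v) { // v: BigInt in [0, 2^64)
  return Number((v ^ (v >> 32n)) & 0xffffffffn);
}

console.log(foldToU32(0x1234_5678_0000_0000n).toString(16)); // "12345678"
console.log(foldToU32(0xffff_ffff_ffff_ffffn));              // 0
```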
We had three instances of `pair_int_hash()` being called with a value
that was pulled through `u32_hash()`, which is not necessary - both
arguments to `pair_int_hash()` will be properly hashed.
Rework our hash functions a bit for significant better performance:
* Rename int_hash to u32_hash to mirror u64_hash.
* Make pair_int_hash call u64_hash instead of multiple u32_hash()es.
* Implement MurmurHash3's fmix32 and fmix64 for u32_hash and u64_hash.
On my machine, this speeds up u32_hash by 20%, u64_hash by ~290%, and
pair_int_hash by ~260%.
We lose the property that an input of 0 results in something that is not
0. I've experimented with an offset to both hash functions, but it
resulted in a measurable performance degradation for u64_hash. If
there's a good use case for 0 not to result in 0, we can always add in
that offset as a countermeasure in the future.
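For reference, MurmurHash3's fmix32 finalizer, sketched in JavaScript with 32-bit arithmetic (the actual code is C++ in AK):

```javascript
// MurmurHash3 fmix32: alternate xor-shifts and multiplications by
// the published constants to diffuse every input bit.
function fmix32(h) {
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// The zero fixed point discussed above: fmix32(0) is 0.
console.log(fmix32(0), fmix32(1));
```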
Previously, case-insensitive regex matching used ASCII-only case
conversion (to_ascii_lowercase) even for Unicode characters. Now we
implement the Canonicalize abstract operation, so Unicode characters
are case-folded properly during case-insensitive matching.
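An example of what proper folding enables:

```javascript
// Simple Unicode case folding maps U+03A3, U+03C2, and U+03C3 (the
// Greek sigmas) to the same character, so they match each other
// case-insensitively in Unicode mode; ASCII-only lowercasing would
// leave them distinct.
console.log(/ς/iu.test("Σ"));
console.log(/Σ/iu.test("σ"));
```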
Avoid crashing in RegexDebug when saved_positions or step_backs
are empty.
These cases are already handled correctly by the bytecode execution,
but the debug output assumed non-empty vectors.
Print a placeholder instead when no entries are present.
This fixes #7502.
This commit implements the regexp-modifiers proposal, which allows
toggling the `i`, `m`, and `s` flags within groups using
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
Step 2.b of the RepeatMatcher states that once minimum repetitions
are satisfied, empty matches should not be considered for further
repetitions. This was not being enforced for optional quantifiers
like `?`, so we had extra capture group matches.
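The rule in action, per the spec's RepeatMatcher:

```javascript
// The optional group's only expansion here is empty; once the
// minimum (zero) is satisfied, that empty match must not be
// committed, so group 1 stays undefined instead of becoming "".
console.log(/(a?)?/.exec("b"));
// → ["", undefined]
```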
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.
These opcodes replace the previous GoBack-based approach and enable
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.
Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.
Partially fixes #3459.
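Variable-length lookbehind, which a fixed GoBack cannot express, in JavaScript terms (the example pattern is made up):

```javascript
// The lookbehind body \d+ has no fixed width, so the engine must
// search backward for a viable starting point for the assertion.
const m = /(?<=\d+)px/.exec("100px");
console.log(m[0], m.index);
```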
ca2f0141f6 removed only the execution side
of this, which made it skip some optimizations for pure string
searches. This commit implements it properly for UTF-16 strings
instead.