Previously, a character class containing any builtin (\d, \w, \s)
forced the compiler down the slow "complex class" path, which emits
a disjunction of alternatives and backtracks at runtime.
For non-unicode, non-unicode-sets, non-negated classes, \w and \d
can be inlined as their raw ASCII code-point ranges. The resulting
class stays on the fast path and compiles into a single sorted
CharClass instruction.
The unicode/unicode_sets and negation guards are required for
correctness: with the /u + /i flags, \w gains non-ASCII members
via case folding (e.g. U+017F, U+212A), and negated classes have
a separate, smarter compilation path.
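As a concrete reference for the fast path, outside /u and /v the builtins
expand to fixed ASCII sets, so a class like `[\w-]` flattens into plain
sorted ranges. A small observable-behavior sketch of that equivalence (the
range spelling below is the standard ECMAScript definition, not LibRegex's
internal table):

```javascript
// ASCII expansions of the builtins in non-unicode, non-negated classes:
//   \d -> [0-9]            \w -> [0-9A-Za-z_]
const inlined = /[0-9A-Z_a-z-]/; // [\w-] inlined as raw ranges
const viaBuiltin = /[\w-]/;

// Both forms accept exactly the same code units.
let same = true;
for (let cp = 0; cp < 0x100; ++cp) {
  const ch = String.fromCharCode(cp);
  if (inlined.test(ch) !== viaBuiltin.test(ch)) same = false;
}
```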
Let's not have LibRegex be the home of LibUnicode FFI. Move these
helpers to LibUnicode so that we can:
1. Use these helpers in other libraries more easily.
2. Swap out icu4c methods with icu4x methods all within LibUnicode.
Unrolling a bounded quantifier {min,max} into (max-min) optional Split
chains lets the backtracker explore O(2^n) paths, which quickly
exhausts the backtrack limit for large bounds.
Fix this by compiling the optional tail via a RepeatStart/RepeatCheck
counted loop when the atom is known to be non-zero-width. The loop
is safe to use without a progress check precisely because the atom
cannot match empty.
This required making atom_can_be_zero_width recursive into group bodies:
previously it conservatively returned true for all Group and
NonCapturingGroup atoms, so the non-zero-width guard could never fire
for grouped subexpressions.
The old lowering triggered "Regular expression backtrack limit exceeded"
for patterns like /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/, causing
inputs that should match normally (or return null) to throw instead.
Fixes syntax highlighting of the C++ API on https://blend2d.com
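The reported pattern is an ordinary bounded quoted-string matcher; a quick
illustration of the inputs that previously threw instead of matching or
returning null:

```javascript
// Bounded string-literal matcher from the report: up to 32 escaped or
// plain characters between single quotes.
const re = /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/;
const hit = "int x = 'a';".match(re)[0]; // the quoted character literal
const miss = re.test("no quotes here");  // should be false, not a throw
```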
Switch to LibUnicode’s ICU-backed functions.
Keep the explicit checks for '$', '_', U+200C, and U+200D that
ECMAScript requires on top of the Unicode properties.
Add test coverage for both the newly accepted case
and regression guards for cases that must continue to work.
Use mimalloc for Ladybird-owned allocations without overriding malloc().
Route kmalloc(), kcalloc(), krealloc(), and kfree() through mimalloc,
and put the embedded Rust crates on the same allocator via a shared
shim in AK/kmalloc.cpp.
This also lets us drop kfree_sized(), since it no longer used its size
argument. StringData, Utf16StringData, JS object storage, Rust error
strings, and the CoreAudio playback helpers can all free their AK-backed
storage with plain kfree().
Sanitizer builds still use the system allocator. LeakSanitizer does not
reliably trace references stored in mimalloc-managed AK containers, so
static caches and other long-lived roots can look leaked. Pass the old
size into the Rust realloc shim so aligned fallback reallocations can
move posix_memalign-backed blocks safely.
Static builds still need a little linker help. macOS app binaries need
the Rust allocator entry points forced in from liblagom-ak.a, while
static ELF links can pull in identical allocator shim definitions from
multiple Rust staticlibs. Keep the Apple -u flags and allow those
duplicate shim symbols for LibJS and LibRegex links on Linux and BSD.
When a multi-digit decimal escape like \81 exceeds the total capture
group count in non-Unicode mode, the parser falls back to legacy octal
reinterpretation. However, digits '8' and '9' are not valid in octal
(base 8), so passing them to parse_legacy_octal() caused an unwrap()
panic on None from char::to_digit(8).
Treat '8' and '9' as literal characters in the fallback path, matching
the behavior already present for the non-backreference case.
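In Annex B terms, \8 and \9 can never be octal, so when the backreference
interpretation fails they fall through to literal digits:

```javascript
// With no capture groups, \81 cannot be a backreference, and '8' is not
// an octal digit, so both characters must match literally.
const m = /\81/.exec("81");
```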
Unicode-set intersection and subtraction always lowered their
post-consumption checks as lookbehinds. That is correct while the
outer matcher runs forward, but inside lookbehind the consumed text
sits to the right of the current position, so the checks must flip
to lookahead instead. Because we always looked left, patterns like
`(?<=[[^A-Z]--[A-Z]])\P{N}` and the reported fuzz case missed
matches whenever the character before the consumed one changed the
set-operation result.
Preserve the surrounding match direction when compiling those
checks, and add coverage for reduced subtraction and intersection
cases plus the original regression.
Negated unicode-set classes are only valid when every member is
single-code-point. We already rejected direct string-valued members
such as `\q{ab}` and `\p{RGI_Emoji_Flag_Sequence}` inside `[^...]`,
but nested class-set operands could still smuggle them through, so
patterns like `[^[[\p{Emoji_Keycap_Sequence}]]]` and the reported
fuzzed literal compiled instead of throwing.
Validate nested class-set expressions after parsing and reject only the
negated `/v` classes whose resulting multi-code-point strings are still
non-empty. Track the exact string members contributed by string
literals, string properties, and nested classes so intersections and
subtractions can eliminate them before the negated-class check runs.
Add constructor and literal coverage for the reduced nested-string
cases, the original regression, and valid negated set operations that
remove every string member.
RegExpBuiltinExec used to snap any Unicode lastIndex that landed on a
low surrogate back to the start of the pair. That matched `/😀/u`,
but it skipped valid empty matches when the original low-surrogate
position was itself matchable, such as `/\p{Script=Cyrillic}?(?<!\D)/v`
on `"A😘"` and the longer fuzzed global case.
Try the snapped position first, then retry the original lastIndex when
the snapped match fails. Only keep that second result when it is empty
at the original low-surrogate position, so consuming /u and /v matches
still cannot split a surrogate pair. In the Rust VM, treat backward
Unicode matches that start between surrogate halves as having no
complete code point to their left, which matches V8's lookbehind
behavior for those positions.
Add reduced coverage for both low-surrogate exec cases, the original
global match count regression, and the consuming-match retry regression.
Compile the synthetic assertion for negated classes in the same
direction as the surrounding matcher. We were hardcoding a
lookahead for `[^...]`, so lookbehind checked the wrong side of the
current position and missed valid `/v` matches such as
`(?<=[^\p{Emoji}])2`.
Apply the same fix to unicode set classes, since they use the same
negative-lookaround-plus-`AnyChar` lowering for complements. Add
reduced `RegExp.js` coverage for both `[^\p{Emoji}]` and
`[[^\p{Emoji}]]` in lookbehind, plus the original complex `/gv`
regression.
Repeated simple loops like "a+".repeat(100) compile to a chain of
greedy loop instructions. When one loop failed, the VM only knew how
to give back one character at a time unless the next instruction was a
literal char, so V8's regexp-fallback case ran into the backtrack
limit instead of finding the obvious match.
When a greedy simple loop is followed by more loops for the same
matcher, sum their minimum counts and backtrack far enough to leave the
missing suffix in one step. If that suffix is already available to the
right, still give back one character so the VM makes progress instead
of reusing the same greedy state forever.
The RegExp runtime test now covers the Chromium regexp-fallback case
through exec(), global exec(), and both replace() paths, plus bounded
same-matcher chains where the suffix minimum is partly missing or
already available.
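A reduced version of the chained-loop shape (the real fallback case chains
far more loops) shows why summed give-back matters; anchored like this,
each `a+` must end up with exactly one character:

```javascript
// Twelve greedy a+ loops over twelve characters. The first loop
// initially swallows everything and must give eleven characters back to
// the loops that follow before the match can succeed.
const chained = new RegExp("^" + "a+".repeat(12) + "$");
const found = chained.test("a".repeat(12));
```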
The VM only marked patterns as anchored when the first real instruction
was AssertStart. That missed anchors hidden behind capture setup or a
leading positive lookahead, so patterns like /(^bar)/ and /(?=^bar)\w+/
fell back to whole-subject scanning.
Teach the hint analysis to see through the non-consuming wrappers we
emit ahead of a leading ^, but still run the literal prefilters before
anchored and sticky VM attempts. Missing required literals should stay
cheap no-matches instead of running the full backtracking VM and
raising the step limit.
The RegExp runtime test now covers the Chromium ascii-regexp-subject
case on a long ASCII input and anchored, sticky, and global no-match
cases where the required literal is absent.
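Both reported shapes only ever match at the start of the subject, which is
what the hint analysis now sees through:

```javascript
// Anchors hidden behind capture setup and a leading lookahead.
const captured = /(^bar)/.exec("barbaz");
const viaLookahead = /(?=^bar)\w+/.exec("barbaz");
const miss = /(^bar)/.test("foo barbaz"); // '^' is not multiline here
```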
Browsers clamp braced quantifier bounds above 2^31 - 1 before
checking whether {min,max} is in order. The parser still kept values
up to u32::MAX, so patterns like {2147483648,2147483647} were
rejected even though both bounds should collapse to the same limit.
Clamp parsed braced quantifier bounds to 2^31 - 1 as they are read.
This keeps the existing acceptance of huge exact and open-ended
quantifiers and makes the constructor and regex literal paths agree
with other engines on the out-of-order edge cases.
The RegExp runtime and syntax tests now cover accepted huge
quantifiers, clamped order validation, and huge literal forms. The
reported constructor and literal cases also match other engines.
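Assuming the clamping behavior described above, the reported out-of-order
edge case looks like this: both bounds exceed 2^31 - 1, so after clamping
they compare equal and the quantifier is accepted.

```javascript
// Raw values are reversed, but both clamp to 2^31 - 1, so {min,max} is
// in order after clamping and construction must not throw.
let accepted = true;
try {
  new RegExp("a{2147483648,2147483647}");
} catch {
  accepted = false;
}
```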
Unicode set intersection and subtraction were compiled by matching one
operand and then checking the others with lookbehind. That let a
longer string operand reject a shorter match whenever the longer
string happened to end at the same position.
Group unicode set operands by exact match length and compile each
length class separately, longest first. This keeps longest-match
semantics for unions while making intersection and subtraction compare
only strings of the same length. The new RegExp runtime cases cover
both the reported [a-z]--\q{abc} regression and the related
intersection/subtraction mismatches, and they now agree with V8.
Reject surrogate pairs in named group names unless both halves come
from the same raw form. A literal surrogate half was being
normalized into \uXXXX before LibRegex parsed the pattern, which let
mixed literal and escaped forms sneak through.
Validate surrogate handling on the UTF-16 pattern before
normalization, but only treat \k<...> as a named backreference when
the parser would do that too. Legacy regexes without named groups
still use \k as an identity escape, so their literal text must not be
rejected by the pre-scan.
Add runtime and syntax tests for the mixed forms, the valid literal,
fixed-width, and braced escape cases, and the legacy \k literals.
Teach import_rust_crate() to track RustFFI.h as a real build output,
and teach the relevant Rust build scripts to rerun when their FFI
inputs change.
Also keep a copy of RustFFI.h in Cargo's own OUT_DIR and restore the
configured FFI output from that cached copy after cargo rustc runs.
This fixes the case where Ninja knows the header is missing, reruns
the custom command, and Cargo exits without rerunning build.rs
because the crate itself is already up to date.
When Cargo leaves multiple hashed build-script outputs behind, pick
the newest root-output before restoring RustFFI.h so we do not copy a
stale header after Rust-side API changes.
Finally, track the remaining Rust-side inputs that could leave build
artifacts stale: LibUnicode and LibJS now rerun build.rs when src/
changes, and the asmintgen rule now depends on Cargo.lock, the
BytecodeDef path dependency, and newly added Rust source files.
Sticky regular expressions were still using the generic forward-search
paths inside LibRegex and only enforcing the lastIndex check back in
LibJS after a match had already been found. That made tokenizer-style
sticky patterns spend most of their time scanning for later matches
that would be thrown away.
Route sticky exec() and test() through an anchored VM entry point that
runs exactly once at the requested start position while keeping the
existing literal-hint pruning. Add focused test-js coverage for sticky
literals, alternations, classes, quantifiers, and WebIDL-style token
patterns.
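The sticky contract the anchored entry point exploits is that a /y regex
attempts exactly one position, lastIndex, and never scans ahead:

```javascript
// Tokenizer-style sticky matching: one attempt per position.
const token = /[a-z]+/y;
const first = token.exec("foo bar");   // "foo"; lastIndex advances to 3
const atSpace = token.exec("foo bar"); // null: index 3 is a space, and
                                       // sticky must not search onward
```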
Analyze parsed patterns for literal substrings that every match
must contain and feed the longest one into the existing
required-literal hint. This lets the VM reject obvious misses
for optional-prefix extractors, literal-prefix parsers, and
common-substring alternations without adding more shape-specific
fast paths.
Use ASCII case-insensitive probes when the required literal comes
from a legacy `/i` pattern, and skip modifier groups that change
ignore-case so the hint does not over-constrain scoped flag
behavior. Keep the analysis within the same 64-code-unit budget
that the VM hint already uses so large fixed quantifiers and long
literal alternations do not blow up compile time.
Add TestRegex coverage for assignment extractors, ASCII
ignore-case prefilters, common-substring alternations, and the
compile-time regression shapes. TestRegex passes, the long-pattern js
probes behave correctly, and regex benchmark spot checks stay
fast.
Extract whole-pattern `\bliteral\b` cases outside Unicode
modes and handle them with literal search plus ASCII
word-boundary checks. This keeps the optimization narrow while
still covering the token-style patterns that show up often in
our regex workloads.
Keep Unicode word-boundary patterns on the VM path, since
`\b` semantics change there. Add tests for plain matching,
ASCII ignore-case matching, and the Unicode fallback case.
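The token-style shape the fast path targets reduces to a literal search
plus a word-character test on each neighbor of the occurrence:

```javascript
// \bcat\b: the literal must not have word characters on either side.
const word = /\bcat\b/;
const inner = word.test("concatenate"); // false: flanked by word chars
const exact = word.test("a cat!");      // true: ' ' and '!' end the word
```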
Extend the literal alternation fast path to cover ASCII-only
`/i` patterns outside Unicode modes. Search candidate start
positions with a folded first-code-unit set, then verify the
alternatives in source order to preserve ECMAScript semantics.
Keep Unicode ignore-case alternations on the VM path, since they
can match non-ASCII code points through canonicalization. Add
tests for mixed-prefix alternatives, source-order selection,
and the Unicode fallback behavior.
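Verifying in source order is what preserves ECMAScript semantics even when
candidates are found case-insensitively: at a shared start position the
earlier alternative wins regardless of length.

```javascript
// Candidates are located with a folded first-code-unit scan, but "ab"
// still beats "aba" at the same start because it appears first.
const picked = "ABAB".match(/ab|aba/i)[0];
```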
Whole-pattern ignore-case literals already bypass the VM, but the
search path still checked every start position linearly with full
case-fold comparisons. For ASCII literals we can cheaply scan for the
first code unit and only verify the rest at candidate positions.
This keeps the existing non-ASCII fallback intact while speeding up
common benchmark shapes such as `/puebzr/i`, `/zfvr/gi`, and
`/##yv22##/gi`. Add tests covering both alphabetic and punctuation-led
ASCII literals under ignore-case matching.
TestRegex passes. In Build/release js microbenchmarks, representative
cases dropped from about 47ms to 38ms, 95ms to 63ms, and 121ms to
79ms.
Patterns like `(?:^|;)\s*foo=...` can only start matching at input
start or at occurrences of the separator, but the generic
start-position loop still entered the VM at each byte and paid the
leading split/backtrack cost on every miss.
Teach the start-position analysis to recognize this `(^|literal)`
shape and jump straight to those candidate positions. Keep the
optimization narrow: wider literal sets would need a single-pass
scanner, and rescanning once per literal would make miss-heavy
alternations quadratic.
Add a LibRegex test for the cookie-style prefix. TestRegex still
passes, and a release js benchmark exercising this shape remains
fast.
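The cookie-style shape in question: every match must begin either at the
start of the input or at a ';', so those are the only candidate starts
worth entering the VM at.

```javascript
const cookie = /(?:^|;)\s*foo=(\w+)/;
const fromSeparator = cookie.exec("bar=1;foo=2")[1]; // found after ';'
const fromStart = cookie.exec("foo=9;bar=1")[1];     // found at input start
```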
Intersection and subtraction chains in unicode sets mode were accepting
bare range operands like `[a-z&&b-y]` and `[a-z--[aeiou]]`. V8 rejects
those forms; the range has to stay inside a nested class before it can
participate in a set operation.
Reject `ClassSetOperand::Range` when building `&&` or `--` expressions,
and extend the runtime regexp tests with the reported invalid patterns
plus an escaped-endpoint range case.
Preserve V8's behavior for bare single-astral literals when a unicode
global search starts in the middle of a surrogate pair. We were
snapping that lastIndex back to the pair start unconditionally,
which let /😀/gu and /\u{1F600}/gu match where V8 returns null.
Expose that literal shape from LibRegex to LibJS and add runtime
coverage for the bare literal case alongside a grouped control.
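The bare-literal behavior being preserved, in exec() terms:

```javascript
// Starting a /u global search in the middle of a surrogate pair: a bare
// astral literal must not match there, and the failed global exec
// resets lastIndex to 0.
const face = /😀/gu;
face.lastIndex = 1; // between the surrogate halves
const result = face.exec("😀");
```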
Track whether a pattern can match empty and only skip interior
surrogate positions when the matcher must consume input. This keeps
the unicode candidate scan fast for consuming searches without
dropping valid zero-width matches such as /\B/ and /(?!...)/ between
a surrogate pair's two code units.
Add runtime coverage for both global lastIndex searches and plain
exec() searches on zero-width unicode patterns.
When a backward greedy loop backtracked toward the right edge of the
input, the optimized scan for a following Char instruction could stop
making progress at end of input and loop forever. This made patterns
like /(?<=a.?)/ hang on non-matching input.
Keep the greedy built-in class fast path aligned with the regular VM
matcher for non-Unicode regexps. Without this, /\w+/i and /\W+/i
wrongly applied Unicode ignore-case behavior in the optimized loop.
Detect literal tails that lie on the linear success path and reject
matches early when that literal never appears in the remaining input.
This lets /(a+)+b/ fail quickly on long runs of a instead of spending
its backtracking budget proving that the missing b can never match.
Keep the tail analysis cheap while doing this. The new required-literal
hint reuses trailing-literal extraction, so rewrite the linear-tail
check to compute predecessor counts in one pass instead of rescanning
the whole program for every instruction. That keeps large regex parses,
including the large bytestring constructor test, fast.
Add a regression test for ("a".repeat(25)).match(/(a+)+b/), which
should return null without throwing a backtrack-limit error.
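A reduced version of the regression shape (shortened from the reported 25
repetitions so the check stays cheap on any engine):

```javascript
// Every match of /(a+)+b/ must contain a literal 'b'; an input without
// one can be rejected without exploring the exponential split tree.
const result = "a".repeat(20).match(/(a+)+b/);
```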
Patterns like `^(a|a?)+$` build a split tree where each `a` can be
matched by either alternative. On short non-matching inputs, that still
blows through the backtrack limit even though the two alternatives are
semantically equivalent to a single greedy optional matcher.
Detect the narrow case of two single-term alternatives that compile to
the same simple matcher, where one is required and the other is greedy
`?`, and compile them as that single optional term instead of emitting
a disjunction.
Add a String.prototype.match regression for
("a".repeat(25) + "b").match(/^(a|a?)+$/), which should return null
instead of throwing an InternalError.
Backward Char instructions read the code point immediately to the left
of the current position, but the greedy loop backtracking optimization
was scanning for the next literal at the current position itself.
That meant a lookbehind like `(?<=h.*)THIS` never reconsidered the
boundary after `h`, so valid matches were missed.
When the quantified part was allowed to shrink to zero, as in the
reported `(?<!.*q.*?)(?<=h.*)THIS(?=.*!)` pattern, the same
backtracking bug could thrash badly enough to appear hung.
Fix the backward greedy-loop scan to test candidate boundaries against
the code units immediately to their left. Do the same for supplementary
characters by checking the surrogate pair ending at that boundary.
Add String.prototype.match regressions for both the simple greedy
lookbehind and the full reported pattern.
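The simple greedy-lookbehind case that the fixed boundary scan now gets
right:

```javascript
// (?<=h.*) must reconsider every boundary after the 'h', testing the
// code unit immediately to the *left* of each candidate position.
const m = "hi THIS!".match(/(?<=h.*)THIS/);
```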
RepeatMatcher retries a quantified atom with its own captures cleared,
but if an additional greedy iteration matches the empty string the
engine must fall back to the pre-iteration state. The fast VM path was
clearing capture registers after backtracking from ProgressCheck,
which meant the restored state from the previous successful iteration
was immediately wiped out.
That showed up with nested quantified captures like
"xyz123xyz".match(/((123)|(xyz)*)*/), where the final empty expansion
of the outer `*` discarded the last non-empty captures and returned
undefined for groups 1 and 4.
The same area also needs to track each zero-width-capable iteration's
start position explicitly. Initializing that state with ProgressCheck
stored the end of the previous repetition instead, which regressed
patterns like `/(a*)*/` by letting an empty iteration commit `""`
into the capture instead of falling back to the pre-iteration state
with an undefined capture.
Clear captures before backtracking from a rejected empty iteration,
and save iteration starts before entering quantified bodies so
ProgressCheck only decides whether that iteration made progress.
Add regressions for the reported nested quantified capture case and
for `/(a*)*/.exec("b")`, which should leave the capture undefined.
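The `/(a*)*/` half of that in observable terms:

```javascript
// The outer '*' attempts one iteration; (a*) matches empty, the
// zero-width iteration is rejected, and the engine must fall back to
// the pre-iteration state, leaving group 1 undefined rather than "".
const m = /(a*)*/.exec("b");
```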
Drop the outdated note that GreedyLoop backtracking does not need a
register snapshot.
The previous commit started snapshotting registers for the optimized
GreedyLoop and LazyLoop states so capture groups are restored correctly
across backtracking. Keeping the old comment would describe the
opposite of the code we now rely on.
Snapshot registers for GreedyLoop and LazyLoop backtrack states so
failed alternatives cannot leak capture mutations into an older loop
choice point.
Before this change, those optimized states only restored the input
position and active modifiers. If a later branch changed capture
registers before failing, revisiting an earlier loop state reused
the stale captures instead of the state that was current when the
loop state was pushed.
That let /^(b+|a){1,2}?bc/ on "bbc" produce an invalid group 1 range
with start 2 and end 1, which later tripped UBSan while
RegExp.prototype.exec materialized the match result.
Add a RegExp.prototype.exec regression for this pattern so we keep
the expected ["bbc", "b"] result covered.
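The regression in exec() form:

```javascript
// Backtracking into the lazy {1,2}? choice point must restore the
// captures that were live when that state was pushed, yielding a valid
// group 1 of "b" rather than a corrupted range.
const m = /^(b+|a){1,2}?bc/.exec("bbc");
```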
Teach the Rust matcher to execute directly on ASCII-backed input.
Make the VM and literal fast paths generic over an input trait so we
can monomorphize separate ASCII and WTF-16 execution paths without
duplicating the regex semantics. Add ASCII-specific FFI entry points
and have the C++ bridge dispatch to them whenever Utf16View carries
ASCII storage.
This removes the per-match widening step from the hot path for exec(),
test(), and find_all(), which is exactly where LibJS often hands us
pure ASCII strings in 8-bit form. Keep the compiled representation
and reported capture offsets in UTF-16 code units so the observable
JavaScript behavior stays unchanged.
Delete the old C++ ECMA-262 parser, optimizer, and matcher now that all
in-tree users compile and execute through `ECMAScriptRegex`.
Stop building the legacy engine, remove its source files and the
POSIX-only fuzzers that depended on it, and update the remaining
LibRegex tests to target the Rust-backed facade instead of the deleted
implementation. Clean up the last includes, comments, and helper paths
that only existed to support the old backend.
After this commit LibRegex has a single ECMAScript engine in-tree,
eliminating duplicated maintenance and unifying future regex work.
Add `ECMAScriptRegex`, LibRegex's C++ facade for ECMAScript regexes.
The facade owns compilation, execution, captures, named groups, and
error translation for the Rust backend, which lets callers stop
depending on the legacy parser and matcher types directly. Use it in the
remaining non-LibJS callers: URLPattern, HTML input pattern handling,
and the places in LibHTTP that only needed token validation.
Where a full regex engine was unnecessary, replace those call sites with
direct character checks. Also update focused LibURL, LibHTTP, and WPT
coverage for the migrated callers and corrected surrogate handling.
Add LibRegex's new Rust ECMAScript regular expression engine.
Replace the old parser's direct pattern-to-bytecode pipeline with a
split architecture: parse patterns into a lossless AST first, then
lower that AST into bytecode for a dedicated backtracking VM. Keep the
syntax tree as the place for validation, analysis, and optimization
instead of teaching every transformation to rewrite partially built
bytecode.
Specialize this backend for the job LibJS actually needs. The old C++
engine shared one generic parser and matcher stack across ECMA-262 and
POSIX modes and supported both byte-string and UTF-16 inputs. The new
engine focuses on ECMA-262 semantics on WTF-16 data, which lets it
model lone surrogates and other JavaScript-specific behavior directly
instead of carrying POSIX and multi-encoding constraints through the
whole implementation.
Fill in the ECMAScript features needed to replace the old engine for
real web workloads: Unicode properties and sets, lookahead and
lookbehind, named groups and backreferences, modifier groups, string
properties, large quantifiers, lone surrogates, and the parser and VM
corner cases those features exercise.
Reshape the runtime around compile-time pattern hints and a hotter VM
loop. Pre-resolve Unicode properties, derive first-character,
character-class, and simple-scan filters, extract safe trailing
literals for anchored patterns, add literal and literal-alternation
fast paths, and keep reusable scratch storage for registers,
backtracking state, and modifier stacks. Teach `find_all` to stay
inside one VM so global searches stop paying setup costs on every
match.
Make those shortcuts semantics-aware instead of merely fast. In Unicode
mode, do not use literal fast paths for lone surrogates, since
ECMA-262 must not let `/\ud83d/u` match inside a surrogate pair.
Likewise, only derive end-anchor suffix hints when the suffix lies on
every path to `Match`, so lookarounds and disjunctions cannot skip into
a shared tail and produce false negatives.
This commit lands the Rust crate, the C++ wrapper, the build
integration, and the initial LibJS-side plumbing needed to exercise
the new engine under real RegExp callers before removing the legacy
backend.
Backreferences can match the empty string when the referenced group
didn't participate in the match, so we shouldn't add their length to the
match_length_minimum, as it makes us skip valid matches.
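The skipped-match shape in question: a backreference to a group that never
participated matches the empty string, so it contributes nothing to the
minimum match length.

```javascript
// Group 1 does not participate, \1 matches empty, and "y" must match.
const matched = /(x)?\1y/.test("y");
```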
The optimizer wasn't considering case-insensitive mode when checking for
overlap between the repeated element and what follows, so patterns like
`/a*A\d/i` failed to match because 'a' and 'A' weren't seen as
overlapping. We now compare them case-insensitively when the i flag is
set.
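The failing shape, for reference:

```javascript
// With /i, 'a' and 'A' overlap: a* greedily consumes "aaA" and must be
// willing to give the 'A' back so the trailing A\d can match.
const m = "aaA5".match(/a*A\d/i);
```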
`range_contains()` checked whether the lhs range was contained within
the query range rather than checking for overlap. This caused patterns
like `/A*[A-Z]/` to fail matching "A" because the optimizer didn't
detect that 'A' overlaps with [A-Z]. Similarly, `char_class_contains()`
only checked whether two character classes were identical, not whether
they overlapped, so patterns like `/\d*\w/` failed to match "1" because
\d and \w were not recognized as overlapping, even though all digits
are word characters.
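Both reported shapes, for reference:

```javascript
// Overlap, not containment or equality, is what matters: A* must give
// its 'A' back to [A-Z], and \d* its '1' back to \w.
const single = "A".match(/A*[A-Z]/)[0];
const digit = "1".match(/\d*\w/)[0];
```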
In non-Unicode mode, incomplete escape sequences like `\x0` or `\u00`
should be parsed as literal characters. `read_digits_as_string` consumed
hex digits but did not restore the parser position when fewer digits
than required were found, and `consume_escaped_code_point` did not
update `current_token` after falling back to literal 'u'.
When `\0` is followed by digits, we backtrack to parse it as a legacy
octal escape. We need to backtrack 2 characters, so
`parse_legacy_octal_escape` sees the leading `0` and can parse sequences
correctly.
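The Annex B behaviors both paragraphs describe, in observable terms:

```javascript
// An \x or \u escape with too few digits falls back to a literal 'x'
// or 'u' followed by the digits that did appear.
const hexFallback = /\x0/.exec("x0")[0];   // "x0"
const uniFallback = /\u00/.exec("u00")[0]; // "u00"

// \0 followed by a digit re-parses as a legacy octal escape.
const octal = /\01/.test("\u0001");
```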
These new traits are identical to `Traits<Integral>`, except that
calling `.hash()` will return the value itself instead of hashing it.
This should be used in cases where either the value is already a proper
hash, or using the value as a hash will yield "good enough" performance
in e.g. HashTable.
Types larger than 32 bits are folded in on themselves. Collision tests
on some popular hashing algorithms show that XOR folding slightly
increases the number of collisions, but this allows `IdentityHashTraits`
not to make any assumptions on which bits are the most relevant for the
final hash.
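A sketch of the folding idea, written here in JavaScript with BigInt
purely for illustration (the real IdentityHashTraits are C++ templates,
and `foldToU32` is a hypothetical name):

```javascript
// XOR-fold a 64-bit value down to a 32-bit hash: the two halves are
// combined, so no fixed window of bits is simply discarded and no
// assumption is made about which bits matter most.
function foldToU32(value) {
  const mask = 0xFFFFFFFFn;
  return ((value >> 32n) ^ (value & mask)) & mask;
}
const folded = foldToU32(0x1234567890ABCDEFn);
```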