Commit Graph

187 Commits

Author SHA1 Message Date
Andreas Kling
3efe8043f7 LibRegex: Optimize (^|literal) split prefixes
Patterns like `(?:^|;)\s*foo=...` can only start matching at input
start or at occurrences of the separator, but the generic
start-position loop still entered the VM at each byte and paid the
leading split/backtrack cost on every miss.

Teach the start-position analysis to recognize this `(^|literal)`
shape and jump straight to those candidate positions. Keep the
optimization narrow: wider literal sets would need a single-pass
scanner, and rescanning once per literal would make miss-heavy
alternations quadratic.

Add a LibRegex test for the cookie-style prefix. TestRegex still
passes, and a release js benchmark exercising this shape remains
fast.
2026-03-29 16:06:57 +02:00
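The candidate-position idea from 3efe8043f7 can be sketched roughly as follows. This is a minimal illustration, not the LibRegex code: `candidate_starts` is a hypothetical name, the separator is assumed to be a single byte, and `^` is assumed to match only at offset 0 (no multiline flag).

```rust
/// Returns every offset where a `(^|sep)`-prefixed pattern could begin:
/// offset 0 (for `^`) plus each occurrence of `sep`, found in one pass.
/// Hypothetical helper; assumes a single-byte separator and no multiline flag.
fn candidate_starts(haystack: &[u8], sep: u8) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, &b) in haystack.iter().enumerate() {
        // Offset 0 is already present, so skip it to avoid a duplicate.
        if b == sep && i != 0 {
            starts.push(i);
        }
    }
    starts
}
```

The matcher would then enter the VM only at the returned offsets instead of at every byte. A single pass keeps the scan linear, which is the property the commit message wants to preserve.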
Andreas Kling
a26a76e3ac LibRegex: Reject bare ranges in /v set operations
Intersection and subtraction chains in unicode sets mode were accepting
bare range operands like `[a-z&&b-y]` and `[a-z--[aeiou]]`. V8 rejects
those forms; the range has to stay inside a nested class before it can
participate in a set operation.

Reject `ClassSetOperand::Range` when building `&&` or `--` expressions,
and extend the runtime regexp tests with the reported invalid patterns
plus an escaped-endpoint range case.
2026-03-27 17:32:19 +01:00
Andreas Kling
f627b7dcbb LibRegex: Respect V8 astral literal lastIndex behavior
Preserve V8's behavior for bare single-astral literals when a unicode
global search starts in the middle of a surrogate pair. We were
snapping that lastIndex back to the pair start unconditionally,
which let /😀/gu and /\u{1F600}/gu match where V8 returns null.

Expose that literal shape from LibRegex to LibJS and add runtime
coverage for the bare literal case alongside a grouped control.
2026-03-27 17:32:19 +01:00
Andreas Kling
29c2fb9574 LibRegex: Keep empty-match surrogate candidates
Track whether a pattern can match empty and only skip interior
surrogate positions when the matcher must consume input. This keeps
the unicode candidate scan fast for consuming searches without
dropping valid zero-width matches such as /\B/ and /(?!...)/ between
a surrogate pair's two code units.

Add runtime coverage for both global lastIndex searches and plain
exec() searches on zero-width unicode patterns.
2026-03-27 17:32:19 +01:00
Andreas Kling
b96872140e LibRegex: Fix backward greedy lookbehind backtracking
When a backward greedy loop backtracked toward the right edge of the
input, the optimized scan for a following Char instruction could stop
making progress at end of input and loop forever. This made patterns
like /(?<=a.?)/ hang on non-matching input.
2026-03-27 17:32:19 +01:00
Andreas Kling
a03d1f8a5f LibRegex: Fix greedy \w and \W ignore-case handling
Keep the greedy built-in class fast path aligned with the regular VM
matcher for non-Unicode regexps. Without this, /\w+/i and /\W+/i
wrongly applied Unicode ignore-case behavior in the optimized loop.
2026-03-27 17:32:19 +01:00
Andreas Kling
59125dc6b1 LibRegex: Fail fast when matches need a missing literal
Detect literal tails that lie on the linear success path and reject
matches early when that literal never appears in the remaining input.
This lets /(a+)+b/ fail quickly on long runs of a instead of spending
its backtracking budget proving that the missing b can never match.

Keep the tail analysis cheap while doing this. The new required-literal
hint reuses trailing-literal extraction, so rewrite the linear-tail
check to compute predecessor counts in one pass instead of rescanning
the whole program for every instruction. That keeps large regex parses,
including the large bytestring constructor test, fast.

Add a regression test for ("a".repeat(25)).match(/(a+)+b/), which
should return null without throwing a backtrack-limit error.
2026-03-27 17:32:19 +01:00
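The fail-fast check described in 59125dc6b1 might look like the sketch below. The function name and the byte-slice input are assumptions made for illustration; the real engine derives the required literal from its compiled program and works on WTF-16 or ASCII input.

```rust
/// Returns false when `required` cannot occur in `input[start..]`, which
/// means no match beginning at or after `start` is possible.
/// Hypothetical helper operating on bytes for simplicity.
fn required_literal_present(input: &[u8], start: usize, required: &[u8]) -> bool {
    if required.is_empty() {
        return true; // nothing is required, so nothing can be missing
    }
    match input.get(start..) {
        // `windows` yields every contiguous slice of the literal's length.
        Some(rest) => rest.windows(required.len()).any(|w| w == required),
        None => false, // `start` lies past the end of the input
    }
}
```

For `/(a+)+b/` the required literal would be `b`, so a run of 25 `a` characters is rejected by this check before any backtracking starts.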
Andreas Kling
6e83bf0301 LibRegex: Collapse simple a|a?-style disjunctions
Patterns like `^(a|a?)+$` build a split tree where each `a` can be
matched by either alternative. On short non-matching inputs that still
blows through the backtrack limit even though the two alternatives are
semantically equivalent to a single greedy optional matcher.

Detect the narrow case of two single-term alternatives that compile to
the same simple matcher, where one is required and the other is greedy
`?`, and compile them as that single optional term instead of emitting
a disjunction.

Add a String.prototype.match regression for
("a".repeat(25) + "b").match(/^(a|a?)+$/), which should return null
instead of throwing an InternalError.
2026-03-27 17:32:19 +01:00
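As a rough illustration of the rewrite in 6e83bf0301, the sketch below collapses `x|x?` (in either order) into `x?` on a toy AST. The `Node` type and `collapse` function are hypothetical and far simpler than the engine's real syntax tree; only single-character terms are handled.

```rust
/// Toy AST, assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Char(char),
    /// Greedy `x?`
    Optional(Box<Node>),
    /// `left|right`
    Alt(Box<Node>, Box<Node>),
}

/// Rewrites `x|x?` or `x?|x` into `x?` when both sides wrap the same
/// single-character matcher; any other node is returned unchanged.
fn collapse(node: Node) -> Node {
    match node {
        Node::Alt(left, right) => match (*left, *right) {
            (Node::Char(a), Node::Optional(inner)) if *inner == Node::Char(a) => {
                Node::Optional(inner)
            }
            (Node::Optional(inner), Node::Char(a)) if *inner == Node::Char(a) => {
                Node::Optional(inner)
            }
            // Not the narrow shape: rebuild the disjunction as it was.
            (l, r) => Node::Alt(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

With the disjunction gone, each `a` has a single way to match, so `^(a|a?)+$` no longer multiplies choice points on every input character.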
Andreas Kling
e506dabc05 LibRegex: Fix greedy lookbehind backtracking boundaries
Backward Char instructions read the code point immediately to the left
of the current position, but the greedy loop backtracking optimization
was scanning for the next literal at the current position itself.
That meant a lookbehind like `(?<=h.*)THIS` never reconsidered the
boundary after `h`, so valid matches were missed.

When the quantified part was allowed to shrink to zero, as in the
reported `(?<!.*q.*?)(?<=h.*)THIS(?=.*!)` pattern, the same
backtracking bug could thrash badly enough to appear hung.

Fix the backward greedy-loop scan to test candidate boundaries against
the code units immediately to their left. Do the same for supplementary
characters by checking the surrogate pair ending at that boundary.

Add String.prototype.match regressions for both the simple greedy
lookbehind and the full reported pattern.
2026-03-27 17:32:19 +01:00
Andreas Kling
4f6be8ab5d LibRegex: Preserve captures when loops reject empty matches
RepeatMatcher retries a quantified atom with its own captures cleared,
but if an additional greedy iteration matches the empty string the
engine must fall back to the pre-iteration state. The fast VM path was
clearing capture registers after backtracking from ProgressCheck,
which meant the restored state from the previous successful iteration
was immediately wiped out.

That showed up with nested quantified captures like
"xyz123xyz".match(/((123)|(xyz)*)*/), where the final empty expansion
of the outer `*` discarded the last non-empty captures and returned
undefined for groups 1 and 3.


The same area also needs to track each zero-width-capable iteration's
start position explicitly. Initializing that state with ProgressCheck
stored the end of the previous repetition instead, which regressed
patterns like `/(a*)*/` by letting an empty iteration commit `""`
into the capture instead of falling back to the pre-iteration state
with an undefined capture.

Clear captures before backtracking from a rejected empty iteration,
and save iteration starts before entering quantified bodies so
ProgressCheck only decides whether that iteration made progress.

Add regressions for the reported nested quantified capture case and
for `/(a*)*/.exec("b")`, which should leave the capture undefined.
2026-03-27 17:32:19 +01:00
Andreas Kling
a08b334603 LibRegex: Remove stale GreedyLoop snapshot comment
Drop the outdated note that GreedyLoop backtracking does not need a
register snapshot.

The previous commit started snapshotting registers for the optimized
GreedyLoop and LazyLoop states so capture groups are restored correctly
across backtracking. Keeping the old comment would describe the
opposite of the code we now rely on.
2026-03-27 17:32:19 +01:00
Andreas Kling
a835a25a65 LibRegex: Restore capture state during loop backtracking
Snapshot registers for GreedyLoop and LazyLoop backtrack states so
failed alternatives cannot leak capture mutations into an older loop
choice point.

Before this change, those optimized states only restored the input
position and active modifiers. If a later branch changed capture
registers before failing, revisiting an earlier loop state reused
the stale captures instead of the state that was current when the
loop state was pushed.

That let /^(b+|a){1,2}?bc/ on "bbc" produce an invalid group 1 range
with start 2 and end 1, which later tripped UBSan while
RegExp.prototype.exec materialized the match result.

Add a RegExp.prototype.exec regression for this pattern so we keep
the expected ["bbc", "b"] result covered.
2026-03-27 17:32:19 +01:00
Andreas Kling
dc2e9bbe91 LibRegex: Avoid widening ASCII regex input
Teach the Rust matcher to execute directly on ASCII-backed input.

Make the VM and literal fast paths generic over an input trait so we
can monomorphize separate ASCII and WTF-16 execution paths without
duplicating the regex semantics. Add ASCII-specific FFI entry points
and have the C++ bridge dispatch to them whenever Utf16View carries
ASCII storage.

This removes the per-match widening step from the hot path for exec(),
test(), and find_all(), which is exactly where LibJS often hands us
pure ASCII strings in 8-bit form. Keep the compiled representation
and reported capture offsets in UTF-16 code units so the observable
JavaScript behavior stays unchanged.
2026-03-27 17:32:19 +01:00
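The "generic over an input trait" approach from dc2e9bbe91 can be illustrated with a small sketch. The trait and function names here (`Input`, `unit_count`, `unit_at`, `find_literal`) are invented for this example and are not the engine's actual API.

```rust
/// Hypothetical abstraction over the two storage forms.
trait Input {
    /// Number of code units in the input.
    fn unit_count(&self) -> usize;
    /// The code unit at `index`, widened to UTF-16 on access.
    fn unit_at(&self, index: usize) -> u16;
}

impl Input for [u8] {
    fn unit_count(&self) -> usize {
        self.len()
    }
    fn unit_at(&self, index: usize) -> u16 {
        u16::from(self[index]) // ASCII byte maps to the same code unit
    }
}

impl Input for [u16] {
    fn unit_count(&self) -> usize {
        self.len()
    }
    fn unit_at(&self, index: usize) -> u16 {
        self[index]
    }
}

/// One generic search routine; the compiler emits a separate copy for
/// `[u8]` and `[u16]`, so neither path pays for dynamic dispatch.
fn find_literal<I: Input + ?Sized>(input: &I, needle: &[u16], from: usize) -> Option<usize> {
    let n = needle.len();
    let len = input.unit_count();
    if n > len {
        return None;
    }
    (from..=len - n).find(|&start| (0..n).all(|k| input.unit_at(start + k) == needle[k]))
}
```

Because every ASCII byte corresponds to exactly one UTF-16 code unit, offsets found on the 8-bit path are already valid UTF-16 offsets, which is consistent with the commit's note that reported capture offsets stay in UTF-16 code units.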
Andreas Kling
d7bf9d3898 LibRegex: Remove the legacy C++ ECMA-262 engine
Delete the old C++ ECMA-262 parser, optimizer, and matcher now that all
in-tree users compile and execute through `ECMAScriptRegex`.

Stop building the legacy engine, remove its source files and the
POSIX-only fuzzers that depended on it, and update the remaining
LibRegex tests to target the Rust-backed facade instead of the deleted
implementation. Clean up the last includes, comments, and helper paths
that only existed to support the old backend.

After this commit LibRegex has a single ECMAScript engine in-tree,
eliminating duplicated maintenance and unifying future regex work.
2026-03-27 17:32:19 +01:00
Andreas Kling
34d954e2d7 LibRegex: Add ECMAScriptRegex and migrate callers
Add `ECMAScriptRegex`, LibRegex's C++ facade for ECMAScript regexes.

The facade owns compilation, execution, captures, named groups, and
error translation for the Rust backend, which lets callers stop
depending on the legacy parser and matcher types directly. Use it in the
remaining non-LibJS callers: URLPattern, HTML input pattern handling,
and the places in LibHTTP that only needed token validation.

Where a full regex engine was unnecessary, replace those call sites with
direct character checks. Also update focused LibURL, LibHTTP, and WPT
coverage for the migrated callers and corrected surrogate handling.
2026-03-27 17:32:19 +01:00
Andreas Kling
66fb0a8394 LibRegex/Rust: Add the ECMA-262 regex engine
Add LibRegex's new Rust ECMAScript regular expression engine.

Replace the old parser's direct pattern-to-bytecode pipeline with a
split architecture: parse patterns into a lossless AST first, then
lower that AST into bytecode for a dedicated backtracking VM. Keep the
syntax tree as the place for validation, analysis, and optimization
instead of teaching every transformation to rewrite partially built
bytecode.

Specialize this backend for the job LibJS actually needs. The old C++
engine shared one generic parser and matcher stack across ECMA-262 and
POSIX modes and supported both byte-string and UTF-16 inputs. The new
engine focuses on ECMA-262 semantics on WTF-16 data, which lets it
model lone surrogates and other JavaScript-specific behavior directly
instead of carrying POSIX and multi-encoding constraints through the
whole implementation.

Fill in the ECMAScript features needed to replace the old engine for
real web workloads: Unicode properties and sets, lookahead and
lookbehind, named groups and backreferences, modifier groups, string
properties, large quantifiers, lone surrogates, and the parser and VM
corner cases those features exercise.

Reshape the runtime around compile-time pattern hints and a hotter VM
loop. Pre-resolve Unicode properties, derive first-character,
character-class, and simple-scan filters, extract safe trailing
literals for anchored patterns, add literal and literal-alternation
fast paths, and keep reusable scratch storage for registers,
backtracking state, and modifier stacks. Teach `find_all` to stay
inside one VM so global searches stop paying setup costs on every
match.

Make those shortcuts semantics-aware instead of merely fast. In Unicode
mode, do not use literal fast paths for lone surrogates, since
ECMA-262 must not let `/\ud83d/u` match inside a surrogate pair.
Likewise, only derive end-anchor suffix hints when the suffix lies on
every path to `Match`, so lookarounds and disjunctions cannot skip into
a shared tail and produce false negatives.

This commit lands the Rust crate, the C++ wrapper, the build
integration, and the initial LibJS-side plumbing needed to exercise
the new engine under real RegExp callers before removing the legacy
backend.
2026-03-27 17:32:19 +01:00
Ali Mohammad Pur
b50c53106a LibRegex: Flatten bytecode when printing for debug 2026-03-22 12:24:31 +01:00
Ali Mohammad Pur
62348f1214 LibRegex: Replace virtual-inheritance dispatch with a switch interpreter 2026-03-20 16:10:25 -05:00
aplefull
6f1b7c8d50 LibRegex: Track optional capture groups for match_length_minimum
Backreferences can match the empty string when the referenced group
didn't participate in the match, so we shouldn't add their length to the
match_length_minimum, as it makes us skip valid matches.
2026-02-26 13:50:11 +01:00
aplefull
53a98f26d4 LibRegex: Exclude lookahead assertions from match_length_minimum
Lookaheads are zero-width assertions and should not affect the minimum
match length.
2026-02-26 13:50:11 +01:00
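The two match_length_minimum fixes above (6f1b7c8d50 and 53a98f26d4) can be summarized with a toy minimum-length computation. The `Term` type and its fields are assumptions for illustration; the actual optimizer works on bytecode rather than a tree like this.

```rust
/// Toy pattern tree, assumed for illustration only.
enum Term {
    Literal(String),
    /// `(?=...)`: zero-width, consumes nothing.
    Lookahead(Box<Term>),
    /// `\N`: matches empty if group N did not participate in the match.
    Backreference { group_may_be_unset: bool, group_min_len: usize },
    Concat(Vec<Term>),
}

/// Lower bound on the number of characters a match must consume.
fn min_len(term: &Term) -> usize {
    match term {
        Term::Literal(s) => s.chars().count(),
        // Lookaheads assert without consuming, so they add nothing.
        Term::Lookahead(_) => 0,
        Term::Backreference { group_may_be_unset, group_min_len } => {
            if *group_may_be_unset { 0 } else { *group_min_len }
        }
        Term::Concat(parts) => parts.iter().map(min_len).sum(),
    }
}
```

Overestimating this bound is what caused valid matches to be skipped: a start position is rejected whenever fewer than `min_len` characters remain.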
aplefull
db76c1e27c LibRegex: Account for case-insensitive matching in optimizer
The optimizer wasn't considering case-insensitive mode when checking for
overlap between the repeated element and what follows, so patterns like
`/a*A\d/i` failed to match because 'a' and 'A' weren't seen as
overlapping. We now compare them case-insensitively when the i flag is
set.
2026-02-26 13:50:11 +01:00
aplefull
292c0cc486 LibRegex: Detect overlapping character classes and ranges in optimizer
`range_contains()` checked if an lhs_range was contained within the
query range, rather than checking for overlap. This caused patterns
like `/A*[A-Z]/` to fail matching "A" because the optimizer didn't
detect that 'A' overlaps with [A-Z]. And `char_class_contains()` only
checked if two character classes were identical, not if they overlapped.
So patterns like `/\d*\w/` failed to match "1" because \d and \w were
not recognized as overlapping, even though all digits are word
characters.
2026-02-26 13:50:11 +01:00
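The difference between containment and overlap that 292c0cc486 describes fits in two small predicates. This is a sketch with hypothetical names, using inclusive `(low, high)` code point pairs with `low <= high`.

```rust
/// True when `inner` lies entirely inside `outer` (inclusive bounds).
fn range_contains(outer: (u32, u32), inner: (u32, u32)) -> bool {
    outer.0 <= inner.0 && inner.1 <= outer.1
}

/// True when the two inclusive ranges share at least one code point.
fn ranges_overlap(a: (u32, u32), b: (u32, u32)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}
```

For `/A*[A-Z]/`, the repeated `A` is the single-point range `('A', 'A')`. The class `[A-Z]` is not contained in it, but the two do overlap, and overlap is what determines whether the loop and the following term can compete for the same character.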
aplefull
2ac99312b0 LibRegex: Restore parser state for incomplete \x and \u escapes
In non-Unicode mode, incomplete escape sequences like `\x0` or `\u00`
should be parsed as literal characters. `read_digits_as_string` consumed
hex digits but did not restore the parser position when fewer digits
than required were found, and `consume_escaped_code_point` did not
update `current_token` after falling back to literal 'u'.
2026-02-26 13:50:11 +01:00
aplefull
96d6dba37c LibRegex: Backtrack 2 characters for legacy octal escapes
When `\0` is followed by digits, we backtrack to parse it as a legacy
octal escape. We need to backtrack 2 characters, so
`parse_legacy_octal_escape` sees the leading `0` and can parse sequences
correctly.
2026-02-26 13:50:11 +01:00
Jelle Raaijmakers
f175a00003 AK: Add and use IdentityHashTraits<Integral>
These new traits are identical to `Traits<Integral>`, except that
calling `.hash()` will return the value itself instead of hashing it.
This should be used in cases where either the value is already a proper
hash, or using the value as a hash will yield "good enough" performance
in e.g. HashTable.

Types larger than 32 bits are folded in on themselves. Collision tests
on some popular hashing algorithms show that XOR folding slightly
increases the number of collisions, but this allows `IdentityHashTraits`
not to make any assumptions on which bits are the most relevant for the
final hash.
2026-02-24 13:24:58 +01:00
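XOR folding of a 64-bit value into 32 bits, as mentioned in f175a00003, can be sketched like this. The function name is made up, and the real `IdentityHashTraits` is a C++ template in AK, so this only shows the arithmetic.

```rust
/// Folds a 64-bit value into 32 bits by XOR-ing its upper and lower halves,
/// so both halves contribute to the result.
fn fold_u64(value: u64) -> u32 {
    (value ^ (value >> 32)) as u32
}
```

Values that already fit in 32 bits pass through unchanged, which matches the "identity" intent, while wider values still let every input bit influence the hash.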
Jelle Raaijmakers
04719e6491 LibRegex: Use Fibonacci hashing for regex matches
Compared to the old boost-style combine, this reduces collisions by 36%
after folding for ranges SP=0..300, IP=0..750 and reps=0..50.
2026-02-24 13:24:58 +01:00
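For readers unfamiliar with the term used in 04719e6491, Fibonacci hashing multiplies the key by 2^64 divided by the golden ratio and keeps the top bits. The sketch below shows only that general technique; how LibRegex combines SP, IP, and repetition counts into a key is not stated in the commit message and is not modeled here.

```rust
/// 2^64 divided by the golden ratio, rounded to an odd integer.
const GOLDEN_RATIO_64: u64 = 0x9E37_79B9_7F4A_7C15;

/// Maps `key` to a value in `0..2^bits` using multiplicative (Fibonacci)
/// hashing. `bits` must be in `1..=64`.
fn fibonacci_hash(key: u64, bits: u32) -> u64 {
    debug_assert!((1..=64).contains(&bits));
    // Wrapping multiply, then keep the `bits` most significant bits.
    key.wrapping_mul(GOLDEN_RATIO_64) >> (64 - bits)
}
```

Taking the high bits matters: they depend on every bit of the key, so consecutive keys are spread far apart across the table.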
Jelle Raaijmakers
57a9c795d3 LibGfx+LibRegex: Prevent double hashing of values
We had three instances of `pair_int_hash()` being called with a value
that was pulled through `u32_hash()`, which is not necessary - both
arguments to `pair_int_hash()` will be properly hashed.
2026-02-23 16:44:07 +01:00
Ben Wiederhake
7fb7025d69 LibRegex: Remove unused header in Regex 2026-02-23 12:15:23 +01:00
Jelle Raaijmakers
1745926fc6 AK+Everywhere: Use MurmurHash3 for int/u64 hashing
Rework our hash functions a bit for significantly better performance:

* Rename int_hash to u32_hash to mirror u64_hash.
* Make pair_int_hash call u64_hash instead of multiple u32_hash()es.
* Implement MurmurHash3's fmix32 and fmix64 for u32_hash and u64_hash.

On my machine, this speeds up u32_hash by 20%, u64_hash by ~290%, and
pair_int_hash by ~260%.

We lose the property that an input of 0 results in something that is not
0. I've experimented with an offset to both hash functions, but it
resulted in a measurable performance degradation for u64_hash. If
there's a good use case for 0 not to result in 0, we can always add in
that offset as a countermeasure in the future.
2026-02-20 22:47:24 +01:00
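For reference, MurmurHash3's two finalizers, which 1745926fc6 names as the basis for `u32_hash` and `u64_hash`, look like this when transcribed to Rust. This is a transcription of the public MurmurHash3 mixing steps, not AK's C++ source.

```rust
/// MurmurHash3 32-bit finalizer.
fn fmix32(mut h: u32) -> u32 {
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

/// MurmurHash3 64-bit finalizer.
fn fmix64(mut k: u64) -> u64 {
    k ^= k >> 33;
    k = k.wrapping_mul(0xff51_afd7_ed55_8ccd);
    k ^= k >> 33;
    k = k.wrapping_mul(0xc4ce_b9fe_1a85_ec53);
    k ^= k >> 33;
    k
}
```

Every step is a shift-XOR or a multiplication, so an input of 0 stays 0 throughout. That is the lost property the commit message mentions.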
aplefull
aeec2c804c LibRegex: Implement Unicode case-insensitive matching
Previously, case-insensitive regex matching used ASCII-only case
conversion (to_ascii_lowercase) even for Unicode characters.

Now we implement the Canonicalize abstract operation, so we can
case-fold Unicode characters properly during case-insensitive matching.
2026-02-16 07:51:00 -05:00
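To give a feel for what Canonicalize does, here is a rough sketch of its non-Unicode-mode branch from ECMA-262, using Rust's standard uppercase mapping. It is an approximation under stated assumptions: it works on `char` rather than UTF-16 code units, it omits the Unicode-mode branch (which uses simple case folding), and it is not the code added in aeec2c804c.

```rust
/// Approximates ECMA-262 Canonicalize for non-Unicode ignore-case matching.
fn canonicalize(ch: char) -> char {
    let mut upper = ch.to_uppercase();
    // If the uppercase form is not exactly one character, keep `ch`.
    let (Some(u), None) = (upper.next(), upper.next()) else {
        return ch;
    };
    // The result must also fit in a single UTF-16 code unit.
    if u.len_utf16() != 1 {
        return ch;
    }
    // A non-ASCII character must never canonicalize to an ASCII one.
    if (ch as u32) >= 128 && (u as u32) < 128 {
        return ch;
    }
    u
}
```

Two characters then match case-insensitively when their canonical forms are equal. The ASCII guard is why `ſ` (U+017F) does not match `s` in this mode even though its uppercase form is `S`.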
Ali Mohammad Pur
01be1ed583 LibRegex: Mark OpCode_classes with REGEX_API 2026-02-07 14:09:56 +01:00
Ali Mohammad Pur
6aba31ba13 LibRegex: Add some FileCheck-like tests to ensure opts don't break 2026-02-07 14:09:56 +01:00
Ali Mohammad Pur
fedf0f78ca LibRegex: Reject RSeekTo crossing the current-to-EOL boundary 2026-02-07 14:09:56 +01:00
Ali Mohammad Pur
f4d4bd9ed1 LibRegex: Ignore 'FailIfEmpty' in dot-star loop detection 2026-02-07 14:09:56 +01:00
mikiubo
5aaf08c7cf LibRegex: Make RegexDebug resilient to empty state vectors
Avoid crashing in RegexDebug when saved_positions or step_backs are
empty. These cases are already handled correctly by the bytecode
execution, but the debug output assumed non-empty vectors.

Print a placeholder instead when no entries are present.
This fixes #7502.
2026-01-21 14:20:08 +01:00
aplefull
e4572aa9d7 LibRegex: Add support for regex modifiers
This commit implements the regexp-modifiers proposal, which allows the
i, m, and s flags to be changed within a group using the
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
2026-01-16 15:00:00 +01:00
aplefull
6ce312e22f LibRegex: Prevent empty matches in optional quantifiers
Step 2.b of the RepeatMatcher states that once minimum repetitions
are satisfied, empty matches should not be considered for further
repetitions. This was not being enforced for optional quantifiers
like `?`, so we had extra capture group matches.
2026-01-16 01:11:24 +01:00
mikiubo
535d2476a7 LibRegex: Implement proper lookbehind via new StepBack opcodes
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.

These opcodes replace the previous GoBack-based approach and enable
correct handling of variable-length lookbehind patterns, where the
match length cannot be known statically.

Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.

Partially fix issue: #3459
2026-01-11 23:24:49 +01:00
Jelle Raaijmakers
ae20ecf857 AK+Everywhere: Add Vector::contains(predicate) and use it
No functional changes.
2026-01-08 15:27:30 +00:00
Ali Mohammad Pur
2677338f43 LibRegex: Process RSeekTo candidates in the correct order 2026-01-07 00:14:02 +01:00
Ali Mohammad Pur
9668927dfc LibRegex: Don't generate duplicate results for /.*/ patterns
Since the code pattern may span multiple blocks, this can generate
duplicate results; keep the last one to avoid corrupting the bytecode.
2026-01-06 19:09:27 +01:00
Ali Mohammad Pur
363f1f6568 LibRegex: Correctly calculate ForkIf target offset in tree alternatives 2026-01-06 19:09:27 +01:00
Ali Mohammad Pur
41ce1023b8 LibRegex: Add default initialisers to ParserResult to make gcc happy 2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
fbd898fb54 LibRegex: Use nicer rewrite APIs where possible
Co-Authored-By: Hendiadyoin1 <leon.a@serenityos.org>
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
c1535ef65b LibRegex: Skip multi-op compare overhead when not necessary 2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
637d47ba30 LibRegex: Add an optimisation for replacing /.*x/ with a seek op
This will avoid some catastrophic backtracking by just skipping to 'x'.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
77d982d6fe LibRegex: Restore the pure substring search optimisation for u16view
ca2f0141f6 removed only the execution side of this, which made it skip
some optimisations for pure string searches. This commit implements it
properly for utf16 strings instead.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
e2c6918cdb LibRegex: Fuse consecutive single-char Compares into a String Compare
This avoids huge instruction decoding and dispatch overhead, ~40x
performance improvement for /(^|x)ppp/.
2026-01-05 18:22:11 +01:00
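The fusion in e2c6918cdb is essentially a peephole pass. The sketch below shows the idea on a made-up `Op` enum: `Op`, `fuse_chars`, and `flush` are hypothetical, and a real pass would also have to avoid fusing across jump targets, which this toy ignores.

```rust
/// Toy instruction set, assumed for illustration only.
#[derive(Debug, PartialEq)]
enum Op {
    Char(char),
    Str(String),
    Other(&'static str),
}

/// Replaces each run of two or more consecutive `Char` ops with one `Str`.
fn fuse_chars(ops: Vec<Op>) -> Vec<Op> {
    let mut out = Vec::new();
    let mut run = String::new();
    for op in ops {
        match op {
            Op::Char(c) => run.push(c),
            other => {
                flush(&mut run, &mut out);
                out.push(other);
            }
        }
    }
    flush(&mut run, &mut out);
    out
}

/// Emits the pending run (if any) and leaves `run` empty.
fn flush(run: &mut String, out: &mut Vec<Op>) {
    match run.chars().count() {
        0 => {}
        1 => out.push(Op::Char(run.chars().next().unwrap())),
        _ => out.push(Op::Str(std::mem::take(run))),
    }
    run.clear();
}
```

For the `ppp` in `/(^|x)ppp/`, three decode-and-dispatch steps become one string comparison, which is where the commit message locates the overhead.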
Ali Mohammad Pur
9d49fafdbf LibRegex: Add an optimisation to skip forks that cannot produce a match
...and implement it for 'start of line' checks.
This makes patterns like /(^|x)ppp/ fork-free at runtime, ~30% perf
improvement for that pattern.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
0acac7f02b LibRegex: Split basic blocks at jump targets too 2026-01-05 18:22:11 +01:00