176 Commits

Author SHA1 Message Date
Aliaksandr Kalenik
7685a5e14a LibRegex: Inline \w and \d ranges in legacy positive char classes
Previously, a character class containing any builtin (\d, \w, \s)
forced the compiler down the slow "complex class" path, which emits
a disjunction of alternatives and backtracks at runtime.

For non-unicode, non-unicode-sets, non-negated classes, \w and \d
can be inlined as their raw ASCII code-point ranges. The resulting
class stays on the fast path and compiles into a single sorted
CharClass instruction.

The unicode/unicode_sets and negation guards are required for
correctness: with the /u + /i flags, \w gains non-ASCII members
via case folding (e.g. U+017F, U+212A), and negated classes have
a separate, smarter compilation path.
2026-04-21 16:36:38 +02:00
Andreas Kling
e4008393d3 LibRegex: Extract broader required-literal prefilters
Analyze parsed patterns for literal substrings that every match
must contain and feed the longest one into the existing
required-literal hint. This lets the VM reject obvious misses
for optional-prefix extractors, literal-prefix parsers, and
common-substring alternations without adding more shape-specific
fast paths.

Use ASCII case-insensitive probes when the required literal comes
from a legacy `/i` pattern, and skip modifier groups that change
ignore-case so the hint does not over-constrain scoped flag
behavior. Keep the analysis within the same 64-code-unit budget
that the VM hint already uses so large fixed quantifiers and long
literal alternations do not blow up compile time.

Add TestRegex coverage for assignment extractors, ASCII ignore-
case prefilters, common-substring alternations, and the compile-
time regression shapes. TestRegex passes, the long-pattern js
probes behave correctly, and regex benchmark spot checks stay
fast.
2026-03-29 16:06:57 +02:00
Andreas Kling
0f46413baa LibRegex: Speed up \bliteral\b fast paths
Extract whole-pattern `\bliteral\b` cases outside Unicode
modes and handle them with literal search plus ASCII word-
boundary checks. This keeps the optimization narrow while
still covering the token-style patterns that show up often in
our regex workloads.

Keep Unicode word-boundary patterns on the VM path, since
`\b` semantics change there. Add tests for plain matching,
ASCII ignore-case matching, and the Unicode fallback case.
2026-03-29 16:06:57 +02:00
Andreas Kling
5b8af39167 LibRegex: Speed up ASCII ignore-case literal alternations
Extend the literal alternation fast path to cover ASCII-only
`/i` patterns outside Unicode modes. Search candidate start
positions with a folded first-code-unit set, then verify the
alternatives in source order to preserve ECMAScript semantics.

Keep Unicode ignore-case alternations on the VM path, since they
can match non-ASCII code points through canonicalization. Add
tests for mixed-prefix alternatives, source-order selection,
and the Unicode fallback behavior.
2026-03-29 16:06:57 +02:00
Andreas Kling
72a3dce507 LibRegex: Speed up ASCII ignore-case literal search
Whole-pattern ignore-case literals already bypass the VM, but the
search path still checked every start position linearly with full
case-fold comparisons. For ASCII literals we can cheaply scan for the
first code unit and only verify the rest at candidate positions.

This keeps the existing non-ASCII fallback intact while speeding up
common benchmark shapes such as `/puebzr/i`, `/zfvr/gi`, and
`/##yv22##/gi`. Add tests covering both alphabetic and punctuation-led
ASCII literals under ignore-case matching.

TestRegex passes. In Build/release js microbenchmarks, representative
cases dropped from about 47ms to 38ms, 95ms to 63ms, and 121ms to
79ms.
2026-03-29 16:06:57 +02:00
Andreas Kling
3efe8043f7 LibRegex: Optimize (^|literal) split prefixes
Patterns like `(?:^|;)\s*foo=...` can only start matching at input
start or at occurrences of the separator, but the generic
start-position loop still entered the VM at each byte and paid the
leading split/backtrack cost on every miss.

Teach the start-position analysis to recognize this `(^|literal)`
shape and jump straight to those candidate positions. Keep the
optimization narrow: wider literal sets would need a single-pass
scanner, and rescanning once per literal would make miss-heavy
alternations quadratic.

Add a LibRegex test for the cookie-style prefix. TestRegex still
passes, and a release js benchmark exercising this shape remains
fast.
2026-03-29 16:06:57 +02:00
Andreas Kling
d6b8a5c872 LibRegex: Restore broad ECMA-262 test coverage
Port a broad slice of the old TestRegex.cpp coverage to the new
ECMAScriptRegex API so we keep exercising the Rust engine with the
same kinds of parser and matcher edge cases that used to live in the
legacy C++ suite.

Restore checks for parser acceptance and rejection, lookbehind
captures, quantified alternations, zero-width backreferences,
undefined backreferences, optional groups with empty matches,
modifier groups, Unicode properties and Unicode sets, empty-match
loops, and long fork chains.

Leave out the old Posix tests and bytecode-dump assertions that were
specific to the removed engine internals. One old boolean expectation
for duplicate named groups also did not translate cleanly to the new
search-style API, so keep the surrounding duplicate-name coverage in
capture-focused tests instead.

Verify the port with ./Build/sanitizers/bin/TestRegex.
2026-03-27 17:32:19 +01:00
Andreas Kling
dc2e9bbe91 LibRegex: Avoid widening ASCII regex input
Teach the Rust matcher to execute directly on ASCII-backed input.

Make the VM and literal fast paths generic over an input trait so we
can monomorphize separate ASCII and WTF-16 execution paths without
duplicating the regex semantics. Add ASCII-specific FFI entry points
and have the C++ bridge dispatch to them whenever Utf16View carries
ASCII storage.

This removes the per-match widening step from the hot path for exec(),
test(), and find_all(), which is exactly where LibJS often hands us
pure ASCII strings in 8-bit form. Keep the compiled representation
and reported capture offsets in UTF-16 code units so the observable
JavaScript behavior stays unchanged.
2026-03-27 17:32:19 +01:00
Andreas Kling
d7bf9d3898 LibRegex: Remove the legacy C++ ECMA-262 engine
Delete the old C++ ECMA-262 parser, optimizer, and matcher now that all
in-tree users compile and execute through `ECMAScriptRegex`.

Stop building the legacy engine, remove its source files and the
POSIX-only fuzzers that depended on it, and update the remaining
LibRegex tests to target the Rust-backed facade instead of the deleted
implementation. Clean up the last includes, comments, and helper paths
that only existed to support the old backend.

After this commit LibRegex has a single ECMAScript engine in-tree,
eliminating duplicated maintenance and unifying future regex work.
2026-03-27 17:32:19 +01:00
Ali Mohammad Pur
62348f1214 LibRegex: Replace virtual-inheritance dispatch with a switch interpreter 2026-03-20 16:10:25 -05:00
aplefull
6f1b7c8d50 LibRegex: Track optional capture groups for match_length_minimum
Backreferences can match the empty string when the referenced group
didn't participate in the match, so we shouldn't add their length to the
match_length_minimum, as it makes us skip valid matches.
2026-02-26 13:50:11 +01:00
aplefull
53a98f26d4 LibRegex: Exclude lookahead assertions from match_length_minimum
Lookaheads are zero-width assertions and should not affect the minimum
match length.
2026-02-26 13:50:11 +01:00
aplefull
db76c1e27c LibRegex: Account for case-insensitive matching in optimizer
Optimizer wasn't considering case-insensitive mode when checking for
overlap between the repeated element and what follows. So patterns like
`/a*A\d/i` failed to match because 'a' and 'A' weren't seen as
overlapping. We compare them in a case-insensitive way now, when i flag
is set.
2026-02-26 13:50:11 +01:00
aplefull
292c0cc486 LibRegex: Detect overlapping character classes and ranges in optimizer
`range_contains()` checked if an lhs_range was contained within the
query range, rather than checking for overlap. This caused patterns
like `/A*[A-Z]/` to fail matching "A" because the optimizer didn't
detect that 'A' overlaps with [A-Z]. And `char_class_contains()` only
checked if two character classes were identical, not if they overlapped.
So patterns like `/\d*\w/` failed to match "1" because \d and \w were
not recognized as overlapping, even though all digits are word
characters.
2026-02-26 13:50:11 +01:00
Ali Mohammad Pur
6aba31ba13 LibRegex: Add some FileCheck-like tests to ensure opts don't break 2026-02-07 14:09:56 +01:00
Ali Mohammad Pur
fedf0f78ca LibRegex: Reject RSeekTo crossing the current-to-EOL boundary 2026-02-07 14:09:56 +01:00
aplefull
e4572aa9d7 LibRegex: Add support for regex modifiers
This commit implements the regexp-modifiers proposal. It allows us to
use modification of i,m,s flags within groups using
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
2026-01-16 15:00:00 +01:00
aplefull
6ce312e22f LibRegex: Prevent empty matches in optional quantifiers
Step 2.b of the RepeatMatcher states that once minimum repetitions
are satisfied, empty matches should not be considered for further
repetitions. This was not being enforced for optional quantifiers
like `?`, so we had extra capture group matches.
2026-01-16 01:11:24 +01:00
mikiubo
535d2476a7 LibRegex: Implement proper lookbehind via new StepBack opcodes
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.

These opcodes replace the previous GoBack-based approach and enables
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.

Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.

Partially fix issue: #3459
2026-01-11 23:24:49 +01:00
Ali Mohammad Pur
2677338f43 LibRegex: Process RSeekTo candidates in the correct order 2026-01-07 00:14:02 +01:00
Ali Mohammad Pur
9668927dfc LibRegex: Don't generate duplicate results for /.*/ patterns
Since the code pattern may span multiple blocks, this can generate
duplicate results; keep the last one to avoid corrupting the bytecode.
2026-01-06 19:09:27 +01:00
Ali Mohammad Pur
363f1f6568 LibRegex: Correctly calculate ForkIf target offset in tree alternatives 2026-01-06 19:09:27 +01:00
Ali Mohammad Pur
3f35d84785 LibRegex+LibJS: Flatten the bytecode buffer before regex execution
This makes it so we don't have to unnecessarily check for having a
flattened buffer; significant performance increase.
2026-01-05 18:22:11 +01:00
aplefull
3e391bdb2d LibRegex: Use token-state restoration in character class parsing
Previously, we used restoration based on character position in parser.
This caused the lexer to re-tokenize from the middle of multi-character
tokens like escape sequences, and led to incorrect parse failures for
patterns like `[\[\]]`. We would backtrack to before the first `\[`
token, then re-lex the `[` as a separate token instead of part of the
`\[` escape.

Now we save and restore the actual token object along with the lexer
index, so we keep correct token state when backtracking.
2025-12-23 11:04:16 +01:00
aplefull
ff06a4a9e5 LibRegex: Fix negated class validation for nested string properties
We were incorrectly checking for negated character class when string
properties appeared in nested classes. Now we track negation state in
the parser and correctly reject invalid string properties in negated
classes.
2025-12-23 11:04:16 +01:00
aplefull
1b570fcd61 LibRegex: Correct negated character class escapes behavior
Patterns like /[^\S]/ should match whitespace characters, but previously
would fail to match. The position would advance twice: once during the
character class comparison, and again at the end when temporary_inverse
was reset. This caused matches to be skipped incorrectly.

Now we advance at the end only if position hasn't already changed during
the loop.
2025-12-23 11:04:16 +01:00
aplefull
52a3c19c0a LibRegex: Clamp large quantifier values instead of rejecting them
Fixes parsing of regex quantifiers with extremely large numeric values.
Previously, very large quantifiers would fail to parse, but Chrome and
Firefox both clamp such large values to 2^31-1 instead of rejecting
them. So now we do the same.
2025-12-23 11:04:16 +01:00
Ali Mohammad Pur
57ef949b61 LibRegex: Account for nested 'or' compare ops
Closes #6647.
2025-11-01 17:49:57 +01:00
aplefull
8c9c2ee289 LibRegex: Track local compares in nested classes 2025-11-01 14:38:08 +01:00
aplefull
c4eef822de LibRegex: Fix backreferences to undefined capture groups
Fixes handling of backreferences when the referenced capture group is
undefined or hasn't participated in the match.
CharacterCompareType::NamedReference is added to distinguish numbered
(\1) from named (\k<name>) backreferences. Numbered backreferences use
exact group lookup. Named backreferences search for participating
groups among duplicates.
2025-10-16 16:37:54 +02:00
Callum Law
8ada4b7fdc LibRegex: Account for opcode size when calculating incoming jump edges
Not accounting for opcode size when calculating incoming jump edges
meant that we were merging nodes where we otherwise shouldn't have been,
for example /.*a|.*b/.
2025-07-28 17:06:58 +02:00
Ali Mohammad Pur
c7ad6cd508 LibRegex: Use code unit length in more places that apply
Finishes what 7f6b70fafb started.
Having one part use length and another code unit length lead to crashes,
the added test ensures we don't mess that up again.
2025-07-24 23:09:01 +02:00
aplefull
e2f8f5a350 LibRegex: Fix capture groups in quantified alternations
This prevents empty matches from overwriting non-empty captures in
quantified alternations. Fixes patterns like (a|a?)+ where the optional
branch would incorrectly overwrite meaningful captures with empty
strings.
2025-07-24 10:40:16 +02:00
Timothy Flynn
2dfcc4c307 LibRegex: Compare code units (not code points) in non-Unicode char range 2025-07-21 23:44:18 +02:00
Timothy Flynn
9582895759 AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String 2025-07-18 12:45:38 -04:00
Ali Mohammad Pur
5b45223d5f LibRegex: Account for uppercase characters in insensitive patterns 2025-07-12 11:26:23 +02:00
Shannon Booth
bd6581fe22 LibRegex: Correctly use ClassSetReservedPunctuator in ClassSetCharacter
We had typo'd using ClassSetReservedDoublePunctuator which was
resulting in a parse error for the regex:

([^\\:]+?)

With the 'v' flag set.

Co-Authored-By: Ali Mohammad Pur <mpfard@serenityos.org>
2025-07-10 11:41:02 +02:00
ayeteadoe
25f5936dee CMake: Rename serenity_* helper functions/macros to ladybird_* 2025-07-03 23:19:41 +02:00
aplefull
486602e796 LibRegex: Fix handling of + quantifier with zero-width matches
Small change that allows quantifiers using Fork* forms (e.g., +) to
succeed after one match, even if that match has zero width.
2025-06-02 15:52:26 +02:00
Ali Mohammad Pur
cfc241f61d LibRegex: Make the trie rewrite optimisation maintain the alt order
This is required by the spec.
2025-05-21 14:28:45 +02:00
ayeteadoe
11bca38f91 CMake: Build LibRegex tests in Tests/LibRegex not Meta/Lagom
As LibRegex was not specified in TEST_DIRECTORIES, the existing
Tests/LibRegex subdirectory was not actually included during
configuration. Also the RegexLibC test has not been needed
since migration away from Serenitys LibC was done, so
that test has been fully removed. I also renamed the
Regex.cpp test to TestRegex.cpp to match the naming
convention of most test targets.
2025-05-14 02:05:12 -06:00
Ali Mohammad Pur
022cd1adca LibRegex: Use the right offset when patching jumps through fork-trees
Fixes #4474.
2025-04-27 12:16:15 +02:00
Ali Mohammad Pur
fca1d33fec LibRegex: Correctly calculate the target for Repeat in table alts
Fixes a bunch of websites breaking because we now verify jump offsets by
trying to remove 0-offset jumps.
This has been broken for a good while, it was just rare to see Repeat
inside alternatives that lended themselves well to tree alts.
2025-04-24 01:17:27 -06:00
Ali Mohammad Pur
76f5dce3db LibRegex: Flatten capture group list in MatchState
This makes copying the capture group COWVector significantly cheaper,
as we no longer have to run any constructors for it - just memcpy.
2025-04-18 17:09:27 +02:00
Andreas Kling
96f1f15ad6 LibRegex: Remove unused Utf8View/Utf32View support in RegexStringView 2025-04-16 10:04:50 +02:00
Andreas Kling
5308d77600 LibRegex: Don't use Optional<T> inside regex::Match
This prevented Match from being trivially copyable, which we want it to
be for fast Vector copying.
2025-04-14 17:40:13 +02:00
Andreas Kling
54edf29f1b LibRegex: Make Match::capture_group_name an index into the string table
This removes another Match member that required destruction. The "API"
for accessing the strings is definitely a bit awkward. We'll think of
something nicer eventually.
2025-04-14 17:40:13 +02:00
Ali Mohammad Pur
69050da929 LibRegex: Merge inverse string table mappings separately 2025-04-06 20:21:16 +02:00
Ali Mohammad Pur
299b9ca572 LibRegex: Check backreference index before looking it up
If a backref happens after it's cleared, the slot may be cleared
already.
2025-04-06 20:21:16 +02:00
Andreas Kling
6b6d3b32a4 LibRegex: Remove the StringCopyMatches mode
This mode made a lot of incorrect assumptions about string lifetimes,
and instead of fixing it, let's just remove it and tweak the few unit
tests that used it.
2025-03-24 22:27:17 +00:00