Patterns like `(?:^|;)\s*foo=...` can only start matching at input
start or at occurrences of the separator, but the generic
start-position loop still entered the VM at each byte and paid the
leading split/backtrack cost on every miss.
Teach the start-position analysis to recognize this `(^|literal)`
shape and jump straight to those candidate positions. Keep the
optimization narrow: wider literal sets would need a single-pass
scanner, and rescanning once per literal would make miss-heavy
alternations quadratic.
Add a LibRegex test for the cookie-style prefix. TestRegex still
passes, and a release JS benchmark exercising this shape remains
fast.
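A sketch of the shape this targets; the `foo` name and input are made up for illustration:

```javascript
// A cookie-style pattern: a match can only begin at the start of
// input or at a ';' separator, so those are the only candidate
// start positions worth entering the VM at.
const re = /(?:^|;)\s*foo=(\w+)/;

// The match anchors at the ';' preceding "foo=2" (index 5).
const match = re.exec("bar=1; foo=2");
console.log(match[1], match.index);
```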
Intersection and subtraction chains in unicode sets mode were accepting
bare range operands like `[a-z&&b-y]` and `[a-z--[aeiou]]`. V8 rejects
those forms; the range has to stay inside a nested class before it can
participate in a set operation.
Reject `ClassSetOperand::Range` when building `&&` or `--` expressions,
and extend the runtime regexp tests with the reported invalid patterns
plus an escaped-endpoint range case.
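The accepted and rejected shapes, sketched for an engine with `v`-flag (unicode sets) support; the feature check guards older engines:

```javascript
// Unicode sets mode rejects a bare range as a set-operation operand,
// but accepts the same range once nested in its own class.
let vSupported = true;
try { new RegExp("", "v"); } catch { vSupported = false; }

let bareRangeRejected = true;
let nestedRangeWorks = true;
if (vSupported) {
  bareRangeRejected = false;
  try {
    new RegExp("[a-z&&b-y]", "v"); // bare range operand: invalid
  } catch (e) {
    bareRangeRejected = e instanceof SyntaxError;
  }
  // Nested form: intersection of [a-z] and [b-y] excludes 'a'.
  const nested = new RegExp("[[a-z]&&[b-y]]", "v");
  nestedRangeWorks = nested.test("c") && !nested.test("a");
}
```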
Preserve V8's behavior for bare single-astral literals when a unicode
global search starts in the middle of a surrogate pair. We were
snapping that lastIndex back to the pair start unconditionally,
which let /😀/gu and /\u{1F600}/gu match where V8 returns null.
Expose that literal shape from LibRegex to LibJS and add runtime
coverage for the bare literal case alongside a grouped control.
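A sketch of the behavior being preserved:

```javascript
// "😀" is one code point stored as two code units (a surrogate pair).
const s = "\u{1F600}";
const bare = /\u{1F600}/gu;

// Start the search in the middle of the pair: no match exists from
// there, so exec() returns null and resets lastIndex to 0 rather
// than snapping back to the pair start and matching.
bare.lastIndex = 1;
const result = bare.exec(s);
console.log(result, bare.lastIndex);
```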
Track whether a pattern can match empty and only skip interior
surrogate positions when the matcher must consume input. This keeps
the unicode candidate scan fast for consuming searches without
dropping valid zero-width matches such as /\B/ and /(?!...)/ between
a surrogate pair's two code units.
Add runtime coverage for both global lastIndex searches and plain
exec() searches on zero-width unicode patterns.
When a backward greedy loop backtracked toward the right edge of the
input, the optimized scan for a following Char instruction could stop
making progress at end of input and loop forever. This made patterns
like /(?<=a.?)/ hang on non-matching input.
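The fixed behavior, sketched in JavaScript:

```javascript
// A zero-width pattern guarded by a variable-length lookbehind.
const re = /(?<=a.?)/;

// No 'a' anywhere: the search must terminate with null instead of
// looping forever at the end of input.
console.log(re.exec("bbb"));

// With an 'a' present, the first position whose left context
// satisfies the lookbehind is just after the 'a' (index 2).
console.log(re.exec("xab").index);
```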
Keep the greedy built-in class fast path aligned with the regular VM
matcher for non-Unicode regexps. Without this, /\w+/i and /\W+/i
wrongly applied Unicode ignore-case behavior in the optimized loop.
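A sketch of the semantic difference the fast path must respect:

```javascript
// Without the u flag, ignore-case canonicalization never maps a
// non-ASCII character into \w, so U+017F (LATIN SMALL LETTER LONG S)
// does not match.
console.log(/\w/i.test("\u017F"));

// With the u flag, Unicode case folding maps U+017F to 's', so \w
// matches it under /iu.
console.log(/\w/iu.test("\u017F"));
```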
Detect literal tails that lie on the linear success path and reject
matches early when that literal never appears in the remaining input.
This lets /(a+)+b/ fail quickly on long runs of a instead of spending
its backtracking budget proving that the missing b can never match.
Keep the tail analysis cheap while doing this. The new required-literal
hint reuses trailing-literal extraction, so rewrite the linear-tail
check to compute predecessor counts in one pass instead of rescanning
the whole program for every instruction. That keeps large regex parses,
including the large bytestring constructor test, fast.
Add a regression test for ("a".repeat(25)).match(/(a+)+b/), which
should return null without throwing a backtrack-limit error.
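The catastrophic shape, at a size small enough to finish even without the hint:

```javascript
// Every decomposition of the run of a's must otherwise be tried
// before the engine can prove the trailing 'b' never matches; a
// required-literal check rejects immediately once 'b' is absent
// from the remaining input.
const input = "a".repeat(15);
console.log(input.match(/(a+)+b/));
```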
Patterns like `^(a|a?)+$` build a split tree where each `a` can be
matched by either alternative. On short non-matching inputs, that
still blows through the backtrack limit even though the two
alternatives are semantically equivalent to a single greedy optional
matcher.
Detect the narrow case of two single-term alternatives that compile to
the same simple matcher, where one is required and the other is greedy
`?`, and compile them as that single optional term instead of emitting
a disjunction.
Add a String.prototype.match regression for
("a".repeat(25) + "b").match(/^(a|a?)+$/), which should return null
instead of throwing an InternalError.
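The equivalent shape at a smaller size, runnable in any engine:

```javascript
// Both alternatives compile to the same single-character matcher;
// one is required and one is optional, so the disjunction is
// equivalent to a single greedy optional term.
const input = "a".repeat(12) + "b";
console.log(input.match(/^(a|a?)+$/));
```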
Backward Char instructions read the code point immediately to the left
of the current position, but the greedy loop backtracking optimization
was scanning for the next literal at the current position itself.
That meant a lookbehind like `(?<=h.*)THIS` never reconsidered the
boundary after `h`, so valid matches were missed.
When the quantified part was allowed to shrink to zero, as in the
reported `(?<!.*q.*?)(?<=h.*)THIS(?=.*!)` pattern, the same
backtracking bug could thrash badly enough to appear hung.
Fix the backward greedy-loop scan to test candidate boundaries against
the code units immediately to their left. Do the same for supplementary
characters by checking the surrogate pair ending at that boundary.
Add String.prototype.match regressions for both the simple greedy
lookbehind and the full reported pattern.
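The example input here is made up; the patterns are the ones from the report:

```javascript
// The boundary to reconsider is *after* 'h': the lookbehind's .*
// must be allowed to shrink so h.* ends exactly where THIS begins.
console.log("hello THIS!".match(/(?<=h.*)THIS/).index);

// The full reported pattern, including the zero-width-capable
// negative lookbehind, must also terminate and match.
console.log("hello THIS!".match(/(?<!.*q.*?)(?<=h.*)THIS(?=.*!)/)[0]);
```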
RepeatMatcher retries a quantified atom with its own captures cleared,
but if an additional greedy iteration matches the empty string the
engine must fall back to the pre-iteration state. The fast VM path was
clearing capture registers after backtracking from ProgressCheck,
which meant the restored state from the previous successful iteration
was immediately wiped out.
That showed up with nested quantified captures like
"xyz123xyz".match(/((123)|(xyz)*)*/), where the final empty expansion
of the outer `*` discarded the last non-empty captures and returned
undefined for groups 1 and 4.
The same area also needs to track each zero-width-capable iteration's
start position explicitly. Initializing that state with ProgressCheck
stored the end of the previous repetition instead, which regressed
patterns like `/(a*)*/` by letting an empty iteration commit `""`
into the capture instead of falling back to the pre-iteration state
with an undefined capture.
Clear captures before backtracking from a rejected empty iteration,
and save iteration starts before entering quantified bodies so
ProgressCheck only decides whether that iteration made progress.
Add regressions for the reported nested quantified capture case and
for `/(a*)*/.exec("b")`, which should leave the capture undefined.
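The two behaviors pinned down here, per the spec's RepeatMatcher:

```javascript
// A rejected empty expansion of the outer * must fall back to the
// captures from the last non-empty iteration...
console.log("xyz123xyz".match(/((123)|(xyz)*)*/));
// → ["xyz123xyz", "xyz", undefined, "xyz"]

// ...while an iteration that only ever matched empty leaves its
// capture undefined rather than committing "".
console.log(/(a*)*/.exec("b"));
// → ["", undefined]
```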
Drop the outdated note that GreedyLoop backtracking does not need a
register snapshot.
The previous commit started snapshotting registers for the optimized
GreedyLoop and LazyLoop states so capture groups are restored correctly
across backtracking. Keeping the old comment would describe the
opposite of the code we now rely on.
Snapshot registers for GreedyLoop and LazyLoop backtrack states so
failed alternatives cannot leak capture mutations into an older loop
choice point.
Before this change, those optimized states only restored the input
position and active modifiers. If a later branch changed capture
registers before failing, revisiting an earlier loop state reused
the stale captures instead of the state that was current when the
loop state was pushed.
That let /^(b+|a){1,2}?bc/ on "bbc" produce an invalid group 1 range
with start 2 and end 1, which later tripped UBSan while
RegExp.prototype.exec materialized the match result.
Add a RegExp.prototype.exec regression for this pattern so we keep
the expected ["bbc", "b"] result covered.
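The covered case, as it behaves in a conforming engine:

```javascript
// The lazy quantifier first tries one iteration; b+ initially grabs
// "bb", then must give a unit back so "bc" can match. Revisiting
// that loop choice point has to restore group 1 to the state saved
// when the choice point was pushed.
console.log(/^(b+|a){1,2}?bc/.exec("bbc"));
// → ["bbc", "b"]
```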
Teach the Rust matcher to execute directly on ASCII-backed input.
Make the VM and literal fast paths generic over an input trait so we
can monomorphize separate ASCII and WTF-16 execution paths without
duplicating the regex semantics. Add ASCII-specific FFI entry points
and have the C++ bridge dispatch to them whenever Utf16View carries
ASCII storage.
This removes the per-match widening step from the hot path for exec(),
test(), and find_all(), which is exactly where LibJS often hands us
pure ASCII strings in 8-bit form. Keep the compiled representation
and reported capture offsets in UTF-16 code units so the observable
JavaScript behavior stays unchanged.
Delete the old C++ ECMA-262 parser, optimizer, and matcher now that all
in-tree users compile and execute through `ECMAScriptRegex`.
Stop building the legacy engine, remove its source files and the
POSIX-only fuzzers that depended on it, and update the remaining
LibRegex tests to target the Rust-backed facade instead of the deleted
implementation. Clean up the last includes, comments, and helper paths
that only existed to support the old backend.
After this commit LibRegex has a single ECMAScript engine in-tree,
eliminating duplicated maintenance and unifying future regex work.
Add `ECMAScriptRegex`, LibRegex's C++ facade for ECMAScript regexes.
The facade owns compilation, execution, captures, named groups, and
error translation for the Rust backend, which lets callers stop
depending on the legacy parser and matcher types directly. Use it in the
remaining non-LibJS callers: URLPattern, HTML input pattern handling,
and the places in LibHTTP that only needed token validation.
Where a full regex engine was unnecessary, replace those call sites with
direct character checks. Also update focused LibURL, LibHTTP, and WPT
coverage for the migrated callers and corrected surrogate handling.
Add LibRegex's new Rust ECMAScript regular expression engine.
Replace the old parser's direct pattern-to-bytecode pipeline with a
split architecture: parse patterns into a lossless AST first, then
lower that AST into bytecode for a dedicated backtracking VM. Keep the
syntax tree as the place for validation, analysis, and optimization
instead of teaching every transformation to rewrite partially built
bytecode.
Specialize this backend for the job LibJS actually needs. The old C++
engine shared one generic parser and matcher stack across ECMA-262 and
POSIX modes and supported both byte-string and UTF-16 inputs. The new
engine focuses on ECMA-262 semantics on WTF-16 data, which lets it
model lone surrogates and other JavaScript-specific behavior directly
instead of carrying POSIX and multi-encoding constraints through the
whole implementation.
Fill in the ECMAScript features needed to replace the old engine for
real web workloads: Unicode properties and sets, lookahead and
lookbehind, named groups and backreferences, modifier groups, string
properties, large quantifiers, lone surrogates, and the parser and VM
corner cases those features exercise.
Reshape the runtime around compile-time pattern hints and a hotter VM
loop. Pre-resolve Unicode properties, derive first-character,
character-class, and simple-scan filters, extract safe trailing
literals for anchored patterns, add literal and literal-alternation
fast paths, and keep reusable scratch storage for registers,
backtracking state, and modifier stacks. Teach `find_all` to stay
inside one VM so global searches stop paying setup costs on every
match.
Make those shortcuts semantics-aware instead of merely fast. In Unicode
mode, do not use literal fast paths for lone surrogates, since
ECMA-262 must not let `/\ud83d/u` match inside a surrogate pair.
Likewise, only derive end-anchor suffix hints when the suffix lies on
every path to `Match`, so lookarounds and disjunctions cannot skip into
a shared tail and produce false negatives.
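The lone-surrogate rule is observable from JavaScript; a minimal sketch:

```javascript
// In Unicode mode a lone lead surrogate can only match an actual
// lone surrogate, never the first half of a surrogate pair.
console.log(/\ud83d/u.test("\ud83d\ude00")); // paired: no match
console.log(/\ud83d/u.test("\ud83d"));       // lone:   match
```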
This commit lands the Rust crate, the C++ wrapper, the build
integration, and the initial LibJS-side plumbing needed to exercise
the new engine under real RegExp callers before removing the legacy
backend.
Backreferences can match the empty string when the referenced group
didn't participate in the match, so we shouldn't add their length to the
match_length_minimum, as it makes us skip valid matches.
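A minimal sketch of the kind of match this was ruling out (the pattern is illustrative):

```javascript
// Group 1 never participates, so the backreference matches the empty
// string; counting its length toward the minimum match length would
// wrongly reject this match against "".
console.log(/^(ab)?\1$/.test(""));
```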
The optimizer wasn't considering case-insensitive mode when checking
for overlap between the repeated element and what follows, so
patterns like `/a*A\d/i` failed to match because 'a' and 'A' weren't
seen as overlapping. We now compare them case-insensitively when the
`i` flag is set.
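The repaired behavior, with a made-up input:

```javascript
// With the i flag, the repeated 'a' and the following 'A' overlap:
// the greedy loop must be willing to give back a character it
// consumed so the literal 'A' can match it.
console.log(/a*A\d/i.test("aaa1"));
```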
`range_contains()` checked if an lhs_range was contained within the
query range, rather than checking for overlap. This caused patterns
like `/A*[A-Z]/` to fail matching "A" because the optimizer didn't
detect that 'A' overlaps with [A-Z]. And `char_class_contains()` only
checked if two character classes were identical, not if they overlapped.
So patterns like `/\d*\w/` failed to match "1" because \d and \w were
not recognized as overlapping, even though all digits are word
characters.
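Both overlap cases from above, as observable matches:

```javascript
// 'A' overlaps [A-Z]: A* must be able to yield its 'A' so the class
// can consume it.
console.log("A".match(/A*[A-Z]/)[0]); // → "A"

// Every \d is also a \w, so \d* must likewise give "1" back to the
// following \w.
console.log("1".match(/\d*\w/)[0]);   // → "1"
```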
In non-Unicode mode, incomplete escape sequences like `\x0` or `\u00`
should be parsed as literal characters. `read_digits_as_string` consumed
hex digits but did not restore the parser position when fewer digits
than required were found, and `consume_escaped_code_point` did not
update `current_token` after falling back to literal 'u'.
When `\0` is followed by digits, we backtrack to parse it as a legacy
octal escape. We need to backtrack 2 characters, so
`parse_legacy_octal_escape` sees the leading `0` and can parse sequences
correctly.
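Both escape-parsing behaviors, as observable in Annex B (non-Unicode) mode:

```javascript
// Incomplete \x and \u escapes fall back to literal characters.
console.log(/\x0/.test("x0"));   // 'x' followed by '0'
console.log(/\u00/.test("u00")); // 'u' followed by "00"

// \0 followed by another digit re-parses as a legacy octal escape,
// here code point 7 (BEL).
console.log(/\07/.test("\x07"));
```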
These new traits are identical to `Traits<Integral>`, except that
calling `.hash()` will return the value itself instead of hashing it.
This should be used in cases where either the value is already a proper
hash, or using the value as a hash will yield "good enough" performance
in e.g. HashTable.
Types larger than 32 bits are folded in on themselves. Collision tests
on some popular hashing algorithms show that XOR folding slightly
increases the number of collisions, but this allows `IdentityHashTraits`
not to make any assumptions on which bits are the most relevant for the
final hash.
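A sketch of the folding step for a 64-bit value, using BigInt for illustration (the real traits operate on native integers):

```javascript
// XOR-fold the high half into the low half so every input bit can
// influence the resulting 32-bit hash, without assuming which half
// carries the relevant bits.
function foldToU32(v) { // v: BigInt in [0, 2^64)
  return Number((v ^ (v >> 32n)) & 0xffffffffn);
}

console.log(foldToU32(0x1234_5678_0000_0000n).toString(16)); // "12345678"
console.log(foldToU32(0xffff_ffff_ffff_ffffn));              // 0
```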
We had three instances of `pair_int_hash()` being called with a value
that was pulled through `u32_hash()`, which is not necessary - both
arguments to `pair_int_hash()` will be properly hashed.
Rework our hash functions a bit for significant better performance:
* Rename int_hash to u32_hash to mirror u64_hash.
* Make pair_int_hash call u64_hash instead of multiple u32_hash()es.
* Implement MurmurHash3's fmix32 and fmix64 for u32_hash and u64_hash.
On my machine, this speeds up u32_hash by 20%, u64_hash by ~290%, and
pair_int_hash by ~260%.
We lose the property that an input of 0 results in something that is not
0. I've experimented with an offset to both hash functions, but it
resulted in a measurable performance degradation for u64_hash. If
there's a good use case for 0 not to result in 0, we can always add in
that offset as a countermeasure in the future.
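For reference, MurmurHash3's fmix32 finalizer, sketched in JavaScript with 32-bit arithmetic (the actual code is C++ in AK):

```javascript
// MurmurHash3 fmix32: alternate xor-shifts and multiplications by
// the published constants to diffuse every input bit.
function fmix32(h) {
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// The zero fixed point discussed above: fmix32(0) is 0.
console.log(fmix32(0), fmix32(1));
```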
Previously, case-insensitive regex matching used ASCII-only case
conversion (to_ascii_lowercase) even for Unicode characters. Now we
implement the Canonicalize abstract operation, so Unicode characters
are case-folded properly during case-insensitive matching.
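An example of what proper folding enables:

```javascript
// Simple Unicode case folding maps U+03A3, U+03C2, and U+03C3 (the
// Greek sigmas) to the same character, so they match each other
// case-insensitively in Unicode mode; ASCII-only lowercasing would
// leave them distinct.
console.log(/ς/iu.test("Σ"));
console.log(/Σ/iu.test("σ"));
```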
Avoid crashing in RegexDebug when saved_positions or step_backs
are empty.
These cases are already handled correctly by the bytecode execution,
but the debug output assumed non-empty vectors.
Print a placeholder instead when no entries are present.
This fixes #7502.
This commit implements the regexp-modifiers proposal, which allows
toggling the `i`, `m`, and `s` flags within groups using
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
Step 2.b of the RepeatMatcher states that once minimum repetitions
are satisfied, empty matches should not be considered for further
repetitions. This was not being enforced for optional quantifiers
like `?`, so we had extra capture group matches.
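The rule in action, per the spec's RepeatMatcher:

```javascript
// The optional group's only expansion here is empty; once the
// minimum (zero) is satisfied, that empty match must not be
// committed, so group 1 stays undefined instead of becoming "".
console.log(/(a?)?/.exec("b"));
// → ["", undefined]
```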
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.
These opcodes replace the previous GoBack-based approach and enable
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.
Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.
Partially fixes #3459.
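Variable-length lookbehind, which a fixed GoBack cannot express, in JavaScript terms (the example pattern is made up):

```javascript
// The lookbehind body \d+ has no fixed width, so the engine must
// search backward for a viable starting point for the assertion.
const m = /(?<=\d+)px/.exec("100px");
console.log(m[0], m.index);
```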
ca2f0141f6 removed only the execution side
of this, which made it skip some optimizations for pure string
searches. This commit implements it properly for UTF-16 strings
instead.