Previously, a character class containing any builtin (\d, \w, \s)
forced the compiler down the slow "complex class" path, which emits
a disjunction of alternatives and backtracks at runtime.
For non-unicode, non-unicode-sets, non-negated classes, \w and \d
can be inlined as their raw ASCII code-point ranges. The resulting
class stays on the fast path and compiles into a single sorted
CharClass instruction.
The unicode/unicode_sets and negation guards are required for
correctness: with the /u + /i flags, \w gains non-ASCII members
via case folding (e.g. U+017F, U+212A), and negated classes have
a separate, smarter compilation path.
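As a concrete reference for the fast path, outside /u and /v the builtins
expand to fixed ASCII sets, so a class like `[\w-]` flattens into plain
sorted ranges. A small observable-behavior sketch of that equivalence (the
range spelling below is the standard ECMAScript definition, not LibRegex's
internal table):

```javascript
// ASCII expansions of the builtins in non-unicode, non-negated classes:
//   \d -> [0-9]            \w -> [0-9A-Za-z_]
const inlined = /[0-9A-Z_a-z-]/; // [\w-] inlined as raw ranges
const viaBuiltin = /[\w-]/;

// Both forms accept exactly the same code units.
let same = true;
for (let cp = 0; cp < 0x100; ++cp) {
  const ch = String.fromCharCode(cp);
  if (inlined.test(ch) !== viaBuiltin.test(ch)) same = false;
}
```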
Let's not have LibRegex be the home of LibUnicode FFI. Move these
helpers to LibUnicode so that we can:
1. Use these helpers in other libraries more easily.
2. Swap out icu4c methods with icu4x methods all within LibUnicode.
Unrolling a bounded quantifier {min,max} into (max-min) optional Split
chains lets the backtracker explore O(2^n) paths, which quickly
exhausts the backtrack limit for large bounds.
Fix this by compiling the optional tail via a RepeatStart/RepeatCheck
counted loop when the atom is known to be non-zero-width. The loop
is safe to use without a progress check precisely because the atom
cannot match empty.
This required making atom_can_be_zero_width recursive into group bodies:
previously it conservatively returned true for all Group and
NonCapturingGroup atoms, so the non-zero-width guard could never fire
for grouped subexpressions.
The old lowering triggered "Regular expression backtrack limit exceeded"
for patterns like /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/, causing
inputs that should match normally (or return null) to throw instead.
Fixes syntax highlighting of the C++ API on https://blend2d.com
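The reported pattern is an ordinary bounded quoted-string matcher; a quick
illustration of the inputs that previously threw instead of matching or
returning null:

```javascript
// Bounded string-literal matcher from the report: up to 32 escaped or
// plain characters between single quotes.
const re = /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/;
const hit = "int x = 'a';".match(re)[0]; // the quoted character literal
const miss = re.test("no quotes here");  // should be false, not a throw
```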
Switch to LibUnicode’s ICU-backed functions.
Keep the explicit checks for '$', '_', U+200C, and U+200D that
ECMAScript requires on top of the Unicode properties.
Add test coverage for both the newly accepted case
and regression guards for cases that must continue to work.
Use mimalloc for Ladybird-owned allocations without overriding malloc().
Route kmalloc(), kcalloc(), krealloc(), and kfree() through mimalloc,
and put the embedded Rust crates on the same allocator via a shared
shim in AK/kmalloc.cpp.
This also lets us drop kfree_sized(), since it no longer used its size
argument. StringData, Utf16StringData, JS object storage, Rust error
strings, and the CoreAudio playback helpers can all free their AK-backed
storage with plain kfree().
Sanitizer builds still use the system allocator. LeakSanitizer does not
reliably trace references stored in mimalloc-managed AK containers, so
static caches and other long-lived roots can look leaked. Pass the old
size into the Rust realloc shim so aligned fallback reallocations can
move posix_memalign-backed blocks safely.
Static builds still need a little linker help. macOS app binaries need
the Rust allocator entry points forced in from liblagom-ak.a, while
static ELF links can pull in identical allocator shim definitions from
multiple Rust staticlibs. Keep the Apple -u flags and allow those
duplicate shim symbols for LibJS and LibRegex links on Linux and BSD.
When a multi-digit decimal escape like \81 exceeds the total capture
group count in non-Unicode mode, the parser falls back to legacy octal
reinterpretation. However, digits '8' and '9' are not valid in octal
(base 8), so passing them to parse_legacy_octal() caused an unwrap()
panic on None from char::to_digit(8).
Treat '8' and '9' as literal characters in the fallback path, matching
the behavior already present for the non-backreference case.
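In Annex B terms, \8 and \9 can never be octal, so when the backreference
interpretation fails they fall through to literal digits:

```javascript
// With no capture groups, \81 cannot be a backreference, and '8' is not
// an octal digit, so both characters must match literally.
const m = /\81/.exec("81");
```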
Unicode-set intersection and subtraction always lowered their
post-consumption checks as lookbehinds. That is correct while the
outer matcher runs forward, but inside lookbehind the consumed text
sits to the right of the current position, so the checks must flip
to lookahead instead. Because we always looked left, patterns like
`(?<=[[^A-Z]--[A-Z]])\P{N}` and the reported fuzz case missed
matches whenever the character before the consumed one changed the
set-operation result.
Preserve the surrounding match direction when compiling those
checks, and add coverage for reduced subtraction and intersection
cases plus the original regression.
Negated unicode-set classes are only valid when every member is
single-code-point. We already rejected direct string-valued members
such as `\q{ab}` and `\p{RGI_Emoji_Flag_Sequence}` inside `[^...]`,
but nested class-set operands could still smuggle them through, so
patterns like `[^[[\p{Emoji_Keycap_Sequence}]]]` and the reported
fuzzed literal compiled instead of throwing.
Validate nested class-set expressions after parsing and reject only the
negated `/v` classes whose resulting multi-code-point strings are still
non-empty. Track the exact string members contributed by string
literals, string properties, and nested classes so intersections and
subtractions can eliminate them before the negated-class check runs.
Add constructor and literal coverage for the reduced nested-string
cases, the original regression, and valid negated set operations that
remove every string member.
RegExpBuiltinExec used to snap any Unicode lastIndex that landed on a
low surrogate back to the start of the pair. That matched `/😀/u`,
but it skipped valid empty matches when the original low-surrogate
position was itself matchable, such as `/\p{Script=Cyrillic}?(?<!\D)/v`
on `"A😘"` and the longer fuzzed global case.
Try the snapped position first, then retry the original lastIndex when
the snapped match fails. Only keep that second result when it is empty
at the original low-surrogate position, so consuming /u and /v matches
still cannot split a surrogate pair. In the Rust VM, treat backward
Unicode matches that start between surrogate halves as having no
complete code point to their left, which matches V8's lookbehind
behavior for those positions.
Add reduced coverage for both low-surrogate exec cases, the original
global match count regression, and the consuming-match retry regression.
Compile the synthetic assertion for negated classes in the same
direction as the surrounding matcher. We were hardcoding a
lookahead for `[^...]`, so lookbehind checked the wrong side of the
current position and missed valid `/v` matches such as
`(?<=[^\p{Emoji}])2`.
Apply the same fix to unicode set classes, since they use the same
negative-lookaround-plus-`AnyChar` lowering for complements. Add
reduced `RegExp.js` coverage for both `[^\p{Emoji}]` and
`[[^\p{Emoji}]]` in lookbehind, plus the original complex `/gv`
regression.
Repeated simple loops like "a+".repeat(100) compile to a chain of
greedy loop instructions. When one loop failed, the VM only knew how
to give back one character at a time unless the next instruction was a
literal char, so V8's regexp-fallback case ran into the backtrack
limit instead of finding the obvious match.
When a greedy simple loop is followed by more loops for the same
matcher, sum their minimum counts and backtrack far enough to leave the
missing suffix in one step. If that suffix is already available to the
right, still give back one character so the VM makes progress instead
of reusing the same greedy state forever.
The RegExp runtime test now covers the Chromium regexp-fallback case
through exec(), global exec(), and both replace() paths, plus bounded
same-matcher chains where the suffix minimum is partly missing or
already available.
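A reduced version of the chained-loop shape (the real fallback case chains
far more loops) shows why summed give-back matters; anchored like this,
each `a+` must end up with exactly one character:

```javascript
// Twelve greedy a+ loops over twelve characters. The first loop
// initially swallows everything and must give eleven characters back to
// the loops that follow before the match can succeed.
const chained = new RegExp("^" + "a+".repeat(12) + "$");
const found = chained.test("a".repeat(12));
```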
The VM only marked patterns as anchored when the first real instruction
was AssertStart. That missed anchors hidden behind capture setup or a
leading positive lookahead, so patterns like /(^bar)/ and /(?=^bar)\w+/
fell back to whole-subject scanning.
Teach the hint analysis to see through the non-consuming wrappers we
emit ahead of a leading ^, but still run the literal prefilters before
anchored and sticky VM attempts. Missing required literals should stay
cheap no-matches instead of running the full backtracking VM and
raising the step limit.
The RegExp runtime test now covers the Chromium ascii-regexp-subject
case on a long ASCII input and anchored, sticky, and global no-match
cases where the required literal is absent.
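Both reported shapes only ever match at the start of the subject, which is
what the hint analysis now sees through:

```javascript
// Anchors hidden behind capture setup and a leading lookahead.
const captured = /(^bar)/.exec("barbaz");
const viaLookahead = /(?=^bar)\w+/.exec("barbaz");
const miss = /(^bar)/.test("foo barbaz"); // '^' is not multiline here
```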
Browsers clamp braced quantifier bounds above 2^31 - 1 before
checking whether {min,max} is in order. The parser still kept values
up to u32::MAX, so patterns like {2147483648,2147483647} were
rejected even though both bounds should collapse to the same limit.
Clamp parsed braced quantifier bounds to 2^31 - 1 as they are read.
This keeps the existing acceptance of huge exact and open-ended
quantifiers and makes the constructor and regex literal paths agree
with other engines on the out-of-order edge cases.
The RegExp runtime and syntax tests now cover accepted huge
quantifiers, clamped order validation, and huge literal forms. The
reported constructor and literal cases also match other engines.
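Assuming the clamping behavior described above, the reported out-of-order
edge case looks like this: both bounds exceed 2^31 - 1, so after clamping
they compare equal and the quantifier is accepted.

```javascript
// Raw values are reversed, but both clamp to 2^31 - 1, so {min,max} is
// in order after clamping and construction must not throw.
let accepted = true;
try {
  new RegExp("a{2147483648,2147483647}");
} catch {
  accepted = false;
}
```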
Unicode set intersection and subtraction were compiled by matching one
operand and then checking the others with lookbehind. That let a
longer string operand reject a shorter match whenever the longer
string happened to end at the same position.
Group unicode set operands by exact match length and compile each
length class separately, longest first. This keeps longest-match
semantics for unions while making intersection and subtraction compare
only strings of the same length. The new RegExp runtime cases cover
both the reported [a-z]--\q{abc} regression and the related
intersection/subtraction mismatches, and they now agree with V8.
Reject surrogate pairs in named group names unless both halves come
from the same raw form. A literal surrogate half was being
normalized into \uXXXX before LibRegex parsed the pattern, which let
mixed literal and escaped forms sneak through.
Validate surrogate handling on the UTF-16 pattern before
normalization, but only treat \k<...> as a named backreference when
the parser would do that too. Legacy regexes without named groups
still use \k as an identity escape, so their literal text must not be
rejected by the pre-scan.
Add runtime and syntax tests for the mixed forms, the valid literal,
fixed-width, and braced escape cases, and the legacy \k literals.
Teach import_rust_crate() to track RustFFI.h as a real build output,
and teach the relevant Rust build scripts to rerun when their FFI
inputs change.
Also keep a copy of RustFFI.h in Cargo's own OUT_DIR and restore the
configured FFI output from that cached copy after cargo rustc runs.
This fixes the case where Ninja knows the header is missing, reruns
the custom command, and Cargo exits without rerunning build.rs
because the crate itself is already up to date.
When Cargo leaves multiple hashed build-script outputs behind, pick
the newest root-output before restoring RustFFI.h so we do not copy a
stale header after Rust-side API changes.
Finally, track the remaining Rust-side inputs that could leave build
artifacts stale: LibUnicode and LibJS now rerun build.rs when src/
changes, and the asmintgen rule now depends on Cargo.lock, the
BytecodeDef path dependency, and newly added Rust source files.
Sticky regular expressions were still using the generic forward-search
paths inside LibRegex and only enforcing the lastIndex check back in
LibJS after a match had already been found. That made tokenizer-style
sticky patterns spend most of their time scanning for later matches
that would be thrown away.
Route sticky exec() and test() through an anchored VM entry point that
runs exactly once at the requested start position while keeping the
existing literal-hint pruning. Add focused test-js coverage for sticky
literals, alternations, classes, quantifiers, and WebIDL-style token
patterns.
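The sticky contract the anchored entry point exploits is that a /y regex
attempts exactly one position, lastIndex, and never scans ahead:

```javascript
// Tokenizer-style sticky matching: one attempt per position.
const token = /[a-z]+/y;
const first = token.exec("foo bar");   // "foo"; lastIndex advances to 3
const atSpace = token.exec("foo bar"); // null: index 3 is a space, and
                                       // sticky must not search onward
```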
Analyze parsed patterns for literal substrings that every match
must contain and feed the longest one into the existing
required-literal hint. This lets the VM reject obvious misses
for optional-prefix extractors, literal-prefix parsers, and
common-substring alternations without adding more shape-specific
fast paths.
Use ASCII case-insensitive probes when the required literal comes
from a legacy `/i` pattern, and skip modifier groups that change
ignore-case so the hint does not over-constrain scoped flag
behavior. Keep the analysis within the same 64-code-unit budget
that the VM hint already uses so large fixed quantifiers and long
literal alternations do not blow up compile time.
Add TestRegex coverage for assignment extractors, ASCII
ignore-case prefilters, common-substring alternations, and the
compile-time regression shapes. TestRegex passes, the long-pattern js
probes behave correctly, and regex benchmark spot checks stay
fast.
Extract whole-pattern `\bliteral\b` cases outside Unicode
modes and handle them with literal search plus ASCII
word-boundary checks. This keeps the optimization narrow while
still covering the token-style patterns that show up often in
our regex workloads.
Keep Unicode word-boundary patterns on the VM path, since
`\b` semantics change there. Add tests for plain matching,
ASCII ignore-case matching, and the Unicode fallback case.
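The token-style shape the fast path targets reduces to a literal search
plus a word-character test on each neighbor of the occurrence:

```javascript
// \bcat\b: the literal must not have word characters on either side.
const word = /\bcat\b/;
const inner = word.test("concatenate"); // false: flanked by word chars
const exact = word.test("a cat!");      // true: ' ' and '!' end the word
```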
Extend the literal alternation fast path to cover ASCII-only
`/i` patterns outside Unicode modes. Search candidate start
positions with a folded first-code-unit set, then verify the
alternatives in source order to preserve ECMAScript semantics.
Keep Unicode ignore-case alternations on the VM path, since they
can match non-ASCII code points through canonicalization. Add
tests for mixed-prefix alternatives, source-order selection,
and the Unicode fallback behavior.
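Verifying in source order is what preserves ECMAScript semantics even when
candidates are found case-insensitively: at a shared start position the
earlier alternative wins regardless of length.

```javascript
// Candidates are located with a folded first-code-unit scan, but "ab"
// still beats "aba" at the same start because it appears first.
const picked = "ABAB".match(/ab|aba/i)[0];
```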
Whole-pattern ignore-case literals already bypass the VM, but the
search path still checked every start position linearly with full
case-fold comparisons. For ASCII literals we can cheaply scan for the
first code unit and only verify the rest at candidate positions.
This keeps the existing non-ASCII fallback intact while speeding up
common benchmark shapes such as `/puebzr/i`, `/zfvr/gi`, and
`/##yv22##/gi`. Add tests covering both alphabetic and punctuation-led
ASCII literals under ignore-case matching.
TestRegex passes. In Build/release js microbenchmarks, representative
cases dropped from about 47ms to 38ms, 95ms to 63ms, and 121ms to
79ms.
Patterns like `(?:^|;)\s*foo=...` can only start matching at input
start or at occurrences of the separator, but the generic
start-position loop still entered the VM at each byte and paid the
leading split/backtrack cost on every miss.
Teach the start-position analysis to recognize this `(^|literal)`
shape and jump straight to those candidate positions. Keep the
optimization narrow: wider literal sets would need a single-pass
scanner, and rescanning once per literal would make miss-heavy
alternations quadratic.
Add a LibRegex test for the cookie-style prefix. TestRegex still
passes, and a release js benchmark exercising this shape remains
fast.
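The cookie-style shape in question: every match must begin either at the
start of the input or at a ';', so those are the only candidate starts
worth entering the VM at.

```javascript
const cookie = /(?:^|;)\s*foo=(\w+)/;
const fromSeparator = cookie.exec("bar=1;foo=2")[1]; // found after ';'
const fromStart = cookie.exec("foo=9;bar=1")[1];     // found at input start
```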
Intersection and subtraction chains in unicode sets mode were accepting
bare range operands like `[a-z&&b-y]` and `[a-z--[aeiou]]`. V8 rejects
those forms; the range has to stay inside a nested class before it can
participate in a set operation.
Reject `ClassSetOperand::Range` when building `&&` or `--` expressions,
and extend the runtime regexp tests with the reported invalid patterns
plus an escaped-endpoint range case.
Preserve V8's behavior for bare single-astral literals when a unicode
global search starts in the middle of a surrogate pair. We were
snapping that lastIndex back to the pair start unconditionally,
which let /😀/gu and /\u{1F600}/gu match where V8 returns null.
Expose that literal shape from LibRegex to LibJS and add runtime
coverage for the bare literal case alongside a grouped control.
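The bare-literal behavior being preserved, in exec() terms:

```javascript
// Starting a /u global search in the middle of a surrogate pair: a bare
// astral literal must not match there, and the failed global exec
// resets lastIndex to 0.
const face = /😀/gu;
face.lastIndex = 1; // between the surrogate halves
const result = face.exec("😀");
```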
Track whether a pattern can match empty and only skip interior
surrogate positions when the matcher must consume input. This keeps
the unicode candidate scan fast for consuming searches without
dropping valid zero-width matches such as /\B/ and /(?!...)/ between
a surrogate pair's two code units.
Add runtime coverage for both global lastIndex searches and plain
exec() searches on zero-width unicode patterns.
When a backward greedy loop backtracked toward the right edge of the
input, the optimized scan for a following Char instruction could stop
making progress at end of input and loop forever. This made patterns
like /(?<=a.?)/ hang on non-matching input.
Keep the greedy built-in class fast path aligned with the regular VM
matcher for non-Unicode regexps. Without this, /\w+/i and /\W+/i
wrongly applied Unicode ignore-case behavior in the optimized loop.
Detect literal tails that lie on the linear success path and reject
matches early when that literal never appears in the remaining input.
This lets /(a+)+b/ fail quickly on long runs of a instead of spending
its backtracking budget proving that the missing b can never match.
Keep the tail analysis cheap while doing this. The new required-literal
hint reuses trailing-literal extraction, so rewrite the linear-tail
check to compute predecessor counts in one pass instead of rescanning
the whole program for every instruction. That keeps large regex parses,
including the large bytestring constructor test, fast.
Add a regression test for ("a".repeat(25)).match(/(a+)+b/), which
should return null without throwing a backtrack-limit error.
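A reduced version of the regression shape (shortened from the reported 25
repetitions so the check stays cheap on any engine):

```javascript
// Every match of /(a+)+b/ must contain a literal 'b'; an input without
// one can be rejected without exploring the exponential split tree.
const result = "a".repeat(20).match(/(a+)+b/);
```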
Patterns like `^(a|a?)+$` build a split tree where each `a` can be
matched by either alternative. On short non-matching inputs, that still
blows through the backtrack limit even though the two alternatives are
semantically equivalent to a single greedy optional matcher.
Detect the narrow case of two single-term alternatives that compile to
the same simple matcher, where one is required and the other is greedy
`?`, and compile them as that single optional term instead of emitting
a disjunction.
Add a String.prototype.match regression for
("a".repeat(25) + "b").match(/^(a|a?)+$/), which should return null
instead of throwing an InternalError.
Backward Char instructions read the code point immediately to the left
of the current position, but the greedy loop backtracking optimization
was scanning for the next literal at the current position itself.
That meant a lookbehind like `(?<=h.*)THIS` never reconsidered the
boundary after `h`, so valid matches were missed.
When the quantified part was allowed to shrink to zero, as in the
reported `(?<!.*q.*?)(?<=h.*)THIS(?=.*!)` pattern, the same
backtracking bug could thrash badly enough to appear hung.
Fix the backward greedy-loop scan to test candidate boundaries against
the code units immediately to their left. Do the same for supplementary
characters by checking the surrogate pair ending at that boundary.
Add String.prototype.match regressions for both the simple greedy
lookbehind and the full reported pattern.
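The simple greedy-lookbehind case that the fixed boundary scan now gets
right:

```javascript
// (?<=h.*) must reconsider every boundary after the 'h', testing the
// code unit immediately to the *left* of each candidate position.
const m = "hi THIS!".match(/(?<=h.*)THIS/);
```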
RepeatMatcher retries a quantified atom with its own captures cleared,
but if an additional greedy iteration matches the empty string the
engine must fall back to the pre-iteration state. The fast VM path was
clearing capture registers after backtracking from ProgressCheck,
which meant the restored state from the previous successful iteration
was immediately wiped out.
That showed up with nested quantified captures like
"xyz123xyz".match(/((123)|(xyz)*)*/), where the final empty expansion
of the outer `*` discarded the last non-empty captures and returned
undefined for groups 1 and 4.
The same area also needs to track each zero-width-capable iteration's
start position explicitly. Initializing that state with ProgressCheck
stored the end of the previous repetition instead, which regressed
patterns like `/(a*)*/` by letting an empty iteration commit `""`
into the capture instead of falling back to the pre-iteration state
with an undefined capture.
Clear captures before backtracking from a rejected empty iteration,
and save iteration starts before entering quantified bodies so
ProgressCheck only decides whether that iteration made progress.
Add regressions for the reported nested quantified capture case and
for `/(a*)*/.exec("b")`, which should leave the capture undefined.
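The `/(a*)*/` half of that in observable terms:

```javascript
// The outer '*' attempts one iteration; (a*) matches empty, the
// zero-width iteration is rejected, and the engine must fall back to
// the pre-iteration state, leaving group 1 undefined rather than "".
const m = /(a*)*/.exec("b");
```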
Drop the outdated note that GreedyLoop backtracking does not need a
register snapshot.
The previous commit started snapshotting registers for the optimized
GreedyLoop and LazyLoop states so capture groups are restored correctly
across backtracking. Keeping the old comment would describe the
opposite of the code we now rely on.
Snapshot registers for GreedyLoop and LazyLoop backtrack states so
failed alternatives cannot leak capture mutations into an older loop
choice point.
Before this change, those optimized states only restored the input
position and active modifiers. If a later branch changed capture
registers before failing, revisiting an earlier loop state reused
the stale captures instead of the state that was current when the
loop state was pushed.
That let /^(b+|a){1,2}?bc/ on "bbc" produce an invalid group 1 range
with start 2 and end 1, which later tripped UBSan while
RegExp.prototype.exec materialized the match result.
Add a RegExp.prototype.exec regression for this pattern so we keep
the expected ["bbc", "b"] result covered.
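The regression in exec() form:

```javascript
// Backtracking into the lazy {1,2}? choice point must restore the
// captures that were live when that state was pushed, yielding a valid
// group 1 of "b" rather than a corrupted range.
const m = /^(b+|a){1,2}?bc/.exec("bbc");
```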
Teach the Rust matcher to execute directly on ASCII-backed input.
Make the VM and literal fast paths generic over an input trait so we
can monomorphize separate ASCII and WTF-16 execution paths without
duplicating the regex semantics. Add ASCII-specific FFI entry points
and have the C++ bridge dispatch to them whenever Utf16View carries
ASCII storage.
This removes the per-match widening step from the hot path for exec(),
test(), and find_all(), which is exactly where LibJS often hands us
pure ASCII strings in 8-bit form. Keep the compiled representation
and reported capture offsets in UTF-16 code units so the observable
JavaScript behavior stays unchanged.
Delete the old C++ ECMA-262 parser, optimizer, and matcher now that all
in-tree users compile and execute through `ECMAScriptRegex`.
Stop building the legacy engine, remove its source files and the
POSIX-only fuzzers that depended on it, and update the remaining
LibRegex tests to target the Rust-backed facade instead of the deleted
implementation. Clean up the last includes, comments, and helper paths
that only existed to support the old backend.
After this commit LibRegex has a single ECMAScript engine in-tree,
eliminating duplicated maintenance and unifying future regex work.
Add `ECMAScriptRegex`, LibRegex's C++ facade for ECMAScript regexes.
The facade owns compilation, execution, captures, named groups, and
error translation for the Rust backend, which lets callers stop
depending on the legacy parser and matcher types directly. Use it in the
remaining non-LibJS callers: URLPattern, HTML input pattern handling,
and the places in LibHTTP that only needed token validation.
Where a full regex engine was unnecessary, replace those call sites with
direct character checks. Also update focused LibURL, LibHTTP, and WPT
coverage for the migrated callers and corrected surrogate handling.
Add LibRegex's new Rust ECMAScript regular expression engine.
Replace the old parser's direct pattern-to-bytecode pipeline with a
split architecture: parse patterns into a lossless AST first, then
lower that AST into bytecode for a dedicated backtracking VM. Keep the
syntax tree as the place for validation, analysis, and optimization
instead of teaching every transformation to rewrite partially built
bytecode.
Specialize this backend for the job LibJS actually needs. The old C++
engine shared one generic parser and matcher stack across ECMA-262 and
POSIX modes and supported both byte-string and UTF-16 inputs. The new
engine focuses on ECMA-262 semantics on WTF-16 data, which lets it
model lone surrogates and other JavaScript-specific behavior directly
instead of carrying POSIX and multi-encoding constraints through the
whole implementation.
Fill in the ECMAScript features needed to replace the old engine for
real web workloads: Unicode properties and sets, lookahead and
lookbehind, named groups and backreferences, modifier groups, string
properties, large quantifiers, lone surrogates, and the parser and VM
corner cases those features exercise.
Reshape the runtime around compile-time pattern hints and a hotter VM
loop. Pre-resolve Unicode properties, derive first-character,
character-class, and simple-scan filters, extract safe trailing
literals for anchored patterns, add literal and literal-alternation
fast paths, and keep reusable scratch storage for registers,
backtracking state, and modifier stacks. Teach `find_all` to stay
inside one VM so global searches stop paying setup costs on every
match.
Make those shortcuts semantics-aware instead of merely fast. In Unicode
mode, do not use literal fast paths for lone surrogates, since
ECMA-262 must not let `/\ud83d/u` match inside a surrogate pair.
Likewise, only derive end-anchor suffix hints when the suffix lies on
every path to `Match`, so lookarounds and disjunctions cannot skip into
a shared tail and produce false negatives.
This commit lands the Rust crate, the C++ wrapper, the build
integration, and the initial LibJS-side plumbing needed to exercise
the new engine under real RegExp callers before removing the legacy
backend.
Backreferences can match the empty string when the referenced group
didn't participate in the match, so we shouldn't add their length to the
match_length_minimum, as it makes us skip valid matches.
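The skipped-match shape in question: a backreference to a group that never
participated matches the empty string, so it contributes nothing to the
minimum match length.

```javascript
// Group 1 does not participate, \1 matches empty, and "y" must match.
const matched = /(x)?\1y/.test("y");
```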
The optimizer wasn't considering case-insensitive mode when checking for
overlap between the repeated element and what follows, so patterns like
`/a*A\d/i` failed to match because 'a' and 'A' weren't seen as
overlapping. We now compare them case-insensitively when the i flag is
set.
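The failing shape, for reference:

```javascript
// With /i, 'a' and 'A' overlap: a* greedily consumes "aaA" and must be
// willing to give the 'A' back so the trailing A\d can match.
const m = "aaA5".match(/a*A\d/i);
```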
`range_contains()` checked whether the lhs range was contained within
the query range rather than checking for overlap. This caused patterns
like `/A*[A-Z]/` to fail matching "A" because the optimizer didn't
detect that 'A' overlaps with [A-Z]. Similarly, `char_class_contains()`
only checked whether two character classes were identical, not whether
they overlapped, so patterns like `/\d*\w/` failed to match "1" because
\d and \w were not recognized as overlapping, even though all digits
are word characters.
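Both reported shapes, for reference:

```javascript
// Overlap, not containment or equality, is what matters: A* must give
// its 'A' back to [A-Z], and \d* its '1' back to \w.
const single = "A".match(/A*[A-Z]/)[0];
const digit = "1".match(/\d*\w/)[0];
```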
In non-Unicode mode, incomplete escape sequences like `\x0` or `\u00`
should be parsed as literal characters. `read_digits_as_string` consumed
hex digits but did not restore the parser position when fewer digits
than required were found, and `consume_escaped_code_point` did not
update `current_token` after falling back to literal 'u'.
When `\0` is followed by digits, we backtrack to parse it as a legacy
octal escape. We need to backtrack 2 characters, so
`parse_legacy_octal_escape` sees the leading `0` and can parse sequences
correctly.
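The Annex B behaviors both paragraphs describe, in observable terms:

```javascript
// An \x or \u escape with too few digits falls back to a literal 'x'
// or 'u' followed by the digits that did appear.
const hexFallback = /\x0/.exec("x0")[0];   // "x0"
const uniFallback = /\u00/.exec("u00")[0]; // "u00"

// \0 followed by a digit re-parses as a legacy octal escape.
const octal = /\01/.test("\u0001");
```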
These new traits are identical to `Traits<Integral>`, except that
calling `.hash()` will return the value itself instead of hashing it.
This should be used in cases where either the value is already a proper
hash, or using the value as a hash will yield "good enough" performance
in e.g. HashTable.
Types larger than 32 bits are folded in on themselves. Collision tests
on some popular hashing algorithms show that XOR folding slightly
increases the number of collisions, but this allows `IdentityHashTraits`
not to make any assumptions on which bits are the most relevant for the
final hash.
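A sketch of the folding idea, written here in JavaScript with BigInt
purely for illustration (the real IdentityHashTraits are C++ templates,
and `foldToU32` is a hypothetical name):

```javascript
// XOR-fold a 64-bit value down to a 32-bit hash: the two halves are
// combined, so no fixed window of bits is simply discarded and no
// assumption is made about which bits matter most.
function foldToU32(value) {
  const mask = 0xFFFFFFFFn;
  return ((value >> 32n) ^ (value & mask)) & mask;
}
const folded = foldToU32(0x1234567890ABCDEFn);
```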