Skip the generic HasProperty and Get loop when indexOf operates on a
simple packed array. In that case every index below length is an own
data property, so a direct scan of the packed indexed property storage
gives the same strict-equality result without the per-element property
lookup
ceremony.
Only use the fast path when the current packed storage size still
matches the length captured before fromIndex coercion, since that
coercion can run user code and mutate the receiver. Add coverage for
length and storage mutations during fromIndex coercion.
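A minimal JavaScript sketch of why the guard matters; the spec captures length before coercing fromIndex, and that coercion can run arbitrary user code, so any conformant engine should agree with these results:

```javascript
const a = [1, 2, 3];

// Packed fast path: every index below length is an own data property.
const idx = a.indexOf(2); // 1

// fromIndex coercion can run user code that mutates the receiver. The
// length captured before coercion (3) no longer matches the storage, so
// the scan must fall back to generic per-element lookups and find nothing.
const gone = a.indexOf(3, { valueOf() { a.length = 0; return 0; } }); // -1
```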
The parser only set `might_need_arguments_object` when an `arguments`
or `eval` Identifier went through `consume()`, but shorthand object
properties create the reference via `make_identifier()` directly. As
a result `function f() { return { arguments } }` allocated an
`arguments` local, never initialized it, and crashed at runtime when
the property was read.
Fall back to scope-driven detection: if scope analysis allocated a
non-lexical `arguments` local for the function, treat it as a real
arguments-object reference and emit `CreateArguments`. Skip the
fallback when a function declaration named `arguments` claims the
local, since that local belongs to the function, not the arguments
object.
Add a runtime test covering shorthand inside a free function and a
method, plus a regression test for `({ eval } = ...)` to confirm
destructuring assignment doesn't accidentally trigger arguments
materialization.
indexed_take_first() already memmoves elements down for both Packed and
Holey storage, but the caller at ArrayPrototype::shift() only entered
the fast path for Packed arrays. Holey arrays fell through to the
spec-literal per-element loop (has_property / get / set /
delete_property_or_throw), which is substantially slower.
Add a separate Holey predicate with the additional safety checks the
spec semantics require: default_prototype_chain_intact() (so
HasProperty on a hole doesn't escape to a poisoned prototype) and
extensible() (so set() on a hole slot doesn't create a new own
property on a non-extensible object). The existing Packed predicate
is left unchanged -- packed arrays don't need these checks because
every index in [0, size) is already an own data property.
Allows us to fail at Cloudflare Turnstile much faster!
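A small sketch of the observable shift() semantics the Holey fast path must preserve; note that the hole moves down rather than becoming an own "undefined" property:

```javascript
const a = [1, , 3]; // Holey: index 1 is a hole, not undefined

const first = a.shift(); // 1
// After shifting, index 0 is a hole and index 2 moved to index 1.
const holeStays = 0 in a; // false
```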
GetCalleeAndThisFromEnvironment treated a binding as initialized when
its value slot was not <empty>. Declarative bindings do not encode TDZ
in that slot, though: uninitialized bindings keep a separate initialized
flag and their value starts as undefined.
That let the first slow-path TDZ failure populate the environment cache,
then a second call at the same site reused the cached coordinate and
turned the required ReferenceError into a TypeError from calling
undefined.
Check Binding.initialized in the asm fast path instead and cover the
cached second-hit case with a regression test.
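A sketch of the cached second-hit shape: both calls resolve the same TDZ binding at the same site, so both must raise a ReferenceError:

```javascript
const results = [];

function runSite() {
  function call() {
    try { f(); results.push("no throw"); }
    catch (e) { results.push(e.constructor.name); }
  }
  call(); // slow path: TDZ failure, populates the environment cache
  call(); // cached coordinate: must still raise ReferenceError
  let f = () => {};
}
runSite();
// results: ["ReferenceError", "ReferenceError"]
```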
The asm interpreter already inlines ECMAScript calls, but builtin calls
still went through the generic C++ Call slow path even when the callee
was a plain native function pointer. That added an avoidable boundary
around hot builtin calls and kept asm from taking full advantage of the
new RawNativeFunction representation.
Teach the asm Call handler to recognize RawNativeFunction, allocate the
callee frame on the interpreter stack, copy the call-site arguments,
and jump straight to the stored C++ entry point.
NativeJavaScriptBackedFunction and other non-raw callees keep falling
through to the existing C++ slow path unchanged.
NativeFunction previously stored an AK::Function for every builtin,
even when the callable was just a plain C++ entry point. That mixed
together two different representations, made simple builtins carry
capture storage they did not need, and forced the GC to treat every
native function as if it might contain captured JS values.
Introduce RawNativeFunction for plain NativeFunctionPointer callees
and keep AK::Function-backed callables on a CapturingNativeFunction
subclass. Update the straightforward native registrations in LibJS
and LibWeb to use the raw representation, while leaving exported
Wasm functions on the capturing path because they still capture
state.
Wrap UniversalGlobalScope's byte-length strategy lambda in
Function<...> explicitly so it keeps selecting the capturing
NativeFunction::create overload.
First: pin the icu4x version to an exact release. Minor version
upgrades can introduce noisy deprecation warnings and API changes that
cause tests to fail, so let's pin the known-good version exactly.
This patch updates our Rust calendar module to use the new APIs. This
initially caused some test failures due to the new Date::try_new API
(which is the recommended replacement for Date::try_new_from_codes)
having quite a limited year range of +/-9999. So we must use other
APIs (Date::try_from_fields and calendrical_calculations::gregorian)
to avoid these limits.
http://github.com/unicode-org/icu4x/blob/main/CHANGELOG.md#icu4x-22
Pack the asm Call fast path metadata next to the executable pointer
so the interpreter can fetch both values with one paired load. This
removes several dependent shared-data loads from the hot path.
Keep the executable pointer and packed metadata in separate registers
through this binding so the fast path can still use the paired-load
layout after any non-strict this adjustment.
Lower the packed metadata flag checks correctly on x86_64 as well.
Those bits now live above bit 31, so the generator uses bt for single-
bit high masks and covers that path with a unit test.
Add a runtime test that exercises both object and global this binding
through the asm Call fast path.
Handle inline-eligible JS-to-JS Call directly in asmint.asm instead
of routing the whole operation through AsmInterpreter.cpp.
The asm handler now validates the callee, binds `this` for the
non-allocating cases, reserves the callee InterpreterStack frame,
populates the ExecutionContext header and Value tail, and enters the
callee bytecode at pc 0.
Keep the cases that need NewFunctionEnvironment() or sloppy `this`
boxing on a narrow helper that still builds an inline frame. This
preserves the existing inline-call semantics for promise-job ordering,
receiver binding, and sloppy global-this handling while keeping the
common path in assembly.
Add regression coverage for closure-capturing callees, sloppy
primitive receivers, and sloppy undefined receivers.
Handle Return and End entirely in AsmInt when leaving an inline frame.
The handlers now restore the caller, update the interpreter stack
bookkeeping directly, and bump the execution generation without
bouncing through AsmInterpreter.cpp.
Add WeakRef tests that exercise both inline Return and inline End
so this path stays covered.
Folded StringToNumber() and StringToBigInt() detected non-decimal
prefixes by slicing the string at byte offset 2. On UTF-8 input this
could split inside a multi-byte code point and panic.
To prevent this, we replace the byte-based split with ASCII prefix
stripping, and explicitly reject empty suffixes such as "0x", "0o",
and "0b" before parsing the remaining digits.
This makes non-decimal prefix folding UTF-8-safe and preserves the
expected invalid-result behavior for empty prefixed literals.
Tests:
Add regression coverage for folded StringToNumber() and StringToBigInt()
non-decimal prefix handling to validate the UTF-8 safety fix as
'string-to-number-and-bigint-non-decimal-prefixes.js'.
These tests ensure empty suffixes like "0x", "0o", and "0b" and
other invalid prefixed forms stay invalid, while valid prefixed
literals continue to be accepted.
We removed a byte-index split in folded StringToNumber()/StringToBigInt()
coercion that could panic when byte index 2 landed inside a multi-byte
UTF-8 scalar. Add regression tests,
'string-to-number-and-bigint-utf8-boundary.js', covering representative
panic-shape inputs to ensure these coercions now return invalid results
instead of crashing.
Unrolling a bounded quantifier {min,max} into (max-min) optional Split
chains lets the backtracker explore O(2^n) paths, which quickly
exhausts the backtrack limit for large bounds.
Fix this by compiling the optional tail via a RepeatStart/RepeatCheck
counted loop when the atom is known to be non-zero-width. The loop
is safe to use without a progress check precisely because the atom
cannot match empty.
This required making atom_can_be_zero_width recursive into group bodies:
previously it conservatively returned true for all Group and
NonCapturingGroup atoms, so the non-zero-width guard could never fire
for grouped subexpressions.
The old lowering triggered "Regular expression backtrack limit exceeded"
for patterns like /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/, causing
inputs that should match normally (or return null) to throw instead.
Fixes syntax highlighting of the C++ API on https://blend2d.com
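The reported pattern, exercised directly; with the counted-loop lowering both the match and the no-match cases complete without a backtrack-limit error:

```javascript
// A quoted string of up to 32 escaped or plain characters.
const re = /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/;

const hit = re.test("'hello'");          // true
const miss = "no quotes here".match(re); // null, without throwing
```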
Keep the legacy regexp static properties backed by PrimitiveString
values instead of eagerly copied Utf16Strings. lastMatch, leftContext,
rightContext, and $1-$9 now materialize lazy Substrings from the
original match input when accessed.
Keep RegExp.input as a separate slot from the match source so manual
writes do not rewrite the last match state. Add coverage for that
behavior and for rope-backed UTF-16 inputs.
Keep the primitive string that segment() creates alongside the UTF-16
buffer used by LibUnicode. Segment data objects can then return lazy
Substring instances for "segment" and reuse the original
PrimitiveString for "input" instead of copying both strings.
Add a rope-backed UTF-16 segmenter test that exercises both
containing() and iterator results.
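The segment-data shape involved, sketched with Intl.Segmenter; "segment" can be a lazy substring and "input" the original primitive string without any observable difference:

```javascript
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const data = segmenter.segment("hello world").containing(6);

const segment = data.segment; // "world"
const input = data.input;     // "hello world"
const index = data.index;     // 6
```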
Route the obvious substring-producing string operations through the
new PrimitiveString substring factory. Direct indexing, at(), charAt(),
slice(), substring(), substr(), and the plain-string split path can now
return lazy JS::Substring values backed by the original string.
Add runtime coverage for rope-backed string operations so these lazy
string slices stay exercised across both ASCII and UTF-16 inputs.
Return JS::Substring objects from the builtin regexp exec and split
paths instead of eagerly copying UTF-16 slices into new strings.
Matches, captures, and split pieces can now point back at the original
input until someone asks for the string contents.
Add focused runtime coverage for UTF-16 captures and regex split
captures so these lazy slices stay exercised.
Cache the flattened enumerable key snapshot for each `for..in` site and
reuse a `PropertyNameIterator` when the receiver shape, dictionary
generation, indexed storage kind and length, prototype chain
validity, and magical-length state still match.
Handle packed indexed receivers as well as plain named-property
objects. Teach `ObjectPropertyIteratorNext` in `asmint.asm` to return
cached property values directly and to fall back to the slow iterator
logic when any guard fails.
Treat arrays' hidden non-enumerable `length` property as a visited
name for for-in shadowing, and include the receiver's magical-length
state in the cache key so arrays and plain objects do not share
snapshots.
Add `test-js` and `test-js-bytecode` coverage for mixed numeric and
named keys, packed receiver transitions, re-entry, iterator reuse, GC
retention, array length shadowing, and same-site cache reuse.
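A sketch of the enumeration the cached snapshot must reproduce; indexed keys come first, then named keys, and the hidden non-enumerable `length` must never appear:

```javascript
const a = [10, 20];
a.named = "x";

const keys = [];
for (const k in a) keys.push(k);
// keys: ["0", "1", "named"]
```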
Teach the asm PutByValue path to materialize in-bounds holey array
elements directly when the receiver is a normal extensible Array with
the default prototype chain and no indexed interference. This avoids
bouncing through generic property setting while preserving the lazy
holey length model.
Keep the fast path narrow so inherited setters, inherited non-writable
properties, and non-extensible arrays still fall back to the generic
semantics. Add regression coverage for those cases alongside the large
holey array stress tests.
Treat setting a large array length as a logical length change instead of
forcing dictionary indexed storage or materializing every hole up front.
This keeps dense fills on Array(length) on the holey indexed path and
only falls back to sparse storage when later writes actually create a
large realized gap.
The asm indexed get/put fast paths assumed holey arrays always had a
materialized backing store. Guard those paths with a capacity check so
lazy holey arrays fall back safely until an index has been realized.
Add regression coverage for very large holey arrays and for densely
filling a large holey array after pre-sizing it with Array(length).
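The observable model being preserved, sketched in JavaScript; `Array(length)` only sets a logical length, in-bounds writes realize individual slots, and a dense fill stays cheap:

```javascript
const a = new Array(100000); // logical length only, nothing materialized
const empty = 0 in a;        // false

a[99999] = 1;                // in-bounds holey write realizes one slot
const tail = a[99999];       // 1

// Dense fill after pre-sizing with Array(length):
for (let i = 0; i < a.length; i++) a[i] = i;
const middle = a[12345];     // 12345
```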
Switch to LibUnicode’s ICU-backed functions.
Keep the explicit checks for '$', '_', U+200C, and U+200D that
ECMAScript requires on top of the Unicode properties.
Add test coverage for both the newly accepted case
and regression guards for cases that must continue to work.
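A sketch of the explicit ECMAScript additions on top of the Unicode properties; U+200C (ZWNJ) is a valid IdentifierPart even though ID_Continue alone would not allow it:

```javascript
// Identifier containing ZWNJ between two letters:
const viaZwnj = eval("var a\u200Cb = 7; a\u200Cb"); // 7

// '$' and '_' remain valid identifier characters.
const $_ok = 1;
```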
The proposal does not seem to have progressed for a while, and there is
an open issue about module imports that breaks HTML integration.
While we could probably make an ad-hoc change to fix that issue,
it is deep enough in the JS engine that I am not particularly
keen on making that change.
Until other browsers begin to make positive signals about shipping
ShadowRealms, let's remove our implementation for now.
There is still some cleanup that can be done with regard to the
HTML integration, but there are a few more items that need to be
untangled there.
When a multi-digit decimal escape like \81 exceeds the total capture
group count in non-Unicode mode, the parser falls back to legacy octal
reinterpretation. However, digits '8' and '9' are not valid in octal
(base 8), so passing them to parse_legacy_octal() caused an unwrap()
panic on None from char::to_digit(8).
Treat '8' and '9' as literal characters in the fallback path, matching
the behavior already present for the non-backreference case.
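A sketch of the fallback behavior, matching Annex B semantics in major engines:

```javascript
// With no capture group 8, \8 cannot be a backreference or a legacy
// octal escape, so it falls back to a literal "8".
const literalDigit = /\81/.test("81"); // true

// A decimal escape that does name an existing group stays a backreference.
const backref = /(\d)\1/.test("77");   // true
```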
Treat zero-length UTF-16 slices from Rust as empty views at the FFI
boundary instead of assuming a non-null backing pointer.
Add a regression test which crashed before these changes. Fixes
a crash loading github.com/ladybirdbrowser/ladybird.
Unicode-set intersection and subtraction always lowered their
post-consumption checks as lookbehinds. That is correct while the
outer matcher runs forward, but inside lookbehind the consumed text
sits to the right of the current position, so the checks must flip
to lookahead instead. Because we always looked left, patterns like
`(?<=[[^A-Z]--[A-Z]])\P{N}` and the reported fuzz case missed
matches whenever the character before the consumed one changed the
set-operation result.
Preserve the surrounding match direction when compiling those
checks, and add coverage for reduced subtraction and intersection
cases plus the original regression.
Negated unicode-set classes are only valid when every member is
single-code-point. We already rejected direct string-valued members
such as `\q{ab}` and `\p{RGI_Emoji_Flag_Sequence}` inside `[^...]`,
but nested class-set operands could still smuggle them through, so
patterns like `[^[[\p{Emoji_Keycap_Sequence}]]]` and the reported
fuzzed literal compiled instead of throwing.
Validate nested class-set expressions after parsing and reject only the
negated `/v` classes whose resulting multi-code-point strings are still
non-empty. Track the exact string members contributed by string
literals, string properties, and nested classes so intersections and
subtractions can eliminate them before the negated-class check runs.
Add constructor and literal coverage for the reduced nested-string
cases, the original regression, and valid negated set operations that
remove every string member.
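A sketch of the rejection cases; all three must throw a SyntaxError because a negated `/v` class may not contain multi-code-point strings, whether direct or smuggled through a nested class:

```javascript
const rejected = [];
for (const source of [
  "[^\\q{ab}]",                      // direct string literal member
  "[^\\p{RGI_Emoji_Flag_Sequence}]", // string-valued property
  "[^[\\q{ab}]]",                    // string inside a nested class
]) {
  try {
    new RegExp(source, "v");
    rejected.push(false);
  } catch (e) {
    rejected.push(e instanceof SyntaxError);
  }
}
// rejected: [true, true, true]
```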
RegExpBuiltinExec used to snap any Unicode lastIndex that landed on a
low surrogate back to the start of the pair. That matched `/😀/u`,
but it skipped valid empty matches when the original low-surrogate
position was itself matchable, such as `/\p{Script=Cyrillic}?(?<!\D)/v`
on `"A😘"` and the longer fuzzed global case.
Try the snapped position first, then retry the original lastIndex when
the snapped match fails. Only keep that second result when it is empty
at the original low-surrogate position, so consuming /u and /v matches
still cannot split a surrogate pair. In the Rust VM, treat backward
Unicode matches that start between surrogate halves as having no
complete code point to their left, which matches V8's lookbehind
behavior for those positions.
Add reduced coverage for both low-surrogate exec cases, the original
global match count regression, and the consuming-match retry regression.
Compile the synthetic assertion for negated classes in the same
direction as the surrounding matcher. We were hardcoding a
lookahead for `[^...]`, so lookbehind checked the wrong side of the
current position and missed valid `/v` matches such as
`(?<=[^\p{Emoji}])2`.
Apply the same fix to unicode set classes, since they use the same
negative-lookaround-plus-`AnyChar` lowering for complements. Add
reduced `RegExp.js` coverage for both `[^\p{Emoji}]` and
`[[^\p{Emoji}]]` in lookbehind, plus the original complex `/gv`
regression.
Repeated simple loops like "a+".repeat(100) compile to a chain of
greedy loop instructions. When one loop failed, the VM only knew how
to give back one character at a time unless the next instruction was a
literal char, so V8's regexp-fallback case ran into the backtrack
limit instead of finding the obvious match.
When a greedy simple loop is followed by more loops for the same
matcher, sum their minimum counts and backtrack far enough to leave the
missing suffix in one step. If that suffix is already available to the
right, still give back one character so the VM makes progress instead
of reusing the same greedy state forever.
The RegExp runtime test now covers the Chromium regexp-fallback case
through exec(), global exec(), and both replace() paths, plus bounded
same-matcher chains where the suffix minimum is partly missing or
already available.
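The chained-loop shape, sketched directly; with suffix-aware backtracking both results come back quickly instead of exhausting the backtrack limit:

```javascript
const chain = new RegExp("a+".repeat(100)); // /a+a+a+.../, 100 greedy loops

const exact = chain.test("a".repeat(100)); // true: each loop keeps one 'a'
const short = chain.test("a".repeat(99));  // false, and quickly
```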
The VM only marked patterns as anchored when the first real instruction
was AssertStart. That missed anchors hidden behind capture setup or a
leading positive lookahead, so patterns like /(^bar)/ and /(?=^bar)\w+/
fell back to whole-subject scanning.
Teach the hint analysis to see through the non-consuming wrappers we
emit ahead of a leading ^, but still run the literal prefilters before
anchored and sticky VM attempts. Missing required literals should stay
cheap no-matches instead of running the full backtracking VM and
raising the step limit.
The RegExp runtime test now covers the Chromium ascii-regexp-subject
case on a long ASCII input and anchored, sticky, and global no-match
cases where the required literal is absent.
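Reduced sketches of anchors hidden behind non-consuming wrappers; all three results follow directly from `^` semantics:

```javascript
const viaCapture = /(^bar)/.test("barbaz");        // true
const viaLookahead = /(?=^bar)\w+/.exec("barbaz"); // matches "barbaz" at 0
const unanchored = /(^bar)/.test("foobar");        // false: ^ holds only at 0
```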
Browsers clamp braced quantifier bounds above 2^31 - 1 before
checking whether {min,max} is in order. The parser still kept values
up to u32::MAX, so patterns like {2147483648,2147483647} were
rejected even though both bounds should collapse to the same limit.
Clamp parsed braced quantifier bounds to 2^31 - 1 as they are read.
This keeps the existing acceptance of huge exact and open-ended
quantifiers and makes the constructor and regex literal paths agree
with other engines on the out-of-order edge cases.
The RegExp runtime and syntax tests now cover accepted huge
quantifiers, clamped order validation, and huge literal forms. The
reported constructor and literal cases also match other engines.
Unicode set intersection and subtraction were compiled by matching one
operand and then checking the others with lookbehind. That let a
longer string operand reject a shorter match whenever the longer
string happened to end at the same position.
Group unicode set operands by exact match length and compile each
length class separately, longest first. This keeps longest-match
semantics for unions while making intersection and subtraction compare
only strings of the same length. The new RegExp runtime cases cover
both the reported [a-z]--\q{abc} regression and the related
intersection/subtraction mismatches, and they now agree with V8.
Reject surrogate pairs in named group names unless both halves come
from the same raw form. A literal surrogate half was being
normalized into \uXXXX before LibRegex parsed the pattern, which let
mixed literal and escaped forms sneak through.
Validate surrogate handling on the UTF-16 pattern before
normalization, but only treat \k<...> as a named backreference when
the parser would do that too. Legacy regexes without named groups
still use \k as an identity escape, so their literal text must not be
rejected by the pre-scan.
Add runtime and syntax tests for the mixed forms, the valid literal,
fixed-width, and braced escape cases, and the legacy \k literals.
Sticky regular expressions were still using the generic forward-search
paths inside LibRegex and only enforcing the lastIndex check back in
LibJS after a match had already been found. That made tokenizer-style
sticky patterns spend most of their time scanning for later matches
that would be thrown away.
Route sticky exec() and test() through an anchored VM entry point that
runs exactly once at the requested start position while keeping the
existing literal-hint pruning. Add focused test-js coverage for sticky
literals, alternations, classes, quantifiers, and WebIDL-style token
patterns.
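A sketch of the sticky semantics the anchored entry point preserves; a sticky regex matches at lastIndex or not at all, so the engine never needs to scan ahead:

```javascript
const re = /\w+/y;
const s = "ab, cd";

re.lastIndex = 4;
const hit = re.exec(s);  // ["cd"], lastIndex advances to 6
const after = re.lastIndex;

re.lastIndex = 2;
const miss = re.exec(s); // null: sticky must not scan ahead to index 4
```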
Intersection and subtraction chains in unicode sets mode were accepting
bare range operands like `[a-z&&b-y]` and `[a-z--[aeiou]]`. V8 rejects
those forms; the range has to stay inside a nested class before it can
participate in a set operation.
Reject `ClassSetOperand::Range` when building `&&` or `--` expressions,
and extend the runtime regexp tests with the reported invalid patterns
plus an escaped-endpoint range case.
Preserve V8's behavior for bare single-astral literals when a unicode
global search starts in the middle of a surrogate pair. We were
snapping that lastIndex back to the pair start unconditionally,
which let /😀/gu and /\u{1F600}/gu match where V8 returns null.
Expose that literal shape from LibRegex to LibJS and add runtime
coverage for the bare literal case alongside a grouped control.
Track whether a pattern can match empty and only skip interior
surrogate positions when the matcher must consume input. This keeps
the unicode candidate scan fast for consuming searches without
dropping valid zero-width matches such as /\B/ and /(?!...)/ between
a surrogate pair's two code units.
Add runtime coverage for both global lastIndex searches and plain
exec() searches on zero-width unicode patterns.
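A reduced sketch of the bare single-astral case described above; starting the search between the surrogate halves yields no match, matching V8:

```javascript
const re = /\u{1F600}/gu;          // bare single-astral literal
re.lastIndex = 1;                  // between the surrogate halves
const midPair = re.exec("\u{1F600}"); // null
```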
When a backward greedy loop backtracked toward the right edge of the
input, the optimized scan for a following Char instruction could stop
making progress at end of input and loop forever. This made patterns
like /(?<=a.?)/ hang on non-matching input.
Keep the greedy built-in class fast path aligned with the regular VM
matcher for non-Unicode regexps. Without this, /\w+/i and /\W+/i
wrongly applied Unicode ignore-case behavior in the optimized loop.
The fast path in RegExp.prototype.test() checked for an own "exec"
property on the instance via storage_has(), but did not detect when
RegExp.prototype.exec itself had been replaced. This meant overriding
exec on the prototype was silently ignored, violating the spec which
requires test() to go through RegExpExec() and thus the overridable
exec method.
Fix this by resolving "exec" via a full prototype chain lookup and
checking whether the result is still the built-in exec, matching the
approach already used in Symbol.replace's fast path.
Detect literal tails that lie on the linear success path and reject
matches early when that literal never appears in the remaining input.
This lets /(a+)+b/ fail quickly on long runs of a instead of spending
its backtracking budget proving that the missing b can never match.
Keep the tail analysis cheap while doing this. The new required-literal
hint reuses trailing-literal extraction, so rewrite the linear-tail
check to compute predecessor counts in one pass instead of rescanning
the whole program for every instruction. That keeps large regex parses,
including the large bytestring constructor test, fast.
Add a regression test for ("a".repeat(25)).match(/(a+)+b/), which
should return null without throwing a backtrack-limit error.
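A reduced form of the regression (15 characters instead of 25 so it stays quick even without the required-literal hint):

```javascript
// The tail literal 'b' never appears, so the match must fail fast
// instead of exhausting the backtracking budget.
const reduced = "a".repeat(15).match(/(a+)+b/); // null, not a thrown error
```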
Patterns like `^(a|a?)+$` build a split tree where each `a` can be
matched by either alternative. On short non-matching inputs that still
blows through the backtrack limit even though the two alternatives are
semantically equivalent to a single greedy optional matcher.
Detect the narrow case of two single-term alternatives that compile to
the same simple matcher, where one is required and the other is greedy
`?`, and compile them as that single optional term instead of emitting
a disjunction.
Add a String.prototype.match regression for
("a".repeat(25) + "b").match(/^(a|a?)+$/), which should return null
instead of throwing an InternalError.
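A reduced form of the regression (12 repetitions instead of 25 to bound the search space):

```javascript
// The trailing 'b' makes $ unmatchable; the equivalent-alternative
// folding keeps this from blowing the backtrack limit.
const outcome = ("a".repeat(12) + "b").match(/^(a|a?)+$/); // null
```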
Backward Char instructions read the code point immediately to the left
of the current position, but the greedy loop backtracking optimization
was scanning for the next literal at the current position itself.
That meant a lookbehind like `(?<=h.*)THIS` never reconsidered the
boundary after `h`, so valid matches were missed.
When the quantified part was allowed to shrink to zero, as in the
reported `(?<!.*q.*?)(?<=h.*)THIS(?=.*!)` pattern, the same
backtracking bug could thrash badly enough to appear hung.
Fix the backward greedy-loop scan to test candidate boundaries against
the code units immediately to their left. Do the same for supplementary
characters by checking the surrogate pair ending at that boundary.
Add String.prototype.match regressions for both the simple greedy
lookbehind and the full reported pattern.
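The two regression shapes, sketched directly; both must find the match at the boundary after `h` rather than missing it or hanging:

```javascript
const simple = "hello THIS!".match(/(?<=h.*)THIS/)[0];                  // "THIS"
const full = "hello THIS!".match(/(?<!.*q.*?)(?<=h.*)THIS(?=.*!)/)[0];  // "THIS"
```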
RepeatMatcher retries a quantified atom with its own captures cleared,
but if an additional greedy iteration matches the empty string the
engine must fall back to the pre-iteration state. The fast VM path was
clearing capture registers after backtracking from ProgressCheck,
which meant the restored state from the previous successful iteration
was immediately wiped out.
That showed up with nested quantified captures like
"xyz123xyz".match(/((123)|(xyz)*)*/), where the final empty expansion
of the outer `*` discarded the last non-empty captures and returned
undefined for groups 1 and 4.
The same area also needs to track each zero-width-capable iteration's
start position explicitly. Initializing that state with ProgressCheck
stored the end of the previous repetition instead, which regressed
patterns like `/(a*)*/` by letting an empty iteration commit `""`
into the capture instead of falling back to the pre-iteration state
with an undefined capture.
Clear captures before backtracking from a rejected empty iteration,
and save iteration starts before entering quantified bodies so
ProgressCheck only decides whether that iteration made progress.
Add regressions for the reported nested quantified capture case and
for `/(a*)*/.exec("b")`, which should leave the capture undefined.
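The `/(a*)*/` case from above, sketched directly; the empty final iteration must be rolled back rather than committing `""` into the capture:

```javascript
const r = /(a*)*/.exec("b");
const whole = r[0]; // ""
const group = r[1]; // undefined: empty iteration rolled back
```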
Snapshot registers for GreedyLoop and LazyLoop backtrack states so
failed alternatives cannot leak capture mutations into an older loop
choice point.
Before this change, those optimized states only restored the input
position and active modifiers. If a later branch changed capture
registers before failing, revisiting an earlier loop state reused
the stale captures instead of the state that was current when the
loop state was pushed.
That let /^(b+|a){1,2}?bc/ on "bbc" produce an invalid group 1 range
with start 2 and end 1, which later tripped UBSan while
RegExp.prototype.exec materialized the match result.
Add a RegExp.prototype.exec regression for this pattern so we keep
the expected ["bbc", "b"] result covered.
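The regression sketched directly; the first iteration's `b+` gives back one 'b' so that "bc" can match, and group 1 must keep a valid range:

```javascript
const r2 = /^(b+|a){1,2}?bc/.exec("bbc");
// r2: ["bbc", "b"]
```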
Switch LibJS `RegExp` over to the Rust-backed `ECMAScriptRegex` APIs.
Route `new RegExp()`, regex literals, and the RegExp builtins through
the new compile and exec APIs, and stop re-validating patterns with the
deleted C++ parser on the way in. Preserve the observable error
behavior by carrying structured compile errors and backtracking-limit
failures across the FFI boundary. Cache compiled regex state and named
capture metadata on `RegExpObject` in the new representation.
Use the new API surface to simplify and speed up the builtin paths too:
share `exec_internal`, cache compiled regex pointers, keep the legacy
RegExp statics lazy, run global replace through batch `find_all`, and
optimize replace, test, split, and String helper paths. Add regression
tests for those JavaScript-visible paths.
Add LibRegex's new Rust ECMAScript regular expression engine.
Replace the old parser's direct pattern-to-bytecode pipeline with a
split architecture: parse patterns into a lossless AST first, then
lower that AST into bytecode for a dedicated backtracking VM. Keep the
syntax tree as the place for validation, analysis, and optimization
instead of teaching every transformation to rewrite partially built
bytecode.
Specialize this backend for the job LibJS actually needs. The old C++
engine shared one generic parser and matcher stack across ECMA-262 and
POSIX modes and supported both byte-string and UTF-16 inputs. The new
engine focuses on ECMA-262 semantics on WTF-16 data, which lets it
model lone surrogates and other JavaScript-specific behavior directly
instead of carrying POSIX and multi-encoding constraints through the
whole implementation.
Fill in the ECMAScript features needed to replace the old engine for
real web workloads: Unicode properties and sets, lookahead and
lookbehind, named groups and backreferences, modifier groups, string
properties, large quantifiers, lone surrogates, and the parser and VM
corner cases those features exercise.
Reshape the runtime around compile-time pattern hints and a hotter VM
loop. Pre-resolve Unicode properties, derive first-character,
character-class, and simple-scan filters, extract safe trailing
literals for anchored patterns, add literal and literal-alternation
fast paths, and keep reusable scratch storage for registers,
backtracking state, and modifier stacks. Teach `find_all` to stay
inside one VM so global searches stop paying setup costs on every
match.
Make those shortcuts semantics-aware instead of merely fast. In Unicode
mode, do not use literal fast paths for lone surrogates, since
ECMA-262 must not let `/\ud83d/u` match inside a surrogate pair.
Likewise, only derive end-anchor suffix hints when the suffix lies on
every path to `Match`, so lookarounds and disjunctions cannot skip into
a shared tail and produce false negatives.
This commit lands the Rust crate, the C++ wrapper, the build
integration, and the initial LibJS-side plumbing needed to exercise
the new engine under real RegExp callers before removing the legacy
backend.
Add scripts for importing the V8 and WebKit regexp suites into
`Tests/LibJS/Runtime/3rdparty/` and commit the imported tests.
Also update the local harness helpers so these suites can run under
LibJS. In particular, teach the assertion shims to compare RegExp values
the way the imported tests expect and to treat `new Function(...)`
throwing as a valid `assertThrows` case.
This gives the regex rewrite a large bank of external conformance tests
that exercise parser and matcher behavior beyond in-tree coverage.
Dictionary shapes are mutable (properties added/removed in-place via
add_property_without_transition), so sharing them between objects via
the NewObject premade shape cache is unsafe.
When a large object literal (>64 properties) is created repeatedly in
a loop, the first execution transitions to a dictionary shape, which
CacheObjectShape then caches. Subsequent iterations create new objects
all pointing to the same dictionary shape. If any of these objects adds
a new property, it mutates the shared shape in-place, increasing its
property_count, but only grows its own named property storage. Other
objects sharing the shape are left with undersized storage, leading to
a heap-buffer-overflow when the GC visits their edges.
Fix this by not caching dictionary shapes. This means object literals
with >64 properties won't get the premade-shape fast path, but such
literals are uncommon.
Noticed this pattern when reading some minified JS while debugging a
seemingly unrelated problem and immediately got suspicious because of my
earlier, similar fixes.
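A reduced sketch of the hazardous pattern (the property names p0..p69 and "extra" are hypothetical); a >64-property literal evaluated at one site in a loop, with each instance then growing by one property:

```javascript
// Build a 70-property object literal source string once.
const literalSource =
  "({" + Array.from({ length: 70 }, (_, i) => "p" + i + ": " + i).join(",") + "})";

const objects = [];
for (let i = 0; i < 3; i++) {
  const o = eval(literalSource); // same NewObject site each iteration
  o["extra" + i] = i;           // must not corrupt the siblings' storage
  objects.push(o);
}
```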