Skip the generic HasProperty and Get loop when indexOf operates on a
simple packed array. In that case every index below length is an own
data property, so a direct scan of the packed indexed property storage
gives the same strict-equality result without the per-element property
lookup
ceremony.
Only use the fast path when the current packed storage size still
matches the length captured before fromIndex coercion, since that
coercion can run user code and mutate the receiver. Add coverage for
length and storage mutations during fromIndex coercion.
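A minimal JavaScript sketch of why the guard matters; the spec captures length before coercing fromIndex, and that coercion can run arbitrary user code, so any conformant engine should agree with these results:

```javascript
const a = [1, 2, 3];

// Packed fast path: every index below length is an own data property.
const idx = a.indexOf(2); // 1

// fromIndex coercion can run user code that mutates the receiver. The
// length captured before coercion (3) no longer matches the storage, so
// the scan must fall back to generic per-element lookups and find nothing.
const gone = a.indexOf(3, { valueOf() { a.length = 0; return 0; } }); // -1
```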
The parser only set `might_need_arguments_object` when an `arguments`
or `eval` Identifier went through `consume()`, but shorthand object
properties create the reference via `make_identifier()` directly. As
a result `function f() { return { arguments } }` allocated an
`arguments` local, never initialized it, and crashed at runtime when
the property was read.
Fall back to scope-driven detection: if scope analysis allocated a
non-lexical `arguments` local for the function, treat it as a real
arguments-object reference and emit `CreateArguments`. Skip the
fallback when a function declaration named `arguments` claims the
local, since that local belongs to the function, not the arguments
object.
Add a runtime test covering shorthand inside a free function and a
method, plus a regression test for `({ eval } = ...)` to confirm
destructuring assignment doesn't accidentally trigger arguments
materialization.
indexed_take_first() already memmoves elements down for both Packed and
Holey storage, but the caller at ArrayPrototype::shift() only entered
the fast path for Packed arrays. Holey arrays fell through to the
spec-literal per-element loop (has_property / get / set /
delete_property_or_throw), which is substantially slower.
Add a separate Holey predicate with the additional safety checks the
spec semantics require: default_prototype_chain_intact() (so
HasProperty on a hole doesn't escape to a poisoned prototype) and
extensible() (so set() on a hole slot doesn't create a new own
property on a non-extensible object). The existing Packed predicate
is left unchanged -- packed arrays don't need these checks because
every index in [0, size) is already an own data property.
Allows us to fail at Cloudflare Turnstile much faster!
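A small sketch of the observable shift() semantics the Holey fast path must preserve; note that the hole moves down rather than becoming an own "undefined" property:

```javascript
const a = [1, , 3]; // Holey: index 1 is a hole, not undefined

const first = a.shift(); // 1
// After shifting, index 0 is a hole and index 2 moved to index 1.
const holeStays = 0 in a; // false
```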
GetCalleeAndThisFromEnvironment treated a binding as initialized when
its value slot was not <empty>. Declarative bindings do not encode TDZ
in that slot, though: uninitialized bindings keep a separate initialized
flag and their value starts as undefined.
That let the first slow-path TDZ failure populate the environment cache,
then a second call at the same site reused the cached coordinate and
turned the required ReferenceError into a TypeError from calling
undefined.
Check Binding.initialized in the asm fast path instead and cover the
cached second-hit case with a regression test.
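A sketch of the cached second-hit shape: both calls resolve the same TDZ binding at the same site, so both must raise a ReferenceError:

```javascript
const results = [];

function runSite() {
  function call() {
    try { f(); results.push("no throw"); }
    catch (e) { results.push(e.constructor.name); }
  }
  call(); // slow path: TDZ failure, populates the environment cache
  call(); // cached coordinate: must still raise ReferenceError
  let f = () => {};
}
runSite();
// results: ["ReferenceError", "ReferenceError"]
```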
The asm interpreter already inlines ECMAScript calls, but builtin calls
still went through the generic C++ Call slow path even when the callee
was a plain native function pointer. That added an avoidable boundary
around hot builtin calls and kept asm from taking full advantage of the
new RawNativeFunction representation.
Teach the asm Call handler to recognize RawNativeFunction, allocate the
callee frame on the interpreter stack, copy the call-site arguments,
and jump straight to the stored C++ entry point.
NativeJavaScriptBackedFunction and other non-raw callees keep falling
through to the existing C++ slow path unchanged.
NativeFunction previously stored an AK::Function for every builtin,
even when the callable was just a plain C++ entry point. That mixed
together two different representations, made simple builtins carry
capture storage they did not need, and forced the GC to treat every
native function as if it might contain captured JS values.
Introduce RawNativeFunction for plain NativeFunctionPointer callees
and keep AK::Function-backed callables on a CapturingNativeFunction
subclass. Update the straightforward native registrations in LibJS
and LibWeb to use the raw representation, while leaving exported
Wasm functions on the capturing path because they still capture
state.
Wrap UniversalGlobalScope's byte-length strategy lambda in
Function<...> explicitly so it keeps selecting the capturing
NativeFunction::create overload.
First: pin the icu4x version to an exact release. Minor version
upgrades can introduce noisy deprecation warnings and API changes that
cause tests to fail, so let's pin the known-good version exactly.
This patch updates our Rust calendar module to use the new APIs. This
initially caused some test failures due to the new Date::try_new API
(which is the recommended replacement for Date::try_new_from_codes)
having quite a limited year range of +/-9999. So we must use other
APIs (Date::try_from_fields and calendrical_calculations::gregorian)
to avoid these limits.
http://github.com/unicode-org/icu4x/blob/main/CHANGELOG.md#icu4x-22
Pack the asm Call fast path metadata next to the executable pointer
so the interpreter can fetch both values with one paired load. This
removes several dependent shared-data loads from the hot path.
Keep the executable pointer and packed metadata in separate registers
through this binding so the fast path can still use the paired-load
layout after any non-strict this adjustment.
Lower the packed metadata flag checks correctly on x86_64 as well.
Those bits now live above bit 31, so the generator uses bt for single-
bit high masks and covers that path with a unit test.
Add a runtime test that exercises both object and global this binding
through the asm Call fast path.
Handle inline-eligible JS-to-JS Call directly in asmint.asm instead
of routing the whole operation through AsmInterpreter.cpp.
The asm handler now validates the callee, binds `this` for the
non-allocating cases, reserves the callee InterpreterStack frame,
populates the ExecutionContext header and Value tail, and enters the
callee bytecode at pc 0.
Keep the cases that need NewFunctionEnvironment() or sloppy `this`
boxing on a narrow helper that still builds an inline frame. This
preserves the existing inline-call semantics for promise-job ordering,
receiver binding, and sloppy global-this handling while keeping the
common path in assembly.
Add regression coverage for closure-capturing callees, sloppy
primitive receivers, and sloppy undefined receivers.
Handle Return and End entirely in AsmInt when leaving an inline frame.
The handlers now restore the caller, update the interpreter stack
bookkeeping directly, and bump the execution generation without
bouncing through AsmInterpreter.cpp.
Add WeakRef tests that exercise both inline Return and inline End
so this path stays covered.
Folded StringToNumber() and StringToBigInt() detected non-decimal
prefixes by slicing the string at byte offset 2. On UTF-8 input this
could split inside a multi-byte code point and panic.
To prevent this, we replace the byte-based split with ASCII prefix
stripping, and explicitly reject empty suffixes such as "0x", "0o",
and "0b" before parsing the remaining digits.
This makes non-decimal prefix folding UTF-8-safe and preserves the
expected invalid-result behavior for empty prefixed literals.
Tests:
Add regression coverage for folded StringToNumber() and StringToBigInt()
non-decimal prefix handling to validate the UTF-8 safety fix as
'string-to-number-and-bigint-non-decimal-prefixes.js'.
These tests ensure empty suffixes like "0x", "0o", and "0b" and
other invalid prefixed forms stay invalid, while valid prefixed
literals continue to be accepted.
We removed a byte-index split in folded StringToNumber()/StringToBigInt()
coercion that could panic when byte index 2 landed inside a multi-byte
UTF-8 scalar. Add regression tests,
'string-to-number-and-bigint-utf8-boundary.js', covering representative
panic-shape inputs to ensure these coercions now return invalid results
instead of crashing.
Unrolling a bounded quantifier {min,max} into (max-min) optional Split
chains lets the backtracker explore O(2^n) paths, which quickly
exhausts the backtrack limit for large bounds.
Fix this by compiling the optional tail via a RepeatStart/RepeatCheck
counted loop when the atom is known to be non-zero-width. The loop
is safe to use without a progress check precisely because the atom
cannot match empty.
This required making atom_can_be_zero_width recursive into group bodies:
previously it conservatively returned true for all Group and
NonCapturingGroup atoms, so the non-zero-width guard could never fire
for grouped subexpressions.
The old lowering triggered "Regular expression backtrack limit exceeded"
for patterns like /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/, causing
inputs that should match normally (or return null) to throw instead.
Fixes syntax highlighting of the C++ API on https://blend2d.com
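The reported pattern, exercised directly; with the counted-loop lowering both the match and the no-match cases complete without a backtrack-limit error:

```javascript
// A quoted string of up to 32 escaped or plain characters.
const re = /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/;

const hit = re.test("'hello'");          // true
const miss = "no quotes here".match(re); // null, without throwing
```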
Keep the legacy regexp static properties backed by PrimitiveString
values instead of eagerly copied Utf16Strings. lastMatch, leftContext,
rightContext, and $1-$9 now materialize lazy Substrings from the
original match input when accessed.
Keep RegExp.input as a separate slot from the match source so manual
writes do not rewrite the last match state. Add coverage for that
behavior and for rope-backed UTF-16 inputs.
Keep the primitive string that segment() creates alongside the UTF-16
buffer used by LibUnicode. Segment data objects can then return lazy
Substring instances for "segment" and reuse the original
PrimitiveString for "input" instead of copying both strings.
Add a rope-backed UTF-16 segmenter test that exercises both
containing() and iterator results.
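The segment-data shape involved, sketched with Intl.Segmenter; "segment" can be a lazy substring and "input" the original primitive string without any observable difference:

```javascript
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const data = segmenter.segment("hello world").containing(6);

const segment = data.segment; // "world"
const input = data.input;     // "hello world"
const index = data.index;     // 6
```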
Route the obvious substring-producing string operations through the
new PrimitiveString substring factory. Direct indexing, at(), charAt(),
slice(), substring(), substr(), and the plain-string split path can now
return lazy JS::Substring values backed by the original string.
Add runtime coverage for rope-backed string operations so these lazy
string slices stay exercised across both ASCII and UTF-16 inputs.
Return JS::Substring objects from the builtin regexp exec and split
paths instead of eagerly copying UTF-16 slices into new strings.
Matches, captures, and split pieces can now point back at the original
input until someone asks for the string contents.
Add focused runtime coverage for UTF-16 captures and regex split
captures so these lazy slices stay exercised.
Cache the flattened enumerable key snapshot for each `for..in` site and
reuse a `PropertyNameIterator` when the receiver shape, dictionary
generation, indexed storage kind and length, prototype chain
validity, and magical-length state still match.
Handle packed indexed receivers as well as plain named-property
objects. Teach `ObjectPropertyIteratorNext` in `asmint.asm` to return
cached property values directly and to fall back to the slow iterator
logic when any guard fails.
Treat arrays' hidden non-enumerable `length` property as a visited
name for for-in shadowing, and include the receiver's magical-length
state in the cache key so arrays and plain objects do not share
snapshots.
Add `test-js` and `test-js-bytecode` coverage for mixed numeric and
named keys, packed receiver transitions, re-entry, iterator reuse, GC
retention, array length shadowing, and same-site cache reuse.
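A sketch of the enumeration the cached snapshot must reproduce; indexed keys come first, then named keys, and the hidden non-enumerable `length` must never appear:

```javascript
const a = [10, 20];
a.named = "x";

const keys = [];
for (const k in a) keys.push(k);
// keys: ["0", "1", "named"]
```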
Teach the asm PutByValue path to materialize in-bounds holey array
elements directly when the receiver is a normal extensible Array with
the default prototype chain and no indexed interference. This avoids
bouncing through generic property setting while preserving the lazy
holey length model.
Keep the fast path narrow so inherited setters, inherited non-writable
properties, and non-extensible arrays still fall back to the generic
semantics. Add regression coverage for those cases alongside the large
holey array stress tests.
Treat setting a large array length as a logical length change instead of
forcing dictionary indexed storage or materializing every hole up front.
This keeps dense fills on Array(length) on the holey indexed path and
only falls back to sparse storage when later writes actually create a
large realized gap.
The asm indexed get/put fast paths assumed holey arrays always had a
materialized backing store. Guard those paths with a capacity check so
lazy holey arrays fall back safely until an index has been realized.
Add regression coverage for very large holey arrays and for densely
filling a large holey array after pre-sizing it with Array(length).
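The observable model being preserved, sketched in JavaScript; `Array(length)` only sets a logical length, in-bounds writes realize individual slots, and a dense fill stays cheap:

```javascript
const a = new Array(100000); // logical length only, nothing materialized
const empty = 0 in a;        // false

a[99999] = 1;                // in-bounds holey write realizes one slot
const tail = a[99999];       // 1

// Dense fill after pre-sizing with Array(length):
for (let i = 0; i < a.length; i++) a[i] = i;
const middle = a[12345];     // 12345
```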
Switch to LibUnicode’s ICU-backed functions.
Keep the explicit checks for '$', '_', U+200C, and U+200D that
ECMAScript requires on top of the Unicode properties.
Add test coverage for both the newly accepted case
and regression guards for cases that must continue to work.
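A sketch of the explicit ECMAScript additions on top of the Unicode properties; U+200C (ZWNJ) is a valid IdentifierPart even though ID_Continue alone would not allow it:

```javascript
// Identifier containing ZWNJ between two letters:
const viaZwnj = eval("var a\u200Cb = 7; a\u200Cb"); // 7

// '$' and '_' remain valid identifier characters.
const $_ok = 1;
```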
The proposal does not seem to have progressed for a while, and there is
an open issue about module imports that breaks HTML integration.
While we could probably make an ad-hoc change to fix that issue,
it is deep enough in the JS engine that I am not particularly
keen on making that change.
Until other browsers begin to make positive signals about shipping
ShadowRealms, let's remove our implementation for now.
There is still some cleanup that can be done with regard to the
HTML integration, but there are a few more items that need to be
untangled there.
When a multi-digit decimal escape like \81 exceeds the total capture
group count in non-Unicode mode, the parser falls back to legacy octal
reinterpretation. However, digits '8' and '9' are not valid in octal
(base 8), so passing them to parse_legacy_octal() caused an unwrap()
panic on None from char::to_digit(8).
Treat '8' and '9' as literal characters in the fallback path, matching
the behavior already present for the non-backreference case.
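A sketch of the fallback behavior, matching Annex B semantics in major engines:

```javascript
// With no capture group 8, \8 cannot be a backreference or a legacy
// octal escape, so it falls back to a literal "8".
const literalDigit = /\81/.test("81"); // true

// A decimal escape that does name an existing group stays a backreference.
const backref = /(\d)\1/.test("77");   // true
```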
Treat zero-length UTF-16 slices from Rust as empty views at the FFI
boundary instead of assuming a non-null backing pointer.
Add a regression test which crashed before these changes. Fixes
a crash loading github.com/ladybirdbrowser/ladybird.
Unicode-set intersection and subtraction always lowered their
post-consumption checks as lookbehinds. That is correct while the
outer matcher runs forward, but inside lookbehind the consumed text
sits to the right of the current position, so the checks must flip
to lookahead instead. Because we always looked left, patterns like
`(?<=[[^A-Z]--[A-Z]])\P{N}` and the reported fuzz case missed
matches whenever the character before the consumed one changed the
set-operation result.
Preserve the surrounding match direction when compiling those
checks, and add coverage for reduced subtraction and intersection
cases plus the original regression.
Negated unicode-set classes are only valid when every member is
single-code-point. We already rejected direct string-valued members
such as `\q{ab}` and `\p{RGI_Emoji_Flag_Sequence}` inside `[^...]`,
but nested class-set operands could still smuggle them through, so
patterns like `[^[[\p{Emoji_Keycap_Sequence}]]]` and the reported
fuzzed literal compiled instead of throwing.
Validate nested class-set expressions after parsing and reject only the
negated `/v` classes whose resulting multi-code-point strings are still
non-empty. Track the exact string members contributed by string
literals, string properties, and nested classes so intersections and
subtractions can eliminate them before the negated-class check runs.
Add constructor and literal coverage for the reduced nested-string
cases, the original regression, and valid negated set operations that
remove every string member.
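A sketch of the rejection cases; all three must throw a SyntaxError because a negated `/v` class may not contain multi-code-point strings, whether direct or smuggled through a nested class:

```javascript
const rejected = [];
for (const source of [
  "[^\\q{ab}]",                      // direct string literal member
  "[^\\p{RGI_Emoji_Flag_Sequence}]", // string-valued property
  "[^[\\q{ab}]]",                    // string inside a nested class
]) {
  try {
    new RegExp(source, "v");
    rejected.push(false);
  } catch (e) {
    rejected.push(e instanceof SyntaxError);
  }
}
// rejected: [true, true, true]
```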
RegExpBuiltinExec used to snap any Unicode lastIndex that landed on a
low surrogate back to the start of the pair. That matched `/😀/u`,
but it skipped valid empty matches when the original low-surrogate
position was itself matchable, such as `/\p{Script=Cyrillic}?(?<!\D)/v`
on `"A😘"` and the longer fuzzed global case.
Try the snapped position first, then retry the original lastIndex when
the snapped match fails. Only keep that second result when it is empty
at the original low-surrogate position, so consuming /u and /v matches
still cannot split a surrogate pair. In the Rust VM, treat backward
Unicode matches that start between surrogate halves as having no
complete code point to their left, which matches V8's lookbehind
behavior for those positions.
Add reduced coverage for both low-surrogate exec cases, the original
global match count regression, and the consuming-match retry regression.
Compile the synthetic assertion for negated classes in the same
direction as the surrounding matcher. We were hardcoding a
lookahead for `[^...]`, so lookbehind checked the wrong side of the
current position and missed valid `/v` matches such as
`(?<=[^\p{Emoji}])2`.
Apply the same fix to unicode set classes, since they use the same
negative-lookaround-plus-`AnyChar` lowering for complements. Add
reduced `RegExp.js` coverage for both `[^\p{Emoji}]` and
`[[^\p{Emoji}]]` in lookbehind, plus the original complex `/gv`
regression.
Repeated simple loops like "a+".repeat(100) compile to a chain of
greedy loop instructions. When one loop failed, the VM only knew how
to give back one character at a time unless the next instruction was a
literal char, so V8's regexp-fallback case ran into the backtrack
limit instead of finding the obvious match.
When a greedy simple loop is followed by more loops for the same
matcher, sum their minimum counts and backtrack far enough to leave the
missing suffix in one step. If that suffix is already available to the
right, still give back one character so the VM makes progress instead
of reusing the same greedy state forever.
The RegExp runtime test now covers the Chromium regexp-fallback case
through exec(), global exec(), and both replace() paths, plus bounded
same-matcher chains where the suffix minimum is partly missing or
already available.
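The chained-loop shape, sketched directly; with suffix-aware backtracking both results come back quickly instead of exhausting the backtrack limit:

```javascript
const chain = new RegExp("a+".repeat(100)); // /a+a+a+.../, 100 greedy loops

const exact = chain.test("a".repeat(100)); // true: each loop keeps one 'a'
const short = chain.test("a".repeat(99));  // false, and quickly
```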
The VM only marked patterns as anchored when the first real instruction
was AssertStart. That missed anchors hidden behind capture setup or a
leading positive lookahead, so patterns like /(^bar)/ and /(?=^bar)\w+/
fell back to whole-subject scanning.
Teach the hint analysis to see through the non-consuming wrappers we
emit ahead of a leading ^, but still run the literal prefilters before
anchored and sticky VM attempts. Missing required literals should stay
cheap no-matches instead of running the full backtracking VM and
raising the step limit.
The RegExp runtime test now covers the Chromium ascii-regexp-subject
case on a long ASCII input and anchored, sticky, and global no-match
cases where the required literal is absent.
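Reduced sketches of anchors hidden behind non-consuming wrappers; all three results follow directly from `^` semantics:

```javascript
const viaCapture = /(^bar)/.test("barbaz");        // true
const viaLookahead = /(?=^bar)\w+/.exec("barbaz"); // matches "barbaz" at 0
const unanchored = /(^bar)/.test("foobar");        // false: ^ holds only at 0
```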
Browsers clamp braced quantifier bounds above 2^31 - 1 before
checking whether {min,max} is in order. The parser still kept values
up to u32::MAX, so patterns like {2147483648,2147483647} were
rejected even though both bounds should collapse to the same limit.
Clamp parsed braced quantifier bounds to 2^31 - 1 as they are read.
This keeps the existing acceptance of huge exact and open-ended
quantifiers and makes the constructor and regex literal paths agree
with other engines on the out-of-order edge cases.
The RegExp runtime and syntax tests now cover accepted huge
quantifiers, clamped order validation, and huge literal forms. The
reported constructor and literal cases also match other engines.
Unicode set intersection and subtraction were compiled by matching one
operand and then checking the others with lookbehind. That let a
longer string operand reject a shorter match whenever the longer
string happened to end at the same position.
Group unicode set operands by exact match length and compile each
length class separately, longest first. This keeps longest-match
semantics for unions while making intersection and subtraction compare
only strings of the same length. The new RegExp runtime cases cover
both the reported [a-z]--\q{abc} regression and the related
intersection/subtraction mismatches, and they now agree with V8.
Reject surrogate pairs in named group names unless both halves come
from the same raw form. A literal surrogate half was being
normalized into \uXXXX before LibRegex parsed the pattern, which let
mixed literal and escaped forms sneak through.
Validate surrogate handling on the UTF-16 pattern before
normalization, but only treat \k<...> as a named backreference when
the parser would do that too. Legacy regexes without named groups
still use \k as an identity escape, so their literal text must not be
rejected by the pre-scan.
Add runtime and syntax tests for the mixed forms, the valid literal,
fixed-width, and braced escape cases, and the legacy \k literals.
Sticky regular expressions were still using the generic forward-search
paths inside LibRegex and only enforcing the lastIndex check back in
LibJS after a match had already been found. That made tokenizer-style
sticky patterns spend most of their time scanning for later matches
that would be thrown away.
Route sticky exec() and test() through an anchored VM entry point that
runs exactly once at the requested start position while keeping the
existing literal-hint pruning. Add focused test-js coverage for sticky
literals, alternations, classes, quantifiers, and WebIDL-style token
patterns.
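A sketch of the sticky semantics the anchored entry point preserves; a sticky regex matches at lastIndex or not at all, so the engine never needs to scan ahead:

```javascript
const re = /\w+/y;
const s = "ab, cd";

re.lastIndex = 4;
const hit = re.exec(s);  // ["cd"], lastIndex advances to 6
const after = re.lastIndex;

re.lastIndex = 2;
const miss = re.exec(s); // null: sticky must not scan ahead to index 4
```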
Intersection and subtraction chains in unicode sets mode were accepting
bare range operands like `[a-z&&b-y]` and `[a-z--[aeiou]]`. V8 rejects
those forms; the range has to stay inside a nested class before it can
participate in a set operation.
Reject `ClassSetOperand::Range` when building `&&` or `--` expressions,
and extend the runtime regexp tests with the reported invalid patterns
plus an escaped-endpoint range case.
Preserve V8's behavior for bare single-astral literals when a unicode
global search starts in the middle of a surrogate pair. We were
snapping that lastIndex back to the pair start unconditionally,
which let /😀/gu and /\u{1F600}/gu match where V8 returns null.
Expose that literal shape from LibRegex to LibJS and add runtime
coverage for the bare literal case alongside a grouped control.
Track whether a pattern can match empty and only skip interior
surrogate positions when the matcher must consume input. This keeps
the unicode candidate scan fast for consuming searches without
dropping valid zero-width matches such as /\B/ and /(?!...)/ between
a surrogate pair's two code units.
Add runtime coverage for both global lastIndex searches and plain
exec() searches on zero-width unicode patterns.
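A reduced sketch of the bare single-astral case described above; starting the search between the surrogate halves yields no match, matching V8:

```javascript
const re = /\u{1F600}/gu;          // bare single-astral literal
re.lastIndex = 1;                  // between the surrogate halves
const midPair = re.exec("\u{1F600}"); // null
```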
When a backward greedy loop backtracked toward the right edge of the
input, the optimized scan for a following Char instruction could stop
making progress at end of input and loop forever. This made patterns
like /(?<=a.?)/ hang on non-matching input.
Keep the greedy built-in class fast path aligned with the regular VM
matcher for non-Unicode regexps. Without this, /\w+/i and /\W+/i
wrongly applied Unicode ignore-case behavior in the optimized loop.
The fast path in RegExp.prototype.test() checked for an own "exec"
property on the instance via storage_has(), but did not detect when
RegExp.prototype.exec itself had been replaced. This meant overriding
exec on the prototype was silently ignored, violating the spec which
requires test() to go through RegExpExec() and thus the overridable
exec method.
Fix this by resolving "exec" via a full prototype chain lookup and
checking whether the result is still the built-in exec, matching the
approach already used in Symbol.replace's fast path.
Detect literal tails that lie on the linear success path and reject
matches early when that literal never appears in the remaining input.
This lets /(a+)+b/ fail quickly on long runs of a instead of spending
its backtracking budget proving that the missing b can never match.
Keep the tail analysis cheap while doing this. The new required-literal
hint reuses trailing-literal extraction, so rewrite the linear-tail
check to compute predecessor counts in one pass instead of rescanning
the whole program for every instruction. That keeps large regex parses,
including the large bytestring constructor test, fast.
Add a regression test for ("a".repeat(25)).match(/(a+)+b/), which
should return null without throwing a backtrack-limit error.
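A reduced form of the regression (15 characters instead of 25 so it stays quick even without the required-literal hint):

```javascript
// The tail literal 'b' never appears, so the match must fail fast
// instead of exhausting the backtracking budget.
const reduced = "a".repeat(15).match(/(a+)+b/); // null, not a thrown error
```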
Patterns like `^(a|a?)+$` build a split tree where each `a` can be
matched by either alternative. On short non-matching inputs that still
blows through the backtrack limit even though the two alternatives are
semantically equivalent to a single greedy optional matcher.
Detect the narrow case of two single-term alternatives that compile to
the same simple matcher, where one is required and the other is greedy
`?`, and compile them as that single optional term instead of emitting
a disjunction.
Add a String.prototype.match regression for
("a".repeat(25) + "b").match(/^(a|a?)+$/), which should return null
instead of throwing an InternalError.
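A reduced form of the regression (12 repetitions instead of 25 to bound the search space):

```javascript
// The trailing 'b' makes $ unmatchable; the equivalent-alternative
// folding keeps this from blowing the backtrack limit.
const outcome = ("a".repeat(12) + "b").match(/^(a|a?)+$/); // null
```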
Backward Char instructions read the code point immediately to the left
of the current position, but the greedy loop backtracking optimization
was scanning for the next literal at the current position itself.
That meant a lookbehind like `(?<=h.*)THIS` never reconsidered the
boundary after `h`, so valid matches were missed.
When the quantified part was allowed to shrink to zero, as in the
reported `(?<!.*q.*?)(?<=h.*)THIS(?=.*!)` pattern, the same
backtracking bug could thrash badly enough to appear hung.
Fix the backward greedy-loop scan to test candidate boundaries against
the code units immediately to their left. Do the same for supplementary
characters by checking the surrogate pair ending at that boundary.
Add String.prototype.match regressions for both the simple greedy
lookbehind and the full reported pattern.
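The two regression shapes, sketched directly; both must find the match at the boundary after `h` rather than missing it or hanging:

```javascript
const simple = "hello THIS!".match(/(?<=h.*)THIS/)[0];                  // "THIS"
const full = "hello THIS!".match(/(?<!.*q.*?)(?<=h.*)THIS(?=.*!)/)[0];  // "THIS"
```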
RepeatMatcher retries a quantified atom with its own captures cleared,
but if an additional greedy iteration matches the empty string the
engine must fall back to the pre-iteration state. The fast VM path was
clearing capture registers after backtracking from ProgressCheck,
which meant the restored state from the previous successful iteration
was immediately wiped out.
That showed up with nested quantified captures like
"xyz123xyz".match(/((123)|(xyz)*)*/), where the final empty expansion
of the outer `*` discarded the last non-empty captures and returned
undefined for groups 1 and 4.
The same area also needs to track each zero-width-capable iteration's
start position explicitly. Initializing that state with ProgressCheck
stored the end of the previous repetition instead, which regressed
patterns like `/(a*)*/` by letting an empty iteration commit `""`
into the capture instead of falling back to the pre-iteration state
with an undefined capture.
Clear captures before backtracking from a rejected empty iteration,
and save iteration starts before entering quantified bodies so
ProgressCheck only decides whether that iteration made progress.
Add regressions for the reported nested quantified capture case and
for `/(a*)*/.exec("b")`, which should leave the capture undefined.
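The `/(a*)*/` case from above, sketched directly; the empty final iteration must be rolled back rather than committing `""` into the capture:

```javascript
const r = /(a*)*/.exec("b");
const whole = r[0]; // ""
const group = r[1]; // undefined: empty iteration rolled back
```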
Snapshot registers for GreedyLoop and LazyLoop backtrack states so
failed alternatives cannot leak capture mutations into an older loop
choice point.
Before this change, those optimized states only restored the input
position and active modifiers. If a later branch changed capture
registers before failing, revisiting an earlier loop state reused
the stale captures instead of the state that was current when the
loop state was pushed.
That let /^(b+|a){1,2}?bc/ on "bbc" produce an invalid group 1 range
with start 2 and end 1, which later tripped UBSan while
RegExp.prototype.exec materialized the match result.
Add a RegExp.prototype.exec regression for this pattern so we keep
the expected ["bbc", "b"] result covered.
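The regression sketched directly; the first iteration's `b+` gives back one 'b' so that "bc" can match, and group 1 must keep a valid range:

```javascript
const r2 = /^(b+|a){1,2}?bc/.exec("bbc");
// r2: ["bbc", "b"]
```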
Switch LibJS `RegExp` over to the Rust-backed `ECMAScriptRegex` APIs.
Route `new RegExp()`, regex literals, and the RegExp builtins through
the new compile and exec APIs, and stop re-validating patterns with the
deleted C++ parser on the way in. Preserve the observable error
behavior by carrying structured compile errors and backtracking-limit
failures across the FFI boundary. Cache compiled regex state and named
capture metadata on `RegExpObject` in the new representation.
Use the new API surface to simplify and speed up the builtin paths too:
share `exec_internal`, cache compiled regex pointers, keep the legacy
RegExp statics lazy, run global replace through batch `find_all`, and
optimize replace, test, split, and String helper paths. Add regression
tests for those JavaScript-visible paths.
Add LibRegex's new Rust ECMAScript regular expression engine.
Replace the old parser's direct pattern-to-bytecode pipeline with a
split architecture: parse patterns into a lossless AST first, then
lower that AST into bytecode for a dedicated backtracking VM. Keep the
syntax tree as the place for validation, analysis, and optimization
instead of teaching every transformation to rewrite partially built
bytecode.
Specialize this backend for the job LibJS actually needs. The old C++
engine shared one generic parser and matcher stack across ECMA-262 and
POSIX modes and supported both byte-string and UTF-16 inputs. The new
engine focuses on ECMA-262 semantics on WTF-16 data, which lets it
model lone surrogates and other JavaScript-specific behavior directly
instead of carrying POSIX and multi-encoding constraints through the
whole implementation.
Fill in the ECMAScript features needed to replace the old engine for
real web workloads: Unicode properties and sets, lookahead and
lookbehind, named groups and backreferences, modifier groups, string
properties, large quantifiers, lone surrogates, and the parser and VM
corner cases those features exercise.
Reshape the runtime around compile-time pattern hints and a hotter VM
loop. Pre-resolve Unicode properties, derive first-character,
character-class, and simple-scan filters, extract safe trailing
literals for anchored patterns, add literal and literal-alternation
fast paths, and keep reusable scratch storage for registers,
backtracking state, and modifier stacks. Teach `find_all` to stay
inside one VM so global searches stop paying setup costs on every
match.
Make those shortcuts semantics-aware instead of merely fast. In Unicode
mode, do not use literal fast paths for lone surrogates, since
ECMA-262 must not let `/\ud83d/u` match inside a surrogate pair.
Likewise, only derive end-anchor suffix hints when the suffix lies on
every path to `Match`, so lookarounds and disjunctions cannot skip into
a shared tail and produce false negatives.
This commit lands the Rust crate, the C++ wrapper, the build
integration, and the initial LibJS-side plumbing needed to exercise
the new engine under real RegExp callers before removing the legacy
backend.
Add scripts for importing the V8 and WebKit regexp suites into
`Tests/LibJS/Runtime/3rdparty/` and commit the imported tests.
Also update the local harness helpers so these suites can run under
LibJS. In particular, teach the assertion shims to compare RegExp values
the way the imported tests expect and to treat `new Function(...)`
throwing as a valid `assertThrows` case.
This gives the regex rewrite a large bank of external conformance tests
that exercise parser and matcher behavior beyond in-tree coverage.
Dictionary shapes are mutable (properties added/removed in-place via
add_property_without_transition), so sharing them between objects via
the NewObject premade shape cache is unsafe.
When a large object literal (>64 properties) is created repeatedly in
a loop, the first execution transitions to a dictionary shape, which
CacheObjectShape then caches. Subsequent iterations create new objects
all pointing to the same dictionary shape. If any of these objects adds
a new property, it mutates the shared shape in-place, increasing its
property_count, but only grows its own named property storage. Other
objects sharing the shape are left with undersized storage, leading to
a heap-buffer-overflow when the GC visits their edges.
Fix this by not caching dictionary shapes. This means object literals
with >64 properties won't get the premade-shape fast path, but such
literals are uncommon.
Noticed this pattern when reading some minified JS while debugging a
seemingly unrelated problem and immediately got suspicious because of my
earlier, similar fixes.
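A reduced sketch of the hazardous pattern (the property names p0..p69 and "extra" are hypothetical); a >64-property literal evaluated at one site in a loop, with each instance then growing by one property:

```javascript
// Build a 70-property object literal source string once.
const literalSource =
  "({" + Array.from({ length: 70 }, (_, i) => "p" + i + ": " + i).join(",") + "})";

const objects = [];
for (let i = 0; i < 3; i++) {
  const o = eval(literalSource); // same NewObject site each iteration
  o["extra" + i] = i;           // must not corrupt the siblings' storage
  objects.push(o);
}
```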