invalidate_all_prototype_chains_leading_to_this used to scan every
prototype shape in the realm and walk each one's chain looking for
the mutated shape. That was O(N_prototype_shapes x chain_depth) per
mutation and showed up hot in real profiles when a page churned a
lot of prototype state during startup.
Each prototype shape now keeps a weak list of the prototype shapes
whose immediate [[Prototype]] points at the object that owns this
shape. The list is registered on prototype-shape creation
(clone_for_prototype, set_prototype_shape) and migrated to the new
prototype shape when the owning prototype object transitions to a
new shape. Invalidation is then a recursive walk over this direct-
child registry, costing O(transitive descendants).
Saves ~300 ms of main thread time when loading https://youtube.com/
on my Linux machine. :^)
No other engine defines this function, so it is an observable difference
of our engine. This traces back to the earliest days of LibJS.
We now define `gc` in just the test-js and test262 runners.
indexed_take_first() already memmoves elements down for both Packed and
Holey storage, but the caller at ArrayPrototype::shift() only entered
the fast path for Packed arrays. Holey arrays fell through to the
spec-literal per-element loop (has_property / get / set /
delete_property_or_throw), which is substantially slower.
Add a separate Holey predicate with the additional safety checks the
spec semantics require: default_prototype_chain_intact() (so
HasProperty on a hole doesn't escape to a poisoned prototype) and
extensible() (so set() on a hole slot doesn't create a new own
property on a non-extensible object). The existing Packed predicate
is left unchanged -- packed arrays don't need these checks because
every index in [0, size) is already an own data property.
Allows us to fail at Cloudflare Turnstile way much faster!
The element-by-element loop compiled to scalar 8-byte moves that the
compiler could not vectorize: source and destination alias, and strict
aliasing prevented hoisting the m_indexed_elements pointer load out of
the loop body. memmove collapses the shift into a single vectorized
copy.
Previously it used `realm.[[GlobalObject]]` instead of
`realm.[[GlobalEnv]].[[GlobalThisValue]]`.
In LibWeb, that corresponds to Window and WindowProxy respectively.
Shape::visit_edges used to walk every entry of m_property_table and
call PropertyKey::visit_edges on each key. For non-dictionary shapes
that work is redundant: m_property_table is a lazily built cache of
the transition chain, and every key it contains was originally
inserted as some ancestor shape's m_property_key, which is already
kept alive via m_previous.
Intrinsic shapes populated through add_property_without_transition()
in Intrinsics.cpp are not dictionaries and have no m_previous to
reach their keys through, but each of those keys is either a
vm.names.* string or a well-known symbol and is strongly rooted by
the VM for its whole lifetime, so skipping them here is safe too.
Measured on the main WebWorker used by https://www.maptiler.com/maps/
this cuts out ~98% of the PropertyKey::visit_edges calls made by
Shape::visit_edges each GC, reducing time spent in GC by ~1.3 seconds
on my Linux PC while initially loading the map.
Carry full source positions through the Rust bytecode source map so
stack traces and other bytecode-backed source lookups can use them
directly.
This keeps exception-heavy paths from reconstructing line and column
information through SourceCode::range_from_offsets(), which can spend a
lot of time building SourceCode's position cache on first use.
We're trading some space for time here, but I believe it's worth it at
this tag, as this saves ~250ms of main thread time while loading
https://x.com/ on my Linux machine. :^)
Reading the stored Position out of the source map directly also exposed
two things masked by the old range_from_offsets() path: a latent
off-by-one in Lexer::new_at_offset() (its consume() bumped line_column
past the character at offset; only synthesize_binding_pattern() hit it),
and a (1,1) fallback in range_from_offsets() that fired whenever the
queried range reached EOF. Fix the lexer, then rebaseline both the
bytecode dump tests (no more spurious "1:1") and the destructuring AST
tests (binding-pattern identifiers now report their real columns).
CompareArrayElements was calling ToString(x) +
PrimitiveString::create(vm, ...) on every comparison, producing a
fresh PrimitiveString that wrapped the original's AK::String but
carried no cached UTF-16. The subsequent IsLessThan then hit
PrimitiveString::utf16_string_view() on that fresh object, which
re-ran simdutf UTF-8 validation + UTF-8 -> UTF-16 conversion for
both sides on every one of the N log N comparisons.
When x and y are already String Values, ToString(x) and
ToPrimitive(x, Number) are the identity per spec, so we can drop
the IsLessThan detour entirely and compare their Utf16Views
directly. The original PrimitiveString caches its UTF-16 on first
access, so subsequent comparisons against the same element hit
the cache; Utf16View::operator<=> additionally gives us a memcmp
fast path when both sides ended up with short-ASCII UTF-16 storage.
Microbenchmark:
```js
function makeStrings(n) {
let seed = 1234567;
const rand = () => {
seed = (seed * 1103515245 + 12345) & 0x7fffffff;
return seed;
};
const out = new Array(n);
for (let i = 0; i < n; i++)
out[i] = "item_" + rand().toString(36)
+ "_" + rand().toString(36);
return out;
}
const base = makeStrings(100000);
const arr = base.slice();
arr.sort();
```
```
n before after speedup
1k 0.70ms 0.30ms 2.3x
10k 8.33ms 3.33ms 2.5x
50k 49.33ms 17.33ms 2.8x
100k 118.00ms 45.00ms 2.6x
```
Generator::allocate_register used to scan the free pool to find the
lowest-numbered register and then Vec::remove it, making every
allocation O(n) in the size of the pool. When loading https://x.com/
on my Linux machine, we spent ~800ms in this function alone!
This logic only existed to match the C++ register allocation ordering
while transitioning from C++ to Rust in the LibJS compiler, so now
we can simply get rid of it and make it instant. :^)
So drop the "always hand out the lowest-numbered free register" policy
and use the pool as a plain LIFO stack. Pushing and popping the back
of the Vec are both O(1), and peak register usage is unchanged since
the policy only affects which specific register gets reused, not how
aggressively.
GetCalleeAndThisFromEnvironment treated a binding as initialized when
its value slot was not <empty>. Declarative bindings do not encode TDZ
in that slot, though: uninitialized bindings keep a separate initialized
flag and their value starts as undefined.
That let the first slow-path TDZ failure populate the environment cache,
then a second call at the same site reused the cached coordinate and
turned the required ReferenceError into a TypeError from calling
undefined.
Check Binding.initialized in the asm fast path instead and cover the
cached second-hit case with a regression test.
WebAssembly.Memory({shared:true}).grow() reallocates the underlying
AK::ByteBuffer outline (kmalloc+kfree) but, per the threads proposal,
must not detach the associated SharedArrayBuffer.
ArrayBuffer::detach_buffer was the only path that walked m_cached_views
and cleared the cached raw m_data pointer on each TypedArrayBase, so
every existing view retained a dangling pointer into the freed outline.
The AsmInterpreter GetByValue / PutByValue fast paths dereference that
cached pointer directly, yielding a use-after-free triggerable from
JavaScript.
Add ArrayBuffer::refresh_cached_typed_array_view_data_pointers() which
re-derives m_data for each registered view from the current outline
base (and refreshes UnownedFixedLengthByteBuffer::size), and call it
from Memory::refresh_the_memory_buffer on the SAB-fixed-length path
where detach is spec-forbidden.
The asm interpreter already inlines ECMAScript calls, but builtin calls
still went through the generic C++ Call slow path even when the callee
was a plain native function pointer. That added an avoidable boundary
around hot builtin calls and kept asm from taking full advantage of the
new RawNativeFunction representation.
Teach the asm Call handler to recognize RawNativeFunction, allocate the
callee frame on the interpreter stack, copy the call-site arguments,
and jump straight to the stored C++ entry point.
NativeJavaScriptBackedFunction and other non-raw callees keep falling
through to the existing C++ slow path unchanged.
Mapped and unmapped arguments objects already use dedicated premade
shapes. Track their [[ParameterMap]] internal slot there instead of
setting an Object flag after construction.
This keeps the information on the shared shape, preserves it through
shape transitions, and still lets Object.prototype.toString()
recognize arguments objects without per-instance bookkeeping.
NativeFunction previously stored an AK::Function for every builtin,
even when the callable was just a plain C++ entry point. That mixed
together two different representations, made simple builtins carry
capture storage they did not need, and forced the GC to treat every
native function as if it might contain captured JS values.
Introduce RawNativeFunction for plain NativeFunctionPointer callees
and keep AK::Function-backed callables on a CapturingNativeFunction
subclass. Update the straightforward native registrations in LibJS
and LibWeb to use the raw representation, while leaving exported
Wasm functions on the capturing path because they still capture
state.
Wrap UniversalGlobalScope's byte-length strategy lambda in
Function<...> explicitly so it keeps selecting the capturing
NativeFunction::create overload.
First: We now pin the icu4x version to an exact number. Minor version
upgrades can result in noisy deprecation warnings and API changes which
cause tests to fail. So let's pin the known-good version exactly.
This patch updates our Rust calendar module to use the new APIs. This
initially caused some test failures due to the new Date::try_new API
(which is the recommended replacement for Date::try_new_from_codes)
having quite a limited year range of +/-9999. So we must use other
APIs (Date::try_from_fields and calendrical_calculations::gregorian)
to avoid these limits.
http://github.com/unicode-org/icu4x/blob/main/CHANGELOG.md#icu4x-22
Place Realm's cached declarative environment next to its global object
so the asm global access fast paths can fetch the two pointers with a
paired load. These handlers never use the intervening GlobalEnvironment
pointer directly.
Mirror Executable's constants size and data pointer in adjacent fields
so the asm Call fast path can pair-load them together. The underlying
Vector layout keeps size and data apart, so a small cached raw span
lets the hot constant-copy loop fetch both pieces of metadata at once.
Load PropertyNameIterator's indexed-property count and next index
together when stepping the fast path. Keeping the paired count live
into the named-property case also avoids reloading it before computing
the flattened index.
Load PropertyNameIterator's cached property cache and shape snapshot
together before validating the receiver shape. The two fields already
sit adjacent in the object layout, so the fast path can fetch both
without any extra reshuffling.
Load EnvironmentCoordinate::hops and ::index together in the asm
environment-walk helper. The pair-load keeps the DSL explicit about
which two fields travel together and removes another scalar metadata
fetch from the fast path.
Load the cached property offset and dictionary generation with paired
loads in the property inline-cache fast paths. AsmIntGen now verifies
these reads against the actual cache layout, so the DSL keeps both
fields named and self-documenting.
Load the cached shape and prototype pointer together in the property
inline-cache fast paths that already read both. This keeps the
cache-entry metadata fetches aligned with the DSL's paired-load model
without changing the surrounding control flow.
Load the inline frame's return pc and destination register at once when
Return or End resumes an asm-managed caller. This keeps the unwind
metadata with the helper that consumes it and removes a separate scalar
load from both handlers.
The asm Call fast path was still reloading the executable pointer while
building the inline callee frame, even though it had already loaded the
same pointer while validating the call target.
Carry that executable pointer through frame setup and reload the passed
argument count from the call bytecode instead of the fresh frame header.
This trims a couple more loads from the hot path.
Pack the asm Call fast path metadata next to the executable pointer
so the interpreter can fetch both values with one paired load. This
removes several dependent shared-data loads from the hot path.
Keep the executable pointer and packed metadata in separate registers
through this binding so the fast path can still use the paired-load
layout after any non-strict this adjustment.
Lower the packed metadata flag checks correctly on x86_64 as well.
Those bits now live above bit 31, so the generator uses bt for single-
bit high masks and covers that path with a unit test.
Add a runtime test that exercises both object and global this binding
through the asm Call fast path.
Executable already caches the combined registers, locals, and constants
count that the asm Call fast path needs for inline frame allocation.
Use that precomputed total instead of rebuilding it from the registers
count and constants vector size in the hot path.
The asm Call fast path already checks SharedFunctionInstanceData's
cached can_inline_call bit before touching the executable pointer.
That cache is only true for ordinary functions with compiled bytecode,
so the extra executable null check is redundant work in the hot path.
The asm Call fast path reads InterpreterStack::m_top and m_limit
back-to-back while checking whether the inline callee frame fits.
Those fields are adjacent, so we can load them together with one
paired load and keep the stack-size check otherwise unchanged.
Teach the asm Call fast path to use paired stores for the fixed
ExecutionContext header writes and for the caller linkage fields.
This also initializes the five reserved Value slots directly instead
of looping over them as part of the general register clear path.
That keeps the hot frame setup work closer to the actual data layout:
reserved registers are seeded with a couple of fixed stores, while the
remaining register and local slots are cleared in wider chunks.
On x86_64, keep the new explicit-offset formatting on store_pair*
and load_pair* without changing ordinary [base, index, scale]
operands into base-plus-index-plus-offset addresses. Add unit
tests covering both the paired zero-offset form and the preserved
scaled-index lowering.
Teach AsmIntGen about store_pair32 and store_pair64 so hot handlers
can describe adjacent writes just as explicitly as adjacent reads.
The DSL now requires naming both memory operands and rejects
non-adjacent or reordered pairs at code generation time.
On aarch64 the new instructions lower to stp when the address is
encodable, while x86_64 keeps the same semantics with two scalar
stores. The shared validation keeps the paired access rules consistent
across both load and store primitives.
Use the new paired-load DSL operations in the inline Call path for the
adjacent environment, ScriptOrModule, caller metadata, and callee-entry
loads. The flow stays the same, but the hot call setup now needs fewer
scalar memory operations on aarch64.
Add load_pair32 and load_pair64 to the AsmInt DSL and make the
generator verify that both named memory operands are truly adjacent.
That keeps paired loads self-documenting in the DSL instead of
hiding the second field behind an implicit adjacency assumption.
AArch64 now lowers valid pairs to ldp when the address form allows
it, while x86_64 keeps the same behavior with two obvious scalar loads.
Add unit tests for the shared validator so reversed or non-adjacent
field pairs are rejected during code generation.
Handle inline-eligible JS-to-JS Call directly in asmint.asm instead
of routing the whole operation through AsmInterpreter.cpp.
The asm handler now validates the callee, binds `this` for the
non-allocating cases, reserves the callee InterpreterStack frame,
populates the ExecutionContext header and Value tail, and enters the
callee bytecode at pc 0.
Keep the cases that need NewFunctionEnvironment() or sloppy `this`
boxing on a narrow helper that still builds an inline frame. This
preserves the existing inline-call semantics for promise-job ordering,
receiver binding, and sloppy global-this handling while keeping the
common path in assembly.
Add regression coverage for closure-capturing callees, sloppy
primitive receivers, and sloppy undefined receivers.
Emit the ExecutionContext, function-object, executable, and realm
offsets that the asm Call path needs to inspect and initialize
directly when building inline frames.
Teach the AArch64 AsmInt generator to materialize immediates through
w-register writes when the upper 32 bits are known zero.
That keeps the same x-register value while letting common constants
use shorter instruction sequences.
Teach the AArch64 AsmInt generator to lower zero-immediate stores
through xzr or wzr instead of materializing a temporary register.
This covers store64 as well as the narrow store8, store16, and
store32 forms, keeping the generated code shorter on the zero
store fast path.
AsmIntGen already lowers branch_zero and branch_nonzero to the compact
AArch64 branch-on-bit forms when possible, but branch_bits_set and
branch_bits_clear still expanded single-bit immediates into tst plus a
separate conditional branch.
Teach the AArch64 backend to recognize power-of-two masks and emit
tbnz or tbz directly. This shortens several hot interpreter paths.
Handle Return and End entirely in AsmInt when leaving an inline frame.
The handlers now restore the caller, update the interpreter stack
bookkeeping directly, and bump the execution generation without
bouncing through AsmInterpreter.cpp.
Add WeakRef tests that exercise both inline Return and inline End
so this path stays covered.
Add load_vm, memory-operand macro substitution, and a generic
inc32_mem instruction to the AsmInt DSL.
Also drop redundant mov reg, reg copies in the backends so handlers
that use the new helpers expand to cleaner assembly.
Store whether a function can participate in JS-to-JS inline calls on
SharedFunctionInstanceData instead of recomputing the function kind,
class-constructor bit, and bytecode availability at each fast-path
call site.
The callee and this-value preservation copies only matter while later
argument expressions are still being evaluated. For zero-argument calls
there is nothing left to clobber them, so we can keep the original
operand and let the interpreter load it directly.
This removes the hot Mov arg0->reg pattern from zero-argument local
calls and reduces register pressure.
Teach the Rust bytecode generator to treat the synthetic entry
GetLexicalEnvironment as a removable prologue load.
We still model reg4 as the saved entry lexical environment during
codegen, but assemble() now deletes that load when no emitted
instruction refers to the saved environment register. This keeps the
semantics of unwinding and environment restoration intact while letting
empty functions and other simple bodies start at their first real
instruction.
Keep JS-to-JS inline calls out of m_execution_context_stack and walk
the active stack from the running execution context instead. Base
pushes now record the previous running context so duplicate
TemporaryExecutionContext pushes and host re-entry still restore
correctly.
This keeps the fast JS-to-JS path off the vector without losing GC
root collection, stack traces, or helpers that need to inspect the
active execution context chain.
The bytecode interpreter only needed the running execution context,
but still threaded a separate Interpreter object through both the C++
and asm entry points. Move that state and the bytecode execution
helpers onto VM instead, and teach the asm generator and slow paths to
use VM directly.