Skip the generic HasProperty and Get loop when indexOf operates on a
simple packed array. In that case every index below length is an own
data property, so a direct scan of the packed indexed property storage
gives the same strict-equality result without the per-element property
lookup ceremony.
Only use the fast path when the current packed storage size still
matches the length captured before fromIndex coercion, since that
coercion can run user code and mutate the receiver. Add coverage for
length and storage mutations during fromIndex coercion.
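A sketch of the hazard that guard covers (the `evilFromIndex` name is
illustrative):

```js
// fromIndex coercion can run user code that mutates the receiver, so
// the fast path must re-check the storage size afterwards.
const arr = [1, 2, 3, 4];
const evilFromIndex = {
    valueOf() {
        arr.length = 2; // shrink the packed storage mid-call
        return 0;
    },
};
// Per spec, length (4) is captured before fromIndex coercion, but
// indices 2 and 3 no longer exist, so HasProperty filters them out.
console.log(arr.indexOf(3, evilFromIndex)); // -1
```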
The parser only set `might_need_arguments_object` when an `arguments`
or `eval` Identifier went through `consume()`, but shorthand object
properties create the reference via `make_identifier()` directly. As
a result `function f() { return { arguments } }` allocated an
`arguments` local, never initialized it, and crashed at runtime when
the property was read.
Fall back to scope-driven detection: if scope analysis allocated a
non-lexical `arguments` local for the function, treat it as a real
arguments-object reference and emit `CreateArguments`. Skip the
fallback when a function declaration named `arguments` claims the
local, since that local belongs to the function, not the arguments
object.
Add a runtime test covering shorthand inside a free function and a
method, plus a regression test for `({ eval } = ...)` to confirm
destructuring assignment doesn't accidentally trigger arguments
materialization.
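The shape of that coverage, sketched (the surrounding assertions are
assumptions):

```js
// Shorthand `arguments` in a free function must allocate and
// initialize the arguments object instead of crashing:
function f() {
    return { arguments };
}
console.log(f(1, 2).arguments.length); // 2

// ...and likewise inside a method:
const o = {
    m() {
        return { arguments };
    },
};
console.log(o.m("x").arguments[0]); // "x"

// Regression shape: a shorthand destructuring assignment target named
// `eval` is a write, not a read, and must not trigger materialization:
function h() {
    ({ eval } = { eval: globalThis.eval }); // sloppy-mode no-op assignment
}
h();
```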
The scope collector stored identifier_groups and variables in
HashMaps and then sorted them alphabetically before assigning local
register indices. The sorts existed only because HashMap iteration
order is non-deterministic; alphabetical was a stable choice for
comparing bytecode against the now-removed C++ port.
Switch both maps to indexmap::IndexMap so iteration follows the order
of first reference (= source order), and drop the alphabetical sorts.
Local indices now reflect declaration order, which matches what shows
up in bytecode dumps and is easier to read alongside the source.
Add a focused bytecode test using zebra/yak/aardvark to pin the new
allocation order; existing tests using let/var declarations have
their local indices renumbered to match.
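For example, mirroring the test's zebra/yak/aardvark source:

```js
// Source-order allocation gives zebra local 0, yak local 1, and
// aardvark local 2; the old alphabetical sort gave aardvark 0,
// yak 1, zebra 2.
let zebra = 1;
let yak = 2;
let aardvark = 3;
```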
ECMAScript hoisting keeps the LAST function declaration with a given
name. The Rust scope_collector and the script GlobalDeclarationInstantiation
(GDI) extraction implemented this with a single reverse scan that
pushed first-seen entries, which left the resulting list in REVERSE
source order. The C++ side then iterated
`m_functions_to_initialize.in_reverse()` to undo that.
Switch the Rust side to a two-pass forward scan that records the last
position per name and emits entries in source order, and drop the
matching `.in_reverse()` calls in Script.cpp and AbstractOperations.cpp.
Same hoisting semantics; NewFunction emission and global property
iteration order now follow the source.
The HashMap that tracks last positions is keyed on `SharedUtf16String`,
so each insert is a refcount bump on the AST's existing Rc instead of
a deep `Vec<u16>` clone.
Add bytecode tests at script and nested-function scope that exercise
multiple declarations and a duplicate name to pin the new ordering.
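One example of the unchanged hoisting semantics:

```js
// The LAST declaration for a name still wins; only the emission
// order changes, not which function is kept.
function dup() { return "first"; }
function dup() { return "second"; }
console.log(dup()); // "second"
```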
The Rust parser used to copy several "rule_start"-derived positions
from the C++ implementation: every identifier inside a binding pattern
inherited the pattern's `[`/`{` position, every property identifier
after `.` inherited the period's position, every spread element
inherited the surrounding `[`/`{` position, and identifier-name
property keys inherited the object/class start position. This was
useful while comparing bytecode against the C++ port; with the C++
side gone, those quirks just hide the actual source positions in
source maps and devtools.
Drop the dedicated `binding_pattern_start` parser field and the
`ident_pos_override` parameter on `parse_property_key`, and capture
each identifier's own start position at the consume site.
Add an AST snapshot test that pins the new per-identifier positions
for object, array, nested, and parameter binding patterns.
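Illustrative patterns of the kind the snapshot pins (each identifier
now reports its own start position, not the pattern's opener):

```js
const obj = { a: 1, b: { c: 2 } }, arr = [3, 4];
const { a, b: { c } } = obj;   // a and c get their own columns
const [d, ...rest] = arr;      // d and rest no longer inherit '['
function g({ e }, [f]) {}      // parameter patterns likewise
```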
The parser used to suppress the arguments/eval reference check via a
state flag that was set during the entire `parse_property_key` call.
That was over-broad: identifiers inside a computed property key like
`{ [arguments]: 1 }` are real references, but the flag silenced their
check too, leaving the function unmarked as needing the arguments
object. Reading the resulting property at runtime crashed.
Replace the flag with a `consume_property_key_token()` method used at
the specific consume sites for the property key token itself, so the
suppression is narrow. Inner consumes inside computed keys now go
through regular `consume()` and run the check normally.
Add a focused AST snapshot test covering plain, shorthand, computed,
binding-pattern, and method-name property-key cases.
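A sketch of the two behaviors the narrow suppression separates:

```js
// Inside a computed key the identifier is a real reference, so this
// must mark f as needing the arguments object:
function f() {
    return { [arguments[0]]: 1 };
}
console.log(f("key").key); // 1

// A plain property key named `arguments` is just a name and must
// stay suppressed:
function g() {
    return { arguments: 1 };
}
console.log(g().arguments); // 1
```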
Mark direct calls to function expressions while generating top-level
Rust bytecode, then compile those functions before returning the
off-thread compilation result to WebContent.
The main thread still performs all VM and GC-backed materialization. It
now receives an already assembled executable for each eager IIFE and
attaches it to the SharedFunctionInstanceData while creating the parent
Executable. Nested functions owned by the eager executable remain lazy.
This targets large wrapper IIFEs that are invoked as soon as top-level
code starts running. Their bytecode generation now runs on the existing
script compilation worker instead of blocking the main thread on first
call.
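The shape of code this targets, roughly:

```js
// Before: generating bytecode for the whole wrapper body blocked the
// main thread on the first (immediate) call. Now the wrapper is
// compiled on the script compilation worker; `inner` stays lazy.
(function wrapper() {
    function inner() {
        // nested functions owned by the eager executable remain lazy
    }
    inner();
})();
```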
Split Rust program compilation so code generation and assembly finish
before the main thread materializes GC-backed executable objects. The
new CompiledProgram handle owns the parsed program, generator state, and
bytecode until C++ consumes it on the main thread.
Wire WebContent script fetching through that handle for classic scripts
and modules. Syntax-error paths still return ParsedProgram, so existing
error reporting stays in place. Successful fetches now do top-level
codegen on the thread pool before deferred_invoke hands control back to
the main thread.
Executable creation, SharedFunctionInstanceData materialization, module
metadata extraction, and declaration data extraction still run on the
main thread where VM and GC access is valid.
Rust bytecode generation still reached into the VM to encode well-known
symbols and intrinsic abstract-operation functions as raw JS::Value
constants. That is not compatible with running top-level code generation
away from the main thread.
Keep those constants symbolic in the Rust constant pool instead. The C++
Executable materialization step now resolves them into real VM values
while it is already decoding the rest of the constant table on the main
thread.
This removes another VM dependency from Rust bytecode emission without
changing when the resulting constants become visible to the bytecode
interpreter.
Rust bytecode generation currently creates SharedFunctionInstanceData
and ClassBlueprint GC objects as soon as nested functions and classes
are encountered. That keeps the whole code generation phase tied to
the main-thread VM and heap.
Record pending descriptors on the Generator instead, then materialize
those descriptors while creating the C++ Executable. This keeps the GC
allocation boundary exactly where it already belongs, but removes the
last direct function-data allocations from the codegen walk.
This is a preparatory step for compiling top-level bytecode off-thread
and only doing C++ materialization after returning to the main thread.
FunctionTable::extract_reachable() used to rediscover a function's
nested functions by walking the full body and parameter list during
bytecode generation. This was hot during page loading: creating every
lazy SharedFunctionInstanceData paid for an extra structural AST
traversal.
Record each parser-created function's direct child function ids while
parsing instead. Extraction can then recursively move that known
subtree without scanning the enclosing function again.
Keep the old structural scan for codegen-synthesized wrappers, such as
class field initializers, where no parser function context exists.
This preserves the sparse FunctionTable storage while making the common
extraction path proportional to the nested function count.
The named 3-operand call_helper form lets the allocator see the
helper input and output as distinct temporaries. On aarch64 those
values do not overlap: the input dies at the call boundary, and the
output is born from the return value.
Pin both temporaries to x0, which is both the first AAPCS64 argument
register and the return register. This lets the aarch64 codegen omit
the old `mov x0, x1` bridge before named call_helper uses, while
leaving the legacy 1-operand convention alone.
Add an allocator test for the aarch64 pinning so the calling-convention
intent stays explicit.
The greedy register allocator already preferred low-numbered GPRs by
virtue of the pool being listed in cost order, but the policy was
implicit and the temp-processing order was driven purely by live-range
length. That meant a long-lived but rarely-referenced temp could grab
`rax` and push a hot temp into a more expensive register.
Make the encoding-cost policy explicit and route hot temps to cheap
registers:
* `register_cost` in registers.rs scores each physical register by
encoding cost. `rax`/`eax` is cheapest because of the accumulator-
form short encodings (`add eax, imm32` is one byte shorter than
`add r/m32, imm32`). The other classic GPRs avoid the REX prefix
in 32-bit forms. `r8`..`r15` always need a REX extension byte.
aarch64 is cost-uniform.
* The allocator now sorts named temps by use count first (so hot
temps grab cheap registers), with live-range length as a
tiebreaker. Greedy graph coloring is order-sensitive, and the
`Call` handler -- which packs 51 temps into 9 registers -- only
colors in the old live-range-first order. So we try the use-count
order first and fall back to the live-range order when the first
attempt cannot color.
* Register selection now picks the cheapest available register
explicitly via `min_by_key(register_cost)` instead of relying on
pool position. Pool order breaks ties for determinism.
asmint.asm no longer references any positional temp register name --
every handler and macro declares its temporaries by name with `temp` /
`ftemp` and lets the register allocator place them. Migrate the last
two macros holding out:
* dispatch_current uses a macro-local `opcode` temp for the load8 +
indirect jmp.
* pop_inline_frame_and_resume names its return-pc, dst-index, value-
address, vm-pointer, and executable temps explicitly.
With nothing left referring to the positional aliases, drop the
tN / ftN -> physical-register fallback from registers::resolve_register
and update the DSL reference comments at the top of asmint.asm and in
main.rs to describe the named-temp model. The two pre-existing codegen
tests that probed the old positional behavior get rewritten to use the
post-allocation physical-register names directly, since that is now
the actual contract of resolve_op.
The Call handler is the asm interpreter's hot path for inline JS-to-JS
dispatch, and it carries roughly fifty distinct named values across
its body (callee object, packed metadata word, formal/passed/total
counts, two stack-side pointers, the argument-copy cursor, the
script-or-module pair-load scratch, the native return value and
variant tag, ...).
The native-exception path adds a small `mov helper_arg, native_return`
bridge between call_raw_native (output pinned to rax by
fixed_operands) and call_helper (input pinned to rcx by
fixed_operands), since the two ABIs land their values in different
registers and the explicit mov is the cleanest way to express the
hand-off.
Convert the for-in iteration fast path to named temps. The handler
threads ~25 distinct values through its body (the iterator value,
its tag, the unboxed iterator, the fast-path discriminator, the
property cache pointer, the cached and current shapes, the receiver,
the dictionary-generation pair, the indexed-count and next-indexed
counters, the named-index, the named-key data pointer, and a few
scratch slots), so the migrated form is much easier to read than the
positional spaghetti where t0..t8 each meant five different things in
different basic blocks.
Convert the array / typed-array load fast paths in GetByValue, the
mirror of PutByValue. Roughly the same shape: ~18 named GPR temps
and a slot_dbl FPR temp covering the loaded value, the bytes-or-int
raw integer view, the address scratch, and the negative-zero compare.
The .ta_float32 / .ta_float64 paths become readable: where the
positional code had "Exclude negative zero early (t1 gets clobbered
by double_to_int32)" because t1 had to do double duty, the migrated
form names slot, slot_dbl, neg_zero, and raw separately and the
allocator lays them out without the implicit clobber dance.
Convert the array / typed-array store fast paths in PutByValue --
the largest single handler so far at 150+ positional t-uses -- to
declare its 19 GPR temps and 1 FPR temp by name (kind, base, prop,
index, obj, flags, storage_kind, size, elements, src, capacity_addr,
capacity, slot, empty_tag, kind_byte, addr, src_int32, max, result,
src_dbl, plus base_tag/prop_tag scratches reused across paths).
The migrated body removes the "save in t0 before load_operand
clobbers it" / "compute store address before check_is_double clobbers
t4" maneuvers that the positional code needed -- the allocator
handles those constraints automatically.
Convert the load_primitive_string_utf16_code_unit macro to take its
inputs (string pointer, index) and output (code_unit) as explicit
parameters, plus migrate CallBuiltinStringFromCharCode,
CallBuiltinStringPrototypeCharCodeAt, and
CallBuiltinStringPrototypeCharAt.
The macro previously hard-coded inputs to t2/t4 and output to t0,
which forced callers to remember those slot assignments and to wrap
the call_helper interactions around them. The migrated form -- where
the caller names what it wants in each slot -- removes another
"clobbers t3, t5" comment.
Convert the global variable IC and the array-length fast path to
named temps. These three handlers are large -- GetGlobal and
SetGlobal carry ~20 named values across each path (realm pointer,
global object, declarative environment, cache pointer, two serial
numbers, the shape, the cached shape, the dictionary generations,
the property offset, the named-properties pointer, the loaded value
and tag, plus environment-binding state) -- and the migration makes
those values easy to read instead of having to remember which slot
each piece of state was occupying.
GetLength's "magical" int-to-double widening picks up a named
temp for the sign-bit check and the result, which removes another
"NB: load_operand clobbers t0" comment.
Convert the inline-cache fast paths for property load/store to named
temps. Both handlers carry a substantial amount of state across the
ic-hit check (the boxed base value, its tag, the unboxed Object*,
the shape, the PLC pointer, the cached shape and prototype, the
property offset, the current and cached dictionary generations, the
named-properties pointer, and the loaded value), so giving each
piece an explicit name removes a real readability burden. The IC-miss
path uses the new 3-operand call_interp form.
Notably, PutById's old code carried a "save property offset in t4
before load_operand clobbers t0 (rax)" maneuver -- that is exactly
the kind of cross-cut the allocator handles automatically once the
offset is named. The migrated handler just keeps using `prop_offset`
across the load_operand and the store, no scratch shuffle needed.
Convert the validate_callee_builtin macro and the Math.abs / floor /
ceil / sqrt / exp and ResolveThisBinding handlers to named DSL
temporaries. Math.exp uses the 3-operand call_helper form; the
allocator pins its `arg` to t1 (rcx) and `result` to t0 (rax)
automatically.
Convert the Div and Mod handlers to named DSL temporaries. Mod is the
first migrated handler that uses divmod, with its quot/rem pinned to
rax/rdx by fixed_operands and its dividend/divisor kept off both by
the operand-vs-implicit-output interference rule.
Convert the walk_env_chain macro to take its outputs (target_env,
bind_index) as explicit parameters, then migrate every handler that
walks the environment chain: GetBinding, GetInitializedBinding,
InitializeLexicalBinding, SetLexicalBinding, and
GetCalleeAndThisFromEnvironment.
The macro's internal scratch (the EnvironmentCoordinate address, the
hops counter, the invalid-coord sentinel, and the screwed-by-eval
flag) is now declared as macro-local temps that the allocator places
without touching whatever the caller has live. Each handler picks
its own names for env / idx / binding / value, instead of remembering
that the macro happens to leave them in t3 and t2.
Convert coerce_to_int32s and bitwise_op to take operand temps as
explicit parameters, then migrate UnaryPlus.
The bitwise_op macro is the actual body of BitwiseAnd, BitwiseOr, and
BitwiseXor (which just call into it), so a single macro migration
covers all three handlers. UnaryPlus is migrated alongside since it
shares the same coerce-to-int32 conceptual path.
Convert the equality dispatch chain to take operand temps as explicit
parameters: equality_same_tag, double_equality_compare,
strict_equality_core, and loose_equality_core. Then migrate the
StrictlyEquals, StrictlyInequals, LooselyEquals, LooselyInequals,
JumpStrictlyEquals, JumpStrictlyInequals, JumpLooselyEquals, and
JumpLooselyInequals handlers.
These macros are tightly coupled -- strict_equality_core and
loose_equality_core both invoke equality_same_tag and reach
double_equality_compare via .double_compare, a label that crosses
macro boundaries. The existing label-uniquification rule (only
self-contained labels are renamed) keeps that contract working
without further plumbing.
Convert the heavy comparison and arithmetic macros to take their
operand temps as explicit parameters, then migrate every handler that
uses them: coerce_to_doubles, numeric_compare, numeric_compare_coerce,
boolean_result_epilogue, jump_binary_epilogue, plus the Add, Sub, Mul,
LessThan, LessThanEquals, GreaterThan, GreaterThanEquals,
JumpLessThan, JumpGreaterThan, JumpLessThanEquals, and
JumpGreaterThanEquals handlers.
Convert check_is_double, check_both_double, and box_double_or_int32 to
declare their internal scratch as macro-local `temp` decls instead of
hardcoding t3/t4. Each invocation now gets a freshly uniquified temp
that the allocator places independently of any other live values in
the caller, so adding a `temp` declaration around a sequence that
calls into one of these macros no longer needs to know which
positional slot the macro happened to be using.
box_double_or_int32 in particular is on the hot arithmetic path
(Add/Sub/Mul/PostfixIncrement/etc.), so this opens the door to
migrating those handlers without conflicting with the macro's
internal scratch use.
Convert another batch of macro-free handlers to named DSL temporaries.
The shift handlers in particular are nice exercises of the allocator's
fixed-operand support: x86's `shl`/`shr`/`sar` need the count register
to be rcx, and the allocator pins the named `count` temp accordingly
without the user having to remember it.
Convert another batch of straightforward handlers to named DSL
temporaries: Increment, Decrement, Not, GetLexicalEnvironment, Return,
and End. The migration uncovers a pleasing property of the new
pipeline: when a migrated handler invokes an un-migrated macro
(pop_inline_frame_and_resume here, which still uses positional
t2/t3/t4 internally), the macro body's positional uses appear as
physical-register defs in the flat instruction list, and the
allocator's liveness analysis avoids placing live named temps in
those registers.
Convert the three boolean-conditional jump handlers to named temps,
exercising the 3-operand call_helper form so the input value and
truthy result no longer have to be pinned to t1/t0 by hand. The
allocator places `condition` in rcx (matching call_helper's input
convention) and `truthy` in rax (matching its output) automatically.
Convert Jump, JumpNullish, and JumpUndefined to declare named
temporaries instead of pinning values to t0..t2 directly. These three
handlers are macro-free and call-free, so the migration is mechanical
and the allocator picks register assignments equivalent in cost to
the hand-written positional code.
The other Jump* handlers (JumpIf/JumpTrue/JumpFalse) drive
call_helper, which still needs the value to be in t1 (rcx) on x86 by
the existing register convention. Migrating those needs DSL syntax to
attach call_helper's input/output to named temps; until that lands
they stay on the positional form.
These three trivial data-movement handlers are a useful first
migration target -- they exercise the new `temp` declaration syntax
and the register allocator end-to-end without interacting with macros
or fixed-operand constraints. The allocator picks t1-equivalent
registers for the first temp (which is the smallest live range that
must survive load_operand's rax/x9 scratch use) and reuses the slot
across non-overlapping live ranges where possible.
Introduce an allocator that runs whenever a handler -- or any macro it
transitively invokes -- declares named temporaries with `temp` /
`ftemp`. The allocator:
* Pre-expands macros into a flat instruction list and uniquifies
self-contained labels and `temp` declarations per macro
expansion, so two invocations of the same macro never share names.
* Computes per-instruction use/def/kill sets from InstructionInfo,
accounting for hidden clobbers, implicit register inputs/outputs,
and the all-caller-saved kill at non-terminal C++ calls.
* Runs iterative backward-dataflow liveness so branches and
macro-introduced loops are handled correctly.
* Greedily picks a physical register for each named temp from the
public DSL pool, honoring fixed-operand constraints (e.g. x86
shifts demand the count in rcx, divmod demands rax/rdx).
* Hard-errors when a temp cannot be placed instead of spilling.
* Rewrites operands so the existing codegen sees only physical
register names.
Handlers that don't use named temps continue to flow through the
existing recursive macro-expansion path, so generated assembly is
unchanged for the unmigrated asmint.asm. New unit tests cover
simple allocation, interference, fixed-operand pinning, double
declaration, positional-alias shadowing, calls killing live temps,
GPR/FPR pool separation, and macro-local uniquification.
Add the syntax that user-facing handler bodies will eventually use to
introduce named GPR and FPR temporaries:
    temp foo, bar
    ftemp baz
The parser already produces the right IR for these (an instruction
with the literal mnemonic `temp` / `ftemp` and a list of identifier
operands); both codegens now treat the mnemonics as no-ops so the
catch-all panic does not fire when handlers contain declarations.
The register allocator that consumes them is not yet wired up, so
existing positional usage of t0..t8 / ft0..ft3 continues to work.
Document the new syntax in the DSL reference and add parser tests
covering the single- and multi-name forms.
Drop t9..t17 from the aarch64 temporaries list. Those names mapped to
x9..x17, but the codegen unconditionally uses x9 as a scratch register
for materializing large immediates, dispatch tails, and pair-memory
base computations, and uses x10 as a secondary scratch in dispatch and
pair-memory paths. Exposing them as DSL names was a footgun: any user
that wrote `t9` would have its value silently overwritten the next time
the codegen needed scratch.
Document the reserved-scratch convention on the RegisterMapping doc
comment, and update the DSL reference in main.rs to the accurate
t0..t8 range.
Introduce a canonical table that describes every DSL instruction's
operand kinds, control-flow behavior, and per-architecture register
footprint: hidden scratch clobbers, implicit register inputs/outputs,
and hard-fixed operand register requirements (e.g. x86 shifts demand
the count in rcx, divmod writes rax/rdx).
The table will back a register allocator for named DSL temporaries.
Today nothing consumes it; this commit just lands the data and a few
unit tests that hold the table in sync with the codegen mnemonic set.
Move owned ArrayBuffer storage directly when transferring stream
buffers instead of copying the bytes before detaching the source.
WebAssembly memory continues to copy because its ArrayBuffer wraps
externally-owned storage.
Preserve the abrupt completion from DetachArrayBuffer before moving
storage so non-transferable buffers, such as WebAssembly.Memory-backed
views, still surface TypeError through stream operations instead of
aborting.
This saves ~130ms of main thread time when loading a YouTube video
on my Linux computer. :^)
WebAssembly.Memory-backed ArrayBuffers wrap external
ByteBuffer storage. When that memory grows,
ByteBuffer::try_resize() may realloc the backing storage while
old fixed-length buffer objects remain reachable from JS.
TypedArrayBase cached m_data for all fixed-length buffers, and
the asm interpreter fast path dereferenced that cached pointer
directly. For wasm memory views this could leave a stale
pointer behind across grow().
Restrict cached typed-array data pointers to fixed-length
ArrayBuffers that own stable ByteBuffer storage.
External/unowned buffers, including WebAssembly.Memory
buffers, now keep m_data == nullptr and fall back to code that
re-derives buffer().data() on each access.
Add regression tests for both the original shared-memory grow case
and the second-grow stale-view case.
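A sketch of the regression shape (details of the actual tests may
differ):

```js
const memory = new WebAssembly.Memory({ initial: 1, maximum: 4, shared: true });
const view = new Uint8Array(memory.buffer); // fixed-length view
view[0] = 42;
memory.grow(1); // ByteBuffer::try_resize() may realloc the storage
memory.grow(1); // second grow: the stale-view case
// The view must re-derive buffer().data() on access rather than read
// through a pointer cached before the reallocs:
console.log(view[0]); // 42
```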
invalidate_all_prototype_chains_leading_to_this used to scan every
prototype shape in the realm and walk each one's chain looking for
the mutated shape. That was O(N_prototype_shapes x chain_depth) per
mutation and showed up hot in real profiles when a page churned a
lot of prototype state during startup.
Each prototype shape now keeps a weak list of the prototype shapes
whose immediate [[Prototype]] points at the object that owns this
shape. The list is registered on prototype-shape creation
(clone_for_prototype, set_prototype_shape) and migrated to the new
prototype shape when the owning prototype object transitions to a
new shape. Invalidation is then a recursive walk over this direct-
child registry, costing O(transitive descendants).
Saves ~300 ms of main thread time when loading https://youtube.com/
on my Linux machine. :^)
No other engine defines a global `gc` function, so exposing it is an
observable difference of our engine. This traces back to the earliest
days of LibJS.
We now define `gc` in just the test-js and test262 runners.
indexed_take_first() already memmoves elements down for both Packed and
Holey storage, but the caller at ArrayPrototype::shift() only entered
the fast path for Packed arrays. Holey arrays fell through to the
spec-literal per-element loop (has_property / get / set /
delete_property_or_throw), which is substantially slower.
Add a separate Holey predicate with the additional safety checks the
spec semantics require: default_prototype_chain_intact() (so
HasProperty on a hole doesn't escape to a poisoned prototype) and
extensible() (so set() on a hole slot doesn't create a new own
property on a non-extensible object). The existing Packed predicate
is left unchanged -- packed arrays don't need these checks because
every index in [0, size) is already an own data property.
Allows us to fail at Cloudflare Turnstile much faster!
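For example:

```js
// Holey arrays now take the memmove fast path too, as long as the
// prototype chain is intact (a hole must not find an inherited "0")
// and the array is extensible (setting into a hole slot creates an
// own property):
const a = [0, , 2, 3];  // holey: index 1 is a hole
console.log(a.shift()); // 0
console.log(a.length);  // 3 -- [ <hole>, 2, 3 ], hole shifted down
```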
The element-by-element loop compiled to scalar 8-byte moves that the
compiler could not vectorize: source and destination alias, and strict
aliasing prevented hoisting the m_indexed_elements pointer load out of
the loop body. memmove collapses the shift into a single vectorized
copy.
Previously it used `realm.[[GlobalObject]]` instead of
`realm.[[GlobalEnv]].[[GlobalThisValue]]`.
In LibWeb, that corresponds to Window and WindowProxy respectively.
Shape::visit_edges used to walk every entry of m_property_table and
call PropertyKey::visit_edges on each key. For non-dictionary shapes
that work is redundant: m_property_table is a lazily built cache of
the transition chain, and every key it contains was originally
inserted as some ancestor shape's m_property_key, which is already
kept alive via m_previous.
Intrinsic shapes populated through add_property_without_transition()
in Intrinsics.cpp are not dictionaries and have no m_previous to
reach their keys through, but each of those keys is either a
vm.names.* string or a well-known symbol and is strongly rooted by
the VM for its whole lifetime, so skipping them here is safe too.
Measured on the main WebWorker used by https://www.maptiler.com/maps/
this cuts out ~98% of the PropertyKey::visit_edges calls made by
Shape::visit_edges each GC, reducing time spent in GC by ~1.3 seconds
on my Linux PC while initially loading the map.
Carry full source positions through the Rust bytecode source map so
stack traces and other bytecode-backed source lookups can use them
directly.
This keeps exception-heavy paths from reconstructing line and column
information through SourceCode::range_from_offsets(), which can spend a
lot of time building SourceCode's position cache on first use.
We're trading some space for time here, but I believe it's worth it
at this stage, as this saves ~250ms of main thread time while loading
https://x.com/ on my Linux machine. :^)
Reading the stored Position out of the source map directly also exposed
two things masked by the old range_from_offsets() path: a latent
off-by-one in Lexer::new_at_offset() (its consume() bumped line_column
past the character at offset; only synthesize_binding_pattern() hit it),
and a (1,1) fallback in range_from_offsets() that fired whenever the
queried range reached EOF. Fix the lexer, then rebaseline both the
bytecode dump tests (no more spurious "1:1") and the destructuring AST
tests (binding-pattern identifiers now report their real columns).
CompareArrayElements was calling ToString(x) +
PrimitiveString::create(vm, ...) on every comparison, producing a
fresh PrimitiveString that wrapped the original's AK::String but
carried no cached UTF-16. The subsequent IsLessThan then hit
PrimitiveString::utf16_string_view() on that fresh object, which
re-ran simdutf UTF-8 validation + UTF-8 -> UTF-16 conversion for
both sides on every one of the N log N comparisons.
When x and y are already String Values, ToString(x) and
ToPrimitive(x, Number) are the identity per spec, so we can drop
the IsLessThan detour entirely and compare their Utf16Views
directly. The original PrimitiveString caches its UTF-16 on first
access, so subsequent comparisons against the same element hit
the cache; Utf16View::operator<=> additionally gives us a memcmp
fast path when both sides ended up with short-ASCII UTF-16 storage.
Microbenchmark:
```js
function makeStrings(n) {
    let seed = 1234567;
    const rand = () => {
        seed = (seed * 1103515245 + 12345) & 0x7fffffff;
        return seed;
    };
    const out = new Array(n);
    for (let i = 0; i < n; i++)
        out[i] = "item_" + rand().toString(36)
                 + "_" + rand().toString(36);
    return out;
}

const base = makeStrings(100000);
const arr = base.slice();
arr.sort();
```
```
n        before        after       speedup
1k         0.70ms       0.30ms     2.3x
10k        8.33ms       3.33ms     2.5x
50k       49.33ms      17.33ms     2.8x
100k     118.00ms      45.00ms     2.6x
```
Generator::allocate_register used to scan the free pool to find the
lowest-numbered register and then Vec::remove it, making every
allocation O(n) in the size of the pool. When loading https://x.com/
on my Linux machine, we spent ~800ms in this function alone!
This policy only existed to match the C++ register allocation ordering
while transitioning the LibJS compiler from C++ to Rust, so we can
simply get rid of it. :^)
Drop the "always hand out the lowest-numbered free register" policy
and use the pool as a plain LIFO stack. Pushing and popping the back
of the Vec are both O(1), and peak register usage is unchanged, since
the policy only affects which specific register gets reused, not how
aggressively.