Commit Graph

39 Commits

Author SHA1 Message Date
Andreas Kling
4780d25df5 AsmIntGen: Avoid aarch64 helper-call bridge moves
The named 3-operand call_helper form lets the allocator see the
helper input and output as distinct temporaries. On aarch64 those
values do not overlap: the input dies at the call boundary, and the
output is born from the return value.

Pin both temporaries to x0, which is both the first AAPCS64 argument
register and the return register. This lets the aarch64 codegen omit
the old `mov x0, x1` bridge before named call_helper uses, while
leaving the legacy 1-operand convention alone.

Add an allocator test for the aarch64 pinning so the calling-convention
intent stays explicit.
2026-04-26 13:29:56 +02:00
Andreas Kling
f90710e571 AsmIntGen: Bias the allocator toward cheap x86_64 encodings
The greedy register allocator already preferred low-numbered GPRs by
virtue of the pool being listed in cost order, but the policy was
implicit and the temp-processing order was driven purely by live-range
length. That meant a long-lived but rarely-referenced temp could grab
`rax` and push a hot temp into a more expensive register.

Make the encoding-cost policy explicit and route hot temps to cheap
registers:

  * `register_cost` in registers.rs scores each physical register by
    encoding cost. `rax`/`eax` is cheapest because of the accumulator-
    form short encodings (`add eax, imm32` is one byte shorter than
    `add r/m32, imm32`). The other classic GPRs avoid the REX prefix
    in 32-bit forms. `r8`..`r15` always need a REX extension byte.
    aarch64 is cost-uniform.

  * The allocator now sorts named temps by use count first (so hot
    temps grab cheap registers), with live-range length as a
    tiebreaker. Greedy graph coloring is order-sensitive, so the
    `Call` handler -- which packs 51 temps into 9 registers -- needs
    the live-range-first order to color at all. We try cost-first and
    fall back to fit-first only when the first attempt cannot color.

  * Register selection now picks the cheapest available register
    explicitly via `min_by_key(register_cost)` instead of relying on
    pool position. Pool order breaks ties for determinism.
2026-04-26 13:29:56 +02:00
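The cost policy in the commit above can be sketched in Rust. This is a minimal model under assumed costs; the real `register_cost` table and pool in registers.rs differ in detail:

```rust
// Sketch of cost-biased register selection with a hypothetical cost
// model; not the actual registers.rs table.
fn register_cost(reg: &str) -> u32 {
    match reg {
        // Accumulator forms get short encodings (e.g. `add eax, imm32`).
        "rax" => 0,
        // Classic GPRs avoid the REX prefix in 32-bit forms.
        "rcx" | "rdx" | "rbx" | "rsi" | "rdi" => 1,
        // r8..r15 always need a REX extension byte.
        _ => 2,
    }
}

/// Pick the cheapest free register; pool order breaks cost ties,
/// which keeps the choice deterministic.
fn pick_register<'a>(pool: &[&'a str], taken: &[&str]) -> Option<&'a str> {
    pool.iter()
        .copied()
        .filter(|r| !taken.contains(r))
        .enumerate()
        .min_by_key(|&(pos, r)| (register_cost(r), pos))
        .map(|(_, r)| r)
}
```

With this model, `rax` wins even when it sits late in the pool, and once it is taken the classic GPRs tie on cost and pool order decides.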
Andreas Kling
928a9dfbf7 LibJS+AsmIntGen: Retire the positional t0..t8 / ft0..ft3 DSL aliases
asmint.asm no longer references any positional temp register name --
every handler and macro declares its temporaries by name with `temp` /
`ftemp` and lets the register allocator place them. Migrate the last
two holdout macros:

  * dispatch_current uses a macro-local `opcode` temp for the load8 +
    indirect jmp.
  * pop_inline_frame_and_resume names its return-pc, dst-index, value-
    address, vm-pointer, and executable temps explicitly.

With nothing left referring to the positional aliases, drop the
tN / ftN -> physical-register fallback from registers::resolve_register
and update the DSL reference comments at the top of asmint.asm and in
main.rs to describe the named-temp model. The two pre-existing codegen
tests that probed the old positional behavior get rewritten to use the
post-allocation physical-register names directly, since that is now
the actual contract of resolve_op.
2026-04-26 13:29:56 +02:00
Andreas Kling
9e6a205575 LibJS: Migrate Div and Mod handlers to named DSL temporaries
Convert the Div and Mod handlers to named DSL temporaries. Mod is the
first migrated handler that uses divmod, with its quot/rem pinned to
rax/rdx by fixed_operands and its dividend/divisor kept off both by
the operand-vs-implicit-output interference rule.
2026-04-26 13:29:56 +02:00
Andreas Kling
197ed3de24 AsmIntGen: Add register allocator for named DSL temporaries
Introduce an allocator that runs whenever a handler -- or any macro it
transitively invokes -- declares named temporaries with `temp` /
`ftemp`. The allocator:

  * Pre-expands macros into a flat instruction list and uniquifies
    self-contained labels and `temp` declarations per macro
    expansion, so two invocations of the same macro never share names.
  * Computes per-instruction use/def/kill sets from InstructionInfo,
    accounting for hidden clobbers, implicit register inputs/outputs,
    and the all-caller-saved kill at non-terminal C++ calls.
  * Runs iterative backward-dataflow liveness so branches and
    macro-introduced loops are handled correctly.
  * Greedily picks a physical register for each named temp from the
    public DSL pool, honoring fixed-operand constraints (e.g. x86
    shifts demand the count in rcx, divmod demands rax/rdx).
  * Hard-errors when a temp cannot be placed instead of spilling.
  * Rewrites operands so the existing codegen sees only physical
    register names.

Handlers that don't use named temps continue to flow through the
existing recursive macro-expansion path, so generated assembly is
unchanged for the unmigrated asmint.asm. New unit tests cover
simple allocation, interference, fixed-operand pinning, double
declaration, positional-alias shadowing, calls killing live temps,
GPR/FPR pool separation, and macro-local uniquification.
2026-04-26 13:29:56 +02:00
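The iterative backward-dataflow liveness pass described above can be sketched as follows. The `Inst` shape here (explicit `use_`/`def`/`succ` lists) is a hypothetical simplification of what InstructionInfo actually provides:

```rust
use std::collections::HashSet;

// Simplified instruction for the liveness sketch.
struct Inst {
    use_: Vec<&'static str>, // temps read
    def: Vec<&'static str>,  // temps written
    succ: Vec<usize>,        // successor indices (fallthrough, branches)
}

// Iterate to a fixed point so branches and macro-introduced loops
// converge: live-in = use ∪ (live-out − def).
fn liveness(insts: &[Inst]) -> Vec<HashSet<&'static str>> {
    let mut live_in: Vec<HashSet<&'static str>> = vec![HashSet::new(); insts.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for (i, inst) in insts.iter().enumerate().rev() {
            // live-out is the union of the successors' live-in sets.
            let mut out: HashSet<&'static str> = HashSet::new();
            for &s in &inst.succ {
                out.extend(live_in[s].iter().copied());
            }
            let mut inn: HashSet<&'static str> = inst.use_.iter().copied().collect();
            inn.extend(out.into_iter().filter(|t| !inst.def.contains(t)));
            if inn != live_in[i] {
                live_in[i] = inn;
                changed = true;
            }
        }
    }
    live_in
}
```

Two temps interfere when one is defined while the other is live, which is exactly the information the greedy coloring step consumes.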
Andreas Kling
63b356e38a AsmIntGen: Accept temp and ftemp declarations in the DSL
Add the syntax that user-facing handler bodies will eventually use to
introduce named GPR and FPR temporaries:

    temp foo, bar
    ftemp baz

The parser already produces the right IR for these (an instruction
with the literal mnemonic `temp` / `ftemp` and a list of identifier
operands); both codegens now treat the mnemonics as no-ops so the
catch-all panic does not fire when handlers contain declarations.
The register allocator that consumes them is not yet wired up, so
existing positional usage of t0..t8 / ft0..ft3 continues to work.

Document the new syntax in the DSL reference and add parser tests
covering the single- and multi-name forms.
2026-04-26 13:29:56 +02:00
Andreas Kling
90bc456305 AsmIntGen: Reserve aarch64 codegen scratch outside the DSL temp pool
Drop t9..t17 from the aarch64 temporaries list. Those names mapped to
x9..x17, but the codegen unconditionally uses x9 as a scratch register
for materializing large immediates, dispatch tails, and pair-memory
base computations, and uses x10 as a secondary scratch in dispatch and
pair-memory paths. Exposing them as DSL names was a footgun: any user
that wrote `t9` would have its value silently overwritten the next time
the codegen needed scratch.

Document the reserved-scratch convention on the RegisterMapping doc
comment, and update the DSL reference in main.rs to the accurate
t0..t8 range.
2026-04-26 13:29:56 +02:00
Andreas Kling
f1afb01345 AsmIntGen: Add InstructionInfo metadata table
Introduce a canonical table that describes every DSL instruction's
operand kinds, control-flow behavior, and per-architecture register
footprint: hidden scratch clobbers, implicit register inputs/outputs,
and hard-fixed operand register requirements (e.g. x86 shifts demand
the count in rcx, divmod writes rax/rdx).

The table will back a register allocator for named DSL temporaries.
Today nothing consumes it; this commit just lands the data and a few
unit tests that hold the table in sync with the codegen mnemonic set.
2026-04-26 13:29:56 +02:00
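A table like the one described above might be shaped roughly as follows. The field names, `Flow` enum, and the two sample entries are illustrative, not the actual Rust definitions:

```rust
// Illustrative shape for a per-instruction metadata table; the real
// InstructionInfo fields and entries may differ.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Flow { FallThrough, Branch, Terminal }

struct InstructionInfo {
    mnemonic: &'static str,
    flow: Flow,
    // Per-architecture register footprint:
    hidden_clobbers: &'static [&'static str],
    implicit_inputs: &'static [&'static str],
    implicit_outputs: &'static [&'static str],
    // Hard operand placement requirements: (operand index, register).
    fixed_operands: &'static [(usize, &'static str)],
}

const X86_64_TABLE: &[InstructionInfo] = &[
    InstructionInfo {
        mnemonic: "shl32",
        flow: Flow::FallThrough,
        hidden_clobbers: &[],
        implicit_inputs: &[],
        implicit_outputs: &[],
        fixed_operands: &[(1, "rcx")], // shift count must live in rcx
    },
    InstructionInfo {
        mnemonic: "divmod",
        flow: Flow::FallThrough,
        hidden_clobbers: &[],
        implicit_inputs: &["rax"],
        implicit_outputs: &["rax", "rdx"], // quotient / remainder
        fixed_operands: &[],
    },
];

fn lookup(mnemonic: &str) -> Option<&'static InstructionInfo> {
    X86_64_TABLE.iter().find(|i| i.mnemonic == mnemonic)
}
```

An allocator can then fold `hidden_clobbers` and the implicit sets into its per-instruction use/def/kill computation without special-casing mnemonics.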
Andreas Kling
583fa475fb LibJS: Call RawNativeFunction directly from asm Call
The asm interpreter already inlines ECMAScript calls, but builtin calls
still went through the generic C++ Call slow path even when the callee
was a plain native function pointer. That added an avoidable boundary
around hot builtin calls and kept asm from taking full advantage of the
new RawNativeFunction representation.

Teach the asm Call handler to recognize RawNativeFunction, allocate the
callee frame on the interpreter stack, copy the call-site arguments,
and jump straight to the stored C++ entry point.
NativeJavaScriptBackedFunction and other non-raw callees keep falling
through to the existing C++ slow path unchanged.
2026-04-15 15:57:48 +02:00
Andreas Kling
517812647a LibJS: Pack asm Call shared-data metadata
Pack the asm Call fast path metadata next to the executable pointer
so the interpreter can fetch both values with one paired load. This
removes several dependent shared-data loads from the hot path.

Keep the executable pointer and packed metadata in separate registers
across the `this` binding so the fast path can still use the paired-load
layout after any non-strict `this` adjustment.

Lower the packed metadata flag checks correctly on x86_64 as well.
Those bits now live above bit 31, so the generator uses bt for single-
bit high masks and covers that path with a unit test.

Add a runtime test that exercises both object and global this binding
through the asm Call fast path.
2026-04-14 12:37:12 +02:00
Andreas Kling
fa931612e1 LibJS: Pair-store the asm Call frame setup
Teach the asm Call fast path to use paired stores for the fixed
ExecutionContext header writes and for the caller linkage fields.
This also initializes the five reserved Value slots directly instead
of looping over them as part of the general register clear path.

That keeps the hot frame setup work closer to the actual data layout:
reserved registers are seeded with a couple of fixed stores, while the
remaining register and local slots are cleared in wider chunks.

On x86_64, keep the new explicit-offset formatting on store_pair*
and load_pair* without changing ordinary [base, index, scale]
operands into base-plus-index-plus-offset addresses. Add unit
tests covering both the paired zero-offset form and the preserved
scaled-index lowering.
2026-04-14 12:37:12 +02:00
Andreas Kling
fcbbc6a4b8 LibJS: Add paired stores to the AsmInt DSL
Teach AsmIntGen about store_pair32 and store_pair64 so hot handlers
can describe adjacent writes just as explicitly as adjacent reads.
The DSL now requires naming both memory operands and rejects
non-adjacent or reordered pairs at code generation time.

On aarch64 the new instructions lower to stp when the address is
encodable, while x86_64 keeps the same semantics with two scalar
stores. The shared validation keeps the paired access rules consistent
across both load and store primitives.
2026-04-14 12:37:12 +02:00
Andreas Kling
ce753047b0 LibJS: Add verifiable paired loads to the AsmInt DSL
Add load_pair32 and load_pair64 to the AsmInt DSL and make the
generator verify that both named memory operands are truly adjacent.
That keeps paired loads self-documenting in the DSL instead of
hiding the second field behind an implicit adjacency assumption.

AArch64 now lowers valid pairs to ldp when the address form allows
it, while x86_64 keeps the same behavior with two obvious scalar loads.
Add unit tests for the shared validator so reversed or non-adjacent
field pairs are rejected during code generation.
2026-04-14 12:37:12 +02:00
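The adjacency rule shared by the paired load and store primitives might be validated along these lines. This is a sketch; the real validator works from named field references, and the offsets in the test are made up:

```rust
// Sketch of the shared pair validator: the second field must start
// exactly `width` bytes after the first, in order. Reversed pairs are
// rejected separately so the error message can say what went wrong.
fn validate_pair(first_offset: u32, second_offset: u32, width: u32) -> Result<(), String> {
    if second_offset == first_offset + width {
        Ok(())
    } else if first_offset == second_offset + width {
        Err("paired operands are adjacent but reversed".into())
    } else {
        Err(format!(
            "paired operands are not adjacent: {} vs {}",
            first_offset, second_offset
        ))
    }
}
```

Running this at code generation time is what keeps the second field explicit in the DSL rather than an implicit adjacency assumption.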
Andreas Kling
4405c52042 LibJS: Zero-extend 32-bit AArch64 asm immediates
Teach the AArch64 AsmInt generator to materialize immediates through
w-register writes when the upper 32 bits are known zero.

That keeps the same x-register value while letting common constants
use shorter instruction sequences.
2026-04-14 08:14:43 +02:00
Andreas Kling
960a36db53 LibJS: Lower zero store immediates to zero registers on AArch64
Teach the AArch64 AsmInt generator to lower zero-immediate stores
through xzr or wzr instead of materializing a temporary register.

This covers store64 as well as the narrow store8, store16, and
store32 forms, keeping the generated code shorter on the zero
store fast path.
2026-04-14 08:14:43 +02:00
Andreas Kling
87797e9161 LibJS: Use tbz and tbnz for single-bit asm branches
AsmIntGen already lowers branch_zero and branch_nonzero to the compact
AArch64 branch-on-bit forms when possible, but branch_bits_set and
branch_bits_clear still expanded single-bit immediates into tst plus a
separate conditional branch.

Teach the AArch64 backend to recognize power-of-two masks and emit
tbnz or tbz directly. This shortens several hot interpreter paths.
2026-04-14 08:14:43 +02:00
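The mask recognition above comes down to a power-of-two test plus a bit index, which tbz/tbnz encode directly. A minimal sketch:

```rust
// A single-bit mask can branch with tbz/tbnz on the bit index:
// `count_ones() == 1` detects powers of two, and trailing_zeros()
// is the bit number the instruction tests.
fn single_bit_index(mask: u64) -> Option<u32> {
    if mask.count_ones() == 1 {
        Some(mask.trailing_zeros())
    } else {
        None
    }
}
```

Multi-bit masks fall back to the tst-plus-conditional-branch expansion.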
Andreas Kling
b1dab18e42 LibJS: Teach AsmIntGen helper primitives
Add load_vm, memory-operand macro substitution, and a generic
inc32_mem instruction to the AsmInt DSL.

Also drop redundant mov reg, reg copies in the backends so handlers
that use the new helpers expand to cleaner assembly.
2026-04-14 08:14:43 +02:00
Andreas Kling
2ca7dfa649 LibJS: Move bytecode interpreter state to VM
The bytecode interpreter only needed the running execution context,
but still threaded a separate Interpreter object through both the C++
and asm entry points. Move that state and the bytecode execution
helpers onto VM instead, and teach the asm generator and slow paths to
use VM directly.
2026-04-13 18:29:43 +02:00
Andreas Kling
114eeddea1 AsmIntGen: Avoid clobbering r11 in store_operand
The x86_64 asm interpreter mapped t8 to r11, but store_operand
also used r11 as its scratch register for operand loads. When a
handler stored a JS value from t8, the scratch load overwrote the
value first and wrote raw operand bits into the register file.
2026-04-10 15:12:53 +02:00
Andreas Kling
1ff61754a7 LibJS: Re-box double arithmetic results as Int32 when possible
When the asmint computes a double result for Add, Sub, Mul,
Math.floor, Math.ceil, or Math.sqrt, try to store it as Int32
if the value is a whole number in [INT32_MIN, INT32_MAX] and
not -0.0. This mirrors the JS::Value(double) constructor and
allows downstream int32 fast paths to fire.

Also add label uniquification to the DSL macro expander so the
same macro can be used multiple times in one handler without
label collisions.
2026-03-19 09:42:04 +01:00
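The re-boxing predicate described above can be modeled in Rust. This is a behavioral sketch of the check, not the generated assembly:

```rust
// The double must be a whole number in i32 range and not -0.0
// (negative zero must stay a double to preserve its semantics).
// NaN fails the trunc() self-equality test, infinities fail the
// range test.
fn reboxable_as_int32(d: f64) -> Option<i32> {
    if d.trunc() == d
        && d >= i32::MIN as f64
        && d <= i32::MAX as f64
        && !(d == 0.0 && d.is_sign_negative())
    {
        Some(d as i32)
    } else {
        None
    }
}
```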
Andreas Kling
5e403af5be LibJS: Tighten asmint ToInt32 boxing
Teach js_to_int32 to leave a clean low 32-bit result on success, then
use box_int32_clean in the ToInt32 fast path and adjacent boolean
coercions. This removes one instruction from the AArch64 fjcvtzs path
and trims the boolean boxing path without changing behavior.
2026-03-19 09:42:04 +01:00
Andreas Kling
645f481825 LibJS: Fast-path Float32Array indexed access
Add the small AsmIntGen float32 load, store, and conversion operations
needed to handle Float32Array directly in the AsmInt typed-array
GetByValue and PutByValue paths.

This covers direct indexed reads plus both int32 and double stores,
and adds regression coverage for Math.fround rounding, negative zero,
and NaN.
2026-03-19 09:42:04 +01:00
Luke Wilde
75725e283d LibJS/AsmIntGen: Emit unwind info for AArch64
This restores backtrace functionality on AArch64 when crossing a
bytecode call frame, where it would previously stop.
2026-03-16 19:30:40 -05:00
Luke Wilde
1378d37e92 LibJS/AsmIntGen: Emit unwind info for x86_64
This restores backtrace functionality on x86_64 when crossing a
bytecode call frame, where it would previously stop.
2026-03-16 19:30:40 -05:00
Andreas Kling
c3deaa4746 AsmIntGen: Panic on missing constants
Replace all unwrap_or(0) and parse().unwrap_or(0) calls in the
asmint code generator with expect()/panic! so that missing
constants or unparseable literals cause a build-time failure
instead of silently generating wrong code.
2026-03-08 23:04:55 +01:00
Andreas Kling
30d7b7db20 AsmIntGen: Fix aarch64 mul32_overflow leaving garbage in upper 32 bits
The smull instruction writes a 64-bit result to the destination
register. For negative results like 1 * -1 = -1, this means the
upper 32 bits are all 1s (sign extension of the 64-bit value).

The subsequent box_int32_clean assumed the upper 32 bits were
already zero, so it just set the NaN-boxing tag with movk. This
produced a corrupted Value where strict equality (===) would fail
even though the numeric value was correct.

Fix this by adding a mov wN, wN after the overflow check to
zero-extend the 32-bit result, matching what add32_overflow and
sub32_overflow already do by writing to W registers.
2026-03-08 23:04:55 +01:00
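The failure mode can be reproduced arithmetically. This is a Rust model of the register values involved, not the actual codegen:

```rust
// Model of what smull computes: a widening 32x32 -> 64 multiply, so
// a negative product sign-extends into the upper 32 bits.
fn smull(a: i32, b: i32) -> i64 {
    a as i64 * b as i64
}

// The fix (`mov wN, wN`) zero-extends the low 32 bits, leaving a
// clean upper half for box_int32_clean's movk.
fn zero_extend_low32(x: i64) -> u64 {
    x as u32 as u64
}
```

For `1 * -1`, the raw widened result is all-ones, so setting only the top 16 tag bits with movk would leave garbage in bits 32..47 of the boxed Value.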
Andreas Kling
d5eed2632f AsmInt: Add branch_zero32/branch_nonzero32 to the asmint DSL
These test only the low 32 bits of a register, replacing the previous
pattern of `and reg, 0xFFFFFFFF` followed by `branch_zero` or
`branch_nonzero`.

On aarch64 the old pattern emitted `mov w1, w1; cbnz x1` (2 insns),
now it's just `cbnz w1` (1 insn). Used in JumpIf, JumpTrue, JumpFalse,
and Not for the int32 truthiness fast path.
2026-03-08 23:04:55 +01:00
Andreas Kling
2db4d30e56 AsmIntGen: Stop maintaining w25 (pc) on every asmint dispatch on aarch64
x21 (instruction pointer = pb + pc) is already the primary dispatch
register. Maintaining w25 (the 32-bit pc offset) in parallel on every
dispatch_next, goto_handler, and dispatch_variable was redundant.

Compute the 32-bit pc on demand via `sub w1, w21, w26` only when
calling into C++ (slow paths), which is the cold path. This removes
one instruction from every hot dispatch sequence and every jump target.

The generated output shrinks from 4692 to 4345 lines (~347 instructions
removed), with every handler benefiting from shorter dispatch tails.
2026-03-08 23:04:55 +01:00
Andreas Kling
949454feb9 AsmIntGen: Pin {INT32,BOOLEAN,NAN_BASE}_TAG in callee-saved registers
These three tag constants (0x7FFA, 0x7FF9, 0x7FF8) exceed the 12-bit
cmp immediate range on aarch64, so every comparison required a mov+cmp
pair. Pin them in x22, x23, x24 (callee-saved, previously unused) to
turn ~160 two-instruction sequences into single cmp instructions.
2026-03-08 23:04:55 +01:00
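The encodability constraint behind this commit can be checked directly: the AArch64 cmp immediate is a 12-bit value, optionally shifted left by 12. A sketch of that test, assuming no other immediate forms:

```rust
// aarch64 `cmp` (ADDS/SUBS immediate) takes imm12, optionally LSL #12.
// The tag constants 0x7FFA/0x7FF9/0x7FF8 fit neither form, so without
// a pinned register every comparison needs a mov+cmp pair.
fn fits_cmp_imm(v: u64) -> bool {
    v <= 0xFFF || (v & 0xFFF == 0 && v <= 0xFFF << 12)
}
```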
Andreas Kling
368efef620 AsmIntGen: Support [pb, pc, field] three-operand memory access
Teach the DSL and both arch backends to handle memory operands of
the form [pb, pc, field_ref], meaning base + index + field_offset.

On aarch64, since x21 already caches pb + pc (the instruction
pointer), this emits a single `ldr dst, [x21, #offset]` instead of
the previous `mov t0, x21` + `ldr dst, [t0, #offset]` two-instruction
sequence.

On x86_64, this emits `[r14 + r13 + offset]` which is natively
supported by x86 addressing modes.

Convert all `lea t0, [pb, pc]` + `loadNN tX, [t0, field]` pairs in
the DSL to the new single-instruction form, saving one instruction
per IC access and other field loads in GetById, PutById, GetLength,
GetGlobal, SetGlobal, and CallBuiltin handlers.
2026-03-08 10:27:13 +01:00
Andreas Kling
8936cda523 AsmIntGen: Cache pb+pc in callee-saved x21 on aarch64
Pin x21 = pb + pc (the instruction pointer) as a callee-saved register
that survives C++ calls. x21 is set during dispatch and remains valid
throughout the entire handler.

This eliminates redundant `add x9, x26, x25` instructions from every
load_operand, store_operand, load_label, and dispatch_next sequence.
Also optimizes `lea dst, [pb, pc]` to `mov dst, x21`.

For dispatch_next, the next opcode is loaded via `ldrb w9, [x21, #size]`
and x21 is updated incrementally (`add x21, x21, #size`), which also
improves the dependency chain vs recomputing from x26 + x25.

dispatch_current is promoted from a DSL macro to a codegen instruction
so it can set x21 for the next handler.
2026-03-07 22:18:22 +01:00
Andreas Kling
8afa1df951 AsmIntGen: Pin canonical NaN in callee-saved d8 on aarch64
Load CANON_NAN_BITS into d8 (a callee-saved FP register) at
interpreter entry. This avoids materializing the 64-bit constant
in every canonicalize_nan cold fixup block.

Before: cold block was `movz x9, ... / movk x9, ... / b ret`
After:  cold block is just `fmov xD, d8 / b ret`

The hot path (fmov + fcmp + b.vs) is unchanged. The constant is
only needed when the result is actually NaN, which is rare, but
this still shrinks code size and avoids the multi-instruction
immediate materialization at 11 call sites.
2026-03-07 22:18:22 +01:00
Andreas Kling
e486ad2c0c AsmIntGen: Use platform-optimal codegen for NaN-boxing operations
Convert extract_tag, unbox_int32, unbox_object, box_int32, and
box_int32_clean from DSL macros into codegen instructions, allowing
each backend to emit optimal platform-specific code.

On aarch64, this produces significant improvements:

- extract_tag: single `lsr xD, xS, #48` instead of `mov` + `lsr`
  (3-operand shifts are free on ARM). Saves 1 instruction at 57
  call sites.

- unbox_object: single `and xD, xS, #0xffffffffffff` instead of
  `mov` + `shl` + `shr`. The 48-bit mask is a valid ARM64 logical
  immediate. Saves 2 instructions at 6 call sites.

- box_int32: `mov wD, wS` + `movk xD, #tag, lsl #48` instead of
  `mov` + `and 0xFFFFFFFF` + `movabs tag` + `or`. The w-register
  mov zero-extends, and movk overwrites just the top 16 bits.
  Saves 2 instructions and no longer clobbers t0 (rax).

- box_int32_clean: `movk xD, #tag, lsl #48` (1 instruction) instead
  of `mov` + `movabs tag` + `or` (saves 2 instructions, no t0
  clobber).

On x86_64, the generated code is equivalent to the old macros.
2026-03-07 22:18:22 +01:00
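The bit manipulation these instructions perform can be modeled in Rust. The tag value matches the INT32_TAG constant mentioned elsewhere in this log (0x7FFA); the function shapes are illustrative:

```rust
// Rust model of the NaN-boxing bit operations the codegen emits.
const INT32_TAG: u64 = 0x7FFA;

// `lsr xD, xS, #48` on aarch64: the tag lives in the top 16 bits.
fn extract_tag(boxed: u64) -> u64 {
    boxed >> 48
}

// `mov wD, wS` zero-extends the payload; `movk xD, #tag, lsl #48`
// overwrites just the top 16 bits.
fn box_int32(value: i32) -> u64 {
    (value as u32 as u64) | (INT32_TAG << 48)
}

// Truncating to 32 bits recovers the payload, sign and all.
fn unbox_int32(boxed: u64) -> i32 {
    boxed as u32 as i32
}
```

Round-tripping a negative value shows why the zero-extending w-register mov matters: the payload bits must not carry sign extension into the tag.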
Andreas Kling
3e7fa8b09a AsmInt: Use 32-bit NOT in BitwiseNot handler
Add a not32 DSL instruction that operates on the 32-bit sub-register,
zeroing the upper 32 bits (x86_64: not r32, aarch64: mvn w_reg).

Use it in BitwiseNot to avoid the sign-extension (unbox_int32), 64-bit
NOT, and explicit AND 0xFFFFFFFF. The 32-bit NOT produces a clean
upper half, so we can use box_int32_clean directly.

Before: movsxd + not r64 + and 0xFFFFFFFF + and 0xFFFFFFFF + or tag
After:  mov + not r32 + or tag
2026-03-07 22:18:22 +01:00
Andreas Kling
6492c88ad8 AsmIntGen: Elide redundant FP comparisons in consecutive branch_fp_*
When consecutive branch_fp_* instructions use the same operands (e.g.
branch_fp_unordered followed by branch_fp_equal), the 2nd ucomisd/fcmp
is redundant since flags are still valid from the first comparison.

Track the last FP comparison operands in HandlerState and skip the
comparison instruction when it would be identical. This is common in
the double_equality_compare macro which checks for unordered (NaN)
before testing equality.
2026-03-07 22:18:22 +01:00
Andreas Kling
472edb3448 AsmIntGen: Use mov r32 for unsigned 32-bit immediates on x86_64
Values in the range 0x80000000..0xFFFFFFFF were incorrectly emitted
as plain `mov r64, imm` which GAS encodes as a 10-byte movabs. Use
`mov r32, imm32` instead (5 bytes, implicitly zero-extends to 64
bits). This affects constants like ENVIRONMENT_COORDINATE_INVALID
(0xFFFFFFFE) which appeared 5 times in the generated assembly.
2026-03-07 22:18:22 +01:00
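The encoding decision reduces to a range check. A sketch, ignoring the sign-extended `mov r64, imm32` form that also exists for small negative constants:

```rust
// Any value that fits in 32 bits can use `mov r32, imm32` (5 bytes),
// which implicitly zero-extends to 64 bits. This covers the
// 0x80000000..0xFFFFFFFF range that a plain `mov r64, imm` would
// turn into a 10-byte movabs.
fn mov_encoding(imm: u64) -> &'static str {
    if imm <= 0xFFFF_FFFF {
        "mov r32, imm32"
    } else {
        "movabs r64, imm64"
    }
}
```

ENVIRONMENT_COORDINATE_INVALID (0xFFFFFFFE) lands in the newly handled range.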
Andreas Kling
c6fd52e317 AsmIntGen: Move NaN canonicalization to cold fixup blocks
canonicalize_nan previously emitted its full NaN fixup inline:
on x86_64, a 10-byte movabs + cmovp; on aarch64, a multi-instruction
mov sequence + fcsel. These were always on the hot path even though
NaN results from arithmetic are extremely rare.

Move the NaN fixup to a cold block emitted after the handler body.
The hot path is now just: movq/fmov + ucomisd/fcmp + jp/b.vs (a
forward branch predicted not-taken). This removes 14 bytes of
instructions from the hot path of every handler that produces
double results (Add, Sub, Mul, Div, and several builtins).

Both backends gain a HandlerState struct (shared between them) that
accumulates cold fixup blocks during code generation, emitted after
the main body.
2026-03-07 22:18:22 +01:00
Andreas Kling
5b8114a96b AsmInt: Use hardware overflow flag for int32 arithmetic
Replace the pattern of 64-bit arithmetic + sign-extend + compare
with dedicated 32-bit overflow instructions that use the hardware
overflow flag directly.

Before: add t3, t4 / unbox_int32 t5, t3 / branch_ne t3, t5, .overflow
After:  add32_overflow t3, t4, .overflow

On x86_64 this compiles to `add r32, r32; jo label` (the 32-bit
register write implicitly zeros the upper 32 bits). On aarch64,
`adds w, w, w; b.vs label` for add/sub, `smull + sxtw + cmp + b.ne`
for multiply, and `negs + b.vs` for negate.

Nine call sites updated: Add, Sub, Mul, Increment, Decrement,
PostfixIncrement, PostfixDecrement, UnaryMinus, and CallBuiltin(abs).
2026-03-07 22:18:22 +01:00
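The semantics of the overflow instructions can be modeled with checked arithmetic. A behavioral sketch, not the generated code:

```rust
// checked_add corresponds to the hardware overflow flag:
// `add r32, r32; jo label` on x86_64, `adds w,w,w; b.vs label`
// on aarch64. On overflow the handler branches to the slow path.
fn add32_overflow(a: i32, b: i32) -> Result<i32, &'static str> {
    a.checked_add(b).ok_or("overflow: take the slow path")
}
```

The 32-bit register write on x86_64 implicitly zeroing the upper half is what lets the success path box the result without a separate mask.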
Andreas Kling
9ae5445493 LibJS: Add AsmIntGen assembly interpreter code generator
AsmIntGen is a Rust tool that compiles a custom assembly DSL into
native x86_64 or aarch64 assembly (.S files). It reads struct field
offsets from a generated constants file and instruction layouts from
Bytecode.def (via the BytecodeDef crate) to emit platform-specific
code for each bytecode handler.

The DSL provides a portable instruction set with register aliases,
field access syntax, labels, conditionals, and calls. Each backend
(codegen_x86_64.rs, codegen_aarch64.rs) translates this into the
appropriate platform assembly with correct calling conventions
(SysV AMD64, AAPCS64).
2026-03-07 13:09:59 +01:00