Replace all unwrap_or(0) and parse().unwrap_or(0) calls in the
asmint code generator with expect()/panic! so that missing
constants or unparseable literals cause a build-time failure
instead of silently generating wrong code.
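The shape of the change, as a sketch — `constants` and `field_offset` here are hypothetical stand-ins for the generator's constant table, not its real API:

```rust
use std::collections::HashMap;

// Illustrative only: look up a struct-field offset for codegen.
// Before: constants.get(name).copied().unwrap_or(0) silently emitted
// offset 0 for a missing constant. After: fail the build loudly.
fn field_offset(constants: &HashMap<&str, i64>, name: &str) -> i64 {
    *constants
        .get(name)
        .unwrap_or_else(|| panic!("asmint: missing constant `{name}`"))
}
```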
The smull instruction writes the full 64-bit product to the destination
register. For negative results like 1 * -1 = -1, the upper 32 bits are
all 1s (the sign bits of the 64-bit product).
The subsequent box_int32_clean assumed the upper 32 bits were
already zero, so it just set the NaN-boxing tag with movk. This
produced a corrupted Value where strict equality (===) would fail
even though the numeric value was correct.
Fix this by adding a mov wN, wN after the overflow check to
zero-extend the 32-bit result, matching what add32_overflow and
sub32_overflow already do by writing to W registers.
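The corruption reproduces bit-for-bit in a few lines of Rust (the int32 tag value used here is an assumption for illustration; the fix corresponds to the `mov wN, wN` zero-extension):

```rust
// Simulate `movk xD, #tag, lsl #48`: overwrite only bits 48..63,
// leaving bits 0..47 untouched.
fn movk_lsl48(x: u64, imm16: u64) -> u64 {
    (x & 0x0000_FFFF_FFFF_FFFF) | (imm16 << 48)
}

const INT32_TAG: u64 = 0x7FFA; // assumed tag value, for illustration only

fn demo() -> (u64, u64) {
    let product = (1i64 * -1) as u64; // smull result: 0xFFFF_FFFF_FFFF_FFFF
    let buggy = movk_lsl48(product, INT32_TAG); // bits 32..47 are still 1s
    let fixed = movk_lsl48(product as u32 as u64, INT32_TAG); // mov wN, wN first
    (buggy, fixed)
}
```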
These branch instructions test only the low 32 bits of a register,
replacing the previous pattern of `and reg, 0xFFFFFFFF` followed by
`branch_zero` or `branch_nonzero`.
On aarch64 the old pattern emitted `mov w1, w1; cbnz x1` (2 insns),
now it's just `cbnz w1` (1 insn). Used in JumpIf, JumpTrue, JumpFalse,
and Not for the int32 truthiness fast path.
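The equivalence the new pattern relies on, sketched in Rust:

```rust
// `and reg, 0xFFFFFFFF` + branch_nonzero: mask, then test the full register.
fn old_nonzero(x: u64) -> bool {
    (x & 0xFFFF_FFFF) != 0
}

// `cbnz w1`: test the low 32 bits directly — same predicate, one instruction.
fn new_nonzero(x: u64) -> bool {
    (x as u32) != 0
}
```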
x21 (instruction pointer = pb + pc) is already the primary dispatch
register. Maintaining w25 (the 32-bit pc offset) in parallel on every
dispatch_next, goto_handler, and dispatch_variable was redundant.
Compute the 32-bit pc on demand via `sub w1, w21, w26` only when
calling into C++ (slow paths), which is the cold path. This removes
one instruction from every hot dispatch sequence and every jump target.
The generated output shrinks from 4692 to 4345 lines (~347 instructions
removed), with every handler benefiting from shorter dispatch tails.
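The on-demand computation is plain pointer arithmetic; a sketch in Rust, with the register roles taken from the text (x21 = pb + pc, x26 = pb) and the function name hypothetical:

```rust
// `sub w1, w21, w26`: recover the 32-bit pc offset from the pinned
// instruction pointer and the bytecode base, only on slow paths.
fn pc_from_ip(ip: u64, pb: u64) -> u32 {
    ip.wrapping_sub(pb) as u32
}
```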
These three tag constants (0x7FFA, 0x7FF9, 0x7FF8) exceed the 12-bit
cmp immediate range on aarch64, so every comparison required a mov+cmp
pair. Pin them in x22, x23, x24 (callee-saved, previously unused) to
turn ~160 two-instruction sequences into single cmp instructions.
Teach the DSL and both arch backends to handle memory operands of
the form [pb, pc, field_ref], meaning base + index + field_offset.
On aarch64, since x21 already caches pb + pc (the instruction
pointer), this emits a single `ldr dst, [x21, #offset]` instead of
the previous `mov t0, x21` + `ldr dst, [t0, #offset]` two-instruction
sequence.
On x86_64, this emits `[r14 + r13 + offset]` which is natively
supported by x86 addressing modes.
Convert all `lea t0, [pb, pc]` + `loadNN tX, [t0, field]` pairs in
the DSL to the new single-instruction form, saving one instruction
per IC access and other field loads in GetById, PutById, GetLength,
GetGlobal, SetGlobal, and CallBuiltin handlers.
Pin x21 = pb + pc (the instruction pointer) as a callee-saved register
that survives C++ calls. x21 is set during dispatch and remains valid
throughout the entire handler.
This eliminates redundant `add x9, x26, x25` instructions from every
load_operand, store_operand, load_label, and dispatch_next sequence.
Also optimizes `lea dst, [pb, pc]` to `mov dst, x21`.
For dispatch_next, the next opcode is loaded via `ldrb w9, [x21, #size]`
and x21 is updated incrementally (`add x21, x21, #size`), which also
improves the dependency chain vs recomputing from x26 + x25.
dispatch_current is promoted from a DSL macro to a codegen instruction
so it can set x21 for the next handler.
Load CANON_NAN_BITS into d8 (a callee-saved FP register) at
interpreter entry. This avoids materializing the 64-bit constant
in every canonicalize_nan cold fixup block.
Before: cold block was `movz x9, ... / movk x9, ... / b ret`
After: cold block is just `fmov xD, d8 / b ret`
The hot path (fmov + fcmp + b.vs) is unchanged. The constant is
only needed when the result is actually NaN, which is rare, but
this still shrinks code size and avoids the multi-instruction
immediate materialization at 11 call sites.
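What canonicalize_nan computes, as a sketch — the concrete CANON_NAN_BITS value below is an assumption (the standard quiet-NaN pattern); only its role is taken from the text:

```rust
const CANON_NAN_BITS: u64 = 0x7FF8_0000_0000_0000; // assumed canonical quiet NaN

// Hot path compares; the cold fixup (rarely taken) replaces the bits
// with the single canonical NaN pattern so NaN-boxed values stay valid.
fn canonicalize_nan(bits: u64) -> u64 {
    if f64::from_bits(bits).is_nan() {
        CANON_NAN_BITS
    } else {
        bits
    }
}
```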
Convert extract_tag, unbox_int32, unbox_object, box_int32, and
box_int32_clean from DSL macros into codegen instructions, allowing
each backend to emit optimal platform-specific code.
On aarch64, this produces significant improvements:
- extract_tag: single `lsr xD, xS, #48` instead of `mov` + `lsr`
(3-operand shifts are free on ARM). Saves 1 instruction at 57
call sites.
- unbox_object: single `and xD, xS, #0xffffffffffff` instead of
`mov` + `shl` + `shr`. The 48-bit mask is a valid ARM64 logical
immediate. Saves 2 instructions at 6 call sites.
- box_int32: `mov wD, wS` + `movk xD, #tag, lsl #48` instead of
`mov` + `and 0xFFFFFFFF` + `movabs tag` + `or`. The w-register
mov zero-extends, and movk overwrites just the top 16 bits.
Saves 2 instructions and no longer clobbers t0 (rax).
- box_int32_clean: `movk xD, #tag, lsl #48` (1 instruction) instead
of `mov` + `movabs tag` + `or` (saves 2 instructions, no t0
clobber).
On x86_64, the generated code is equivalent to the old macros.
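The aarch64 sequences above implement these bit manipulations, sketched in Rust (the tag value passed in is whatever constant the handler uses):

```rust
fn extract_tag(v: u64) -> u64 {
    v >> 48 // lsr xD, xS, #48
}

fn unbox_object(v: u64) -> u64 {
    v & 0x0000_FFFF_FFFF_FFFF // 48-bit mask: a valid ARM64 logical immediate
}

fn box_int32(tag: u64, x: i32) -> u64 {
    // mov wD, wS zero-extends; movk #tag, lsl #48 overwrites bits 48..63
    (x as u32 as u64) | (tag << 48)
}
```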
Add a not32 DSL instruction that operates on the 32-bit sub-register,
zeroing the upper 32 bits (x86_64: not r32, aarch64: mvn w_reg).
Use it in BitwiseNot to avoid the sign-extension (unbox_int32), 64-bit
NOT, and explicit AND 0xFFFFFFFF. The 32-bit NOT produces a clean
upper half, so we can use box_int32_clean directly.
Before: movsxd + not r64 + and 0xFFFFFFFF + and 0xFFFFFFFF + or tag
After: mov + not r32 + or tag
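The upper-half guarantee that lets BitwiseNot use box_int32_clean, in Rust terms:

```rust
// not r32 / mvn wD, wS: 32-bit NOT whose register write zeroes the
// upper 32 bits, so the result is already clean for tagging.
fn not32(x: u64) -> u64 {
    (!(x as u32)) as u64
}
```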
When consecutive branch_fp_* instructions use the same operands (e.g.
branch_fp_unordered followed by branch_fp_equal), the 2nd ucomisd/fcmp
is redundant since flags are still valid from the first comparison.
Track the last FP comparison operands in HandlerState and skip the
comparison instruction when it would be identical. This is common in
the double_equality_compare macro which checks for unordered (NaN)
before testing equality.
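A minimal sketch of the elision — HandlerState's real fields and methods may differ, and any flag-clobbering instruction must invalidate the cached operands:

```rust
#[derive(Default)]
struct HandlerState {
    last_fp_cmp: Option<(String, String)>,
}

impl HandlerState {
    /// Emit an fcmp/ucomisd unless flags are already valid for these operands.
    fn emit_fp_cmp(&mut self, lhs: &str, rhs: &str, out: &mut Vec<String>) {
        let key = (lhs.to_string(), rhs.to_string());
        if self.last_fp_cmp.as_ref() != Some(&key) {
            out.push(format!("fcmp {lhs}, {rhs}"));
            self.last_fp_cmp = Some(key);
        }
    }

    /// Anything that clobbers flags must reset the cache.
    fn clobber_flags(&mut self) {
        self.last_fp_cmp = None;
    }
}
```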
Values in the range 0x80000000..0xFFFFFFFF were incorrectly emitted
as plain `mov r64, imm` which GAS encodes as a 10-byte movabs. Use
`mov r32, imm32` instead (5 bytes, implicitly zero-extends to 64
bits). This affects constants like ENVIRONMENT_COORDINATE_INVALID
(0xFFFFFFFE) which appeared 5 times in the generated assembly.
canonicalize_nan previously emitted its full NaN fixup inline:
on x86_64, a 10-byte movabs + cmovp; on aarch64, a multi-instruction
mov sequence + fcsel. These were always on the hot path even though
NaN results from arithmetic are extremely rare.
Move the NaN fixup to a cold block emitted after the handler body.
The hot path is now just: movq/fmov + ucomisd/fcmp + jp/b.vs (a
forward branch predicted not-taken). This removes 14 bytes of
instructions from the hot path of every handler that produces
double results (Add, Sub, Mul, Div, and several builtins).
Both backends gain a HandlerState struct (shared between them) that
accumulates cold fixup blocks during code generation, emitted after
the main body.
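The accumulation mechanism can be sketched like this — names are illustrative, and the real cold bodies also branch back to the hot path:

```rust
#[derive(Default)]
struct ColdBlocks {
    blocks: Vec<String>,
}

impl ColdBlocks {
    /// Emit a forward branch (predicted not-taken) to a fresh cold label
    /// and queue the fixup body for later emission.
    fn defer(&mut self, body: &str, hot: &mut Vec<String>) -> String {
        let label = format!(".Lcold{}", self.blocks.len());
        hot.push(format!("jp {label}"));
        self.blocks.push(format!("{label}:\n{body}"));
        label
    }

    /// Flush after the handler body so cold code never sits on the hot path.
    fn flush(self, out: &mut Vec<String>) {
        out.extend(self.blocks);
    }
}
```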
Replace the pattern of 64-bit arithmetic + sign-extend + compare
with dedicated 32-bit overflow instructions that use the hardware
overflow flag directly.
Before: add t3, t4 / unbox_int32 t5, t3 / branch_ne t3, t5, .overflow
After: add32_overflow t3, t4, .overflow
On x86_64 this compiles to `add r32, r32; jo label` (the 32-bit
register write implicitly zeros the upper 32 bits). On aarch64,
`adds w, w, w; b.vs label` for add (`subs` for sub), `smull + sxtw + cmp + b.ne`
for multiply, and `negs + b.vs` for negate.
Nine call sites updated: Add, Sub, Mul, Increment, Decrement,
PostfixIncrement, PostfixDecrement, UnaryMinus, and CallBuiltin(abs).
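The semantics of add32_overflow, sketched with Rust's checked arithmetic (the name mirrors the DSL instruction; the Result shape is illustrative):

```rust
// add r32, r32; jo / adds w,w,w; b.vs — either take the overflow branch,
// or keep the 32-bit result, whose register write clears the upper half.
fn add32_overflow(a: i32, b: i32) -> Result<u64, ()> {
    match a.checked_add(b) {
        Some(r) => Ok(r as u32 as u64), // upper 32 bits implicitly zero
        None => Err(()),                // overflow branch taken
    }
}
```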
AsmIntGen is a Rust tool that compiles a custom assembly DSL into
native x86_64 or aarch64 assembly (.S files). It reads struct field
offsets from a generated constants file and instruction layouts from
Bytecode.def (via the BytecodeDef crate) to emit platform-specific
code for each bytecode handler.
The DSL provides a portable instruction set with register aliases,
field access syntax, labels, conditionals, and calls. Each backend
(codegen_x86_64.rs, codegen_aarch64.rs) translates this into the
appropriate platform assembly with correct calling conventions
(SysV AMD64, AAPCS64).