Commit Graph

39 Commits

Author SHA1 Message Date
Andreas Kling
4780d25df5 AsmIntGen: Avoid aarch64 helper-call bridge moves
The named 3-operand call_helper form lets the allocator see the
helper input and output as distinct temporaries. On aarch64 those
values do not overlap: the input dies at the call boundary, and the
output is born from the return value.

Pin both temporaries to x0, which is both the first AAPCS64 argument
register and the return register. This lets the aarch64 codegen omit
the old `mov x0, x1` bridge before named call_helper uses, while
leaving the legacy 1-operand convention alone.

Add an allocator test for the aarch64 pinning so the calling-convention
intent stays explicit.
2026-04-26 13:29:56 +02:00
Andreas Kling
f90710e571 AsmIntGen: Bias the allocator toward cheap x86_64 encodings
The greedy register allocator already preferred low-numbered GPRs by
virtue of the pool being listed in cost order, but the policy was
implicit and the temp-processing order was driven purely by live-range
length. That meant a long-lived but rarely-referenced temp could grab
`rax` and push a hot temp into a more expensive register.

Make the encoding-cost policy explicit and route hot temps to cheap
registers:

  * `register_cost` in registers.rs scores each physical register by
    encoding cost. `rax`/`eax` is cheapest because of the accumulator-
    form short encodings (`add eax, imm32` is one byte shorter than
    `add r/m32, imm32`). The other classic GPRs avoid the REX prefix
    in 32-bit forms. `r8`..`r15` always need a REX extension byte.
    aarch64 is cost-uniform.

  * The allocator now sorts named temps by use count first (so hot
    temps grab cheap registers), with live-range length as a
    tiebreaker. Greedy graph coloring is order-sensitive, so the
    `Call` handler -- which packs 51 temps into 9 registers -- needs
    the live-range-first order to color at all. We try cost-first and
    fall back to fit-first only when the first attempt cannot color.

  * Register selection now picks the cheapest available register
    explicitly via `min_by_key(register_cost)` instead of relying on
    pool position. Pool order breaks ties for determinism.
2026-04-26 13:29:56 +02:00
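The cost policy in the commit above can be sketched in Rust. This is a minimal model under assumed costs; the real `register_cost` table and pool in registers.rs differ in detail:

```rust
// Sketch of cost-biased register selection with a hypothetical cost
// model; not the actual registers.rs table.
fn register_cost(reg: &str) -> u32 {
    match reg {
        // Accumulator forms get short encodings (e.g. `add eax, imm32`).
        "rax" => 0,
        // Classic GPRs avoid the REX prefix in 32-bit forms.
        "rcx" | "rdx" | "rbx" | "rsi" | "rdi" => 1,
        // r8..r15 always need a REX extension byte.
        _ => 2,
    }
}

/// Pick the cheapest free register; pool order breaks cost ties,
/// which keeps the choice deterministic.
fn pick_register<'a>(pool: &[&'a str], taken: &[&str]) -> Option<&'a str> {
    pool.iter()
        .copied()
        .filter(|r| !taken.contains(r))
        .enumerate()
        .min_by_key(|&(pos, r)| (register_cost(r), pos))
        .map(|(_, r)| r)
}
```

With this model, `rax` wins even when it sits late in the pool, and once it is taken the classic GPRs tie on cost and pool order decides.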
Andreas Kling
928a9dfbf7 LibJS+AsmIntGen: Retire the positional t0..t8 / ft0..ft3 DSL aliases
asmint.asm no longer references any positional temp register name --
every handler and macro declares its temporaries by name with `temp` /
`ftemp` and lets the register allocator place them. Migrate the last
two holdout macros:

  * dispatch_current uses a macro-local `opcode` temp for the load8 +
    indirect jmp.
  * pop_inline_frame_and_resume names its return-pc, dst-index, value-
    address, vm-pointer, and executable temps explicitly.

With nothing left referring to the positional aliases, drop the
tN / ftN -> physical-register fallback from registers::resolve_register
and update the DSL reference comments at the top of asmint.asm and in
main.rs to describe the named-temp model. The two pre-existing codegen
tests that probed the old positional behavior get rewritten to use the
post-allocation physical-register names directly, since that is now
the actual contract of resolve_op.
2026-04-26 13:29:56 +02:00
Andreas Kling
9e6a205575 LibJS: Migrate Div and Mod handlers to named DSL temporaries
Convert the Div and Mod handlers to named DSL temporaries. Mod is the
first migrated handler that uses divmod, with its quot/rem pinned to
rax/rdx by fixed_operands and its dividend/divisor kept off both by
the operand-vs-implicit-output interference rule.
2026-04-26 13:29:56 +02:00
Andreas Kling
197ed3de24 AsmIntGen: Add register allocator for named DSL temporaries
Introduce an allocator that runs whenever a handler -- or any macro it
transitively invokes -- declares named temporaries with `temp` /
`ftemp`. The allocator:

  * Pre-expands macros into a flat instruction list and uniquifies
    self-contained labels and `temp` declarations per macro
    expansion, so two invocations of the same macro never share names.
  * Computes per-instruction use/def/kill sets from InstructionInfo,
    accounting for hidden clobbers, implicit register inputs/outputs,
    and the all-caller-saved kill at non-terminal C++ calls.
  * Runs iterative backward-dataflow liveness so branches and
    macro-introduced loops are handled correctly.
  * Greedily picks a physical register for each named temp from the
    public DSL pool, honoring fixed-operand constraints (e.g. x86
    shifts demand the count in rcx, divmod demands rax/rdx).
  * Hard-errors when a temp cannot be placed instead of spilling.
  * Rewrites operands so the existing codegen sees only physical
    register names.

Handlers that don't use named temps continue to flow through the
existing recursive macro-expansion path, so generated assembly is
unchanged for the unmigrated asmint.asm. New unit tests cover
simple allocation, interference, fixed-operand pinning, double
declaration, positional-alias shadowing, calls killing live temps,
GPR/FPR pool separation, and macro-local uniquification.
2026-04-26 13:29:56 +02:00
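The iterative backward-dataflow liveness pass described above can be sketched as follows. The `Inst` shape here (explicit `use_`/`def`/`succ` lists) is a hypothetical simplification of what InstructionInfo actually provides:

```rust
use std::collections::HashSet;

// Simplified instruction for the liveness sketch.
struct Inst {
    use_: Vec<&'static str>, // temps read
    def: Vec<&'static str>,  // temps written
    succ: Vec<usize>,        // successor indices (fallthrough, branches)
}

// Iterate to a fixed point so branches and macro-introduced loops
// converge: live-in = use ∪ (live-out − def).
fn liveness(insts: &[Inst]) -> Vec<HashSet<&'static str>> {
    let mut live_in: Vec<HashSet<&'static str>> = vec![HashSet::new(); insts.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for (i, inst) in insts.iter().enumerate().rev() {
            // live-out is the union of the successors' live-in sets.
            let mut out: HashSet<&'static str> = HashSet::new();
            for &s in &inst.succ {
                out.extend(live_in[s].iter().copied());
            }
            let mut inn: HashSet<&'static str> = inst.use_.iter().copied().collect();
            inn.extend(out.into_iter().filter(|t| !inst.def.contains(t)));
            if inn != live_in[i] {
                live_in[i] = inn;
                changed = true;
            }
        }
    }
    live_in
}
```

Two temps interfere when one is defined while the other is live, which is exactly the information the greedy coloring step consumes.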
Andreas Kling
63b356e38a AsmIntGen: Accept temp and ftemp declarations in the DSL
Add the syntax that user-facing handler bodies will eventually use to
introduce named GPR and FPR temporaries:

    temp foo, bar
    ftemp baz

The parser already produces the right IR for these (an instruction
with the literal mnemonic `temp` / `ftemp` and a list of identifier
operands); both codegens now treat the mnemonics as no-ops so the
catch-all panic does not fire when handlers contain declarations.
The register allocator that consumes them is not yet wired up, so
existing positional usage of t0..t8 / ft0..ft3 continues to work.

Document the new syntax in the DSL reference and add parser tests
covering the single- and multi-name forms.
2026-04-26 13:29:56 +02:00
Andreas Kling
90bc456305 AsmIntGen: Reserve aarch64 codegen scratch outside the DSL temp pool
Drop t9..t17 from the aarch64 temporaries list. Those names mapped to
x9..x17, but the codegen unconditionally uses x9 as a scratch register
for materializing large immediates, dispatch tails, and pair-memory
base computations, and uses x10 as a secondary scratch in dispatch and
pair-memory paths. Exposing them as DSL names was a footgun: any user
that wrote `t9` would have its value silently overwritten the next time
the codegen needed scratch.

Document the reserved-scratch convention on the RegisterMapping doc
comment, and update the DSL reference in main.rs to the accurate
t0..t8 range.
2026-04-26 13:29:56 +02:00
Andreas Kling
f1afb01345 AsmIntGen: Add InstructionInfo metadata table
Introduce a canonical table that describes every DSL instruction's
operand kinds, control-flow behavior, and per-architecture register
footprint: hidden scratch clobbers, implicit register inputs/outputs,
and hard-fixed operand register requirements (e.g. x86 shifts demand
the count in rcx, divmod writes rax/rdx).

The table will back a register allocator for named DSL temporaries.
Today nothing consumes it; this commit just lands the data and a few
unit tests that hold the table in sync with the codegen mnemonic set.
2026-04-26 13:29:56 +02:00
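A table like the one described above might be shaped roughly as follows. The field names, `Flow` enum, and the two sample entries are illustrative, not the actual Rust definitions:

```rust
// Illustrative shape for a per-instruction metadata table; the real
// InstructionInfo fields and entries may differ.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Flow { FallThrough, Branch, Terminal }

struct InstructionInfo {
    mnemonic: &'static str,
    flow: Flow,
    // Per-architecture register footprint:
    hidden_clobbers: &'static [&'static str],
    implicit_inputs: &'static [&'static str],
    implicit_outputs: &'static [&'static str],
    // Hard operand placement requirements: (operand index, register).
    fixed_operands: &'static [(usize, &'static str)],
}

const X86_64_TABLE: &[InstructionInfo] = &[
    InstructionInfo {
        mnemonic: "shl32",
        flow: Flow::FallThrough,
        hidden_clobbers: &[],
        implicit_inputs: &[],
        implicit_outputs: &[],
        fixed_operands: &[(1, "rcx")], // shift count must live in rcx
    },
    InstructionInfo {
        mnemonic: "divmod",
        flow: Flow::FallThrough,
        hidden_clobbers: &[],
        implicit_inputs: &["rax"],
        implicit_outputs: &["rax", "rdx"], // quotient / remainder
        fixed_operands: &[],
    },
];

fn lookup(mnemonic: &str) -> Option<&'static InstructionInfo> {
    X86_64_TABLE.iter().find(|i| i.mnemonic == mnemonic)
}
```

An allocator can then fold `hidden_clobbers` and the implicit sets into its per-instruction use/def/kill computation without special-casing mnemonics.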
Andreas Kling
583fa475fb LibJS: Call RawNativeFunction directly from asm Call
The asm interpreter already inlines ECMAScript calls, but builtin calls
still went through the generic C++ Call slow path even when the callee
was a plain native function pointer. That added an avoidable boundary
around hot builtin calls and kept asm from taking full advantage of the
new RawNativeFunction representation.

Teach the asm Call handler to recognize RawNativeFunction, allocate the
callee frame on the interpreter stack, copy the call-site arguments,
and jump straight to the stored C++ entry point.
NativeJavaScriptBackedFunction and other non-raw callees keep falling
through to the existing C++ slow path unchanged.
2026-04-15 15:57:48 +02:00
Andreas Kling
517812647a LibJS: Pack asm Call shared-data metadata
Pack the asm Call fast path metadata next to the executable pointer
so the interpreter can fetch both values with one paired load. This
removes several dependent shared-data loads from the hot path.

Keep the executable pointer and packed metadata in separate registers
across the `this` binding so the fast path can still use the paired-load
layout after any non-strict `this` adjustment.

Lower the packed metadata flag checks correctly on x86_64 as well.
Those bits now live above bit 31, so the generator uses bt for single-
bit high masks and covers that path with a unit test.

Add a runtime test that exercises both object and global this binding
through the asm Call fast path.
2026-04-14 12:37:12 +02:00
Andreas Kling
fa931612e1 LibJS: Pair-store the asm Call frame setup
Teach the asm Call fast path to use paired stores for the fixed
ExecutionContext header writes and for the caller linkage fields.
This also initializes the five reserved Value slots directly instead
of looping over them as part of the general register clear path.

That keeps the hot frame setup work closer to the actual data layout:
reserved registers are seeded with a couple of fixed stores, while the
remaining register and local slots are cleared in wider chunks.

On x86_64, keep the new explicit-offset formatting on store_pair*
and load_pair* without changing ordinary [base, index, scale]
operands into base-plus-index-plus-offset addresses. Add unit
tests covering both the paired zero-offset form and the preserved
scaled-index lowering.
2026-04-14 12:37:12 +02:00
Andreas Kling
fcbbc6a4b8 LibJS: Add paired stores to the AsmInt DSL
Teach AsmIntGen about store_pair32 and store_pair64 so hot handlers
can describe adjacent writes just as explicitly as adjacent reads.
The DSL now requires naming both memory operands and rejects
non-adjacent or reordered pairs at code generation time.

On aarch64 the new instructions lower to stp when the address is
encodable, while x86_64 keeps the same semantics with two scalar
stores. The shared validation keeps the paired access rules consistent
across both load and store primitives.
2026-04-14 12:37:12 +02:00
Andreas Kling
ce753047b0 LibJS: Add verifiable paired loads to the AsmInt DSL
Add load_pair32 and load_pair64 to the AsmInt DSL and make the
generator verify that both named memory operands are truly adjacent.
That keeps paired loads self-documenting in the DSL instead of
hiding the second field behind an implicit adjacency assumption.

AArch64 now lowers valid pairs to ldp when the address form allows
it, while x86_64 keeps the same behavior with two obvious scalar loads.
Add unit tests for the shared validator so reversed or non-adjacent
field pairs are rejected during code generation.
2026-04-14 12:37:12 +02:00
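The adjacency rule shared by the paired load and store primitives might be validated along these lines. This is a sketch; the real validator works from named field references, and the offsets in the test are made up:

```rust
// Sketch of the shared pair validator: the second field must start
// exactly `width` bytes after the first, in order. Reversed pairs are
// rejected separately so the error message can say what went wrong.
fn validate_pair(first_offset: u32, second_offset: u32, width: u32) -> Result<(), String> {
    if second_offset == first_offset + width {
        Ok(())
    } else if first_offset == second_offset + width {
        Err("paired operands are adjacent but reversed".into())
    } else {
        Err(format!(
            "paired operands are not adjacent: {} vs {}",
            first_offset, second_offset
        ))
    }
}
```

Running this at code generation time is what keeps the second field explicit in the DSL rather than an implicit adjacency assumption.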
Andreas Kling
4405c52042 LibJS: Zero-extend 32-bit AArch64 asm immediates
Teach the AArch64 AsmInt generator to materialize immediates through
w-register writes when the upper 32 bits are known zero.

That keeps the same x-register value while letting common constants
use shorter instruction sequences.
2026-04-14 08:14:43 +02:00
Andreas Kling
960a36db53 LibJS: Lower zero store immediates to zero registers on AArch64
Teach the AArch64 AsmInt generator to lower zero-immediate stores
through xzr or wzr instead of materializing a temporary register.

This covers store64 as well as the narrow store8, store16, and
store32 forms, keeping the generated code shorter on the zero
store fast path.
2026-04-14 08:14:43 +02:00
Andreas Kling
87797e9161 LibJS: Use tbz and tbnz for single-bit asm branches
AsmIntGen already lowers branch_zero and branch_nonzero to the compact
AArch64 branch-on-bit forms when possible, but branch_bits_set and
branch_bits_clear still expanded single-bit immediates into tst plus a
separate conditional branch.

Teach the AArch64 backend to recognize power-of-two masks and emit
tbnz or tbz directly. This shortens several hot interpreter paths.
2026-04-14 08:14:43 +02:00
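The mask recognition above comes down to a power-of-two test plus a bit index, which tbz/tbnz encode directly. A minimal sketch:

```rust
// A single-bit mask can branch with tbz/tbnz on the bit index:
// `count_ones() == 1` detects powers of two, and trailing_zeros()
// is the bit number the instruction tests.
fn single_bit_index(mask: u64) -> Option<u32> {
    if mask.count_ones() == 1 {
        Some(mask.trailing_zeros())
    } else {
        None
    }
}
```

Multi-bit masks fall back to the tst-plus-conditional-branch expansion.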
Andreas Kling
b1dab18e42 LibJS: Teach AsmIntGen helper primitives
Add load_vm, memory-operand macro substitution, and a generic
inc32_mem instruction to the AsmInt DSL.

Also drop redundant mov reg, reg copies in the backends so handlers
that use the new helpers expand to cleaner assembly.
2026-04-14 08:14:43 +02:00
Andreas Kling
2ca7dfa649 LibJS: Move bytecode interpreter state to VM
The bytecode interpreter only needed the running execution context,
but still threaded a separate Interpreter object through both the C++
and asm entry points. Move that state and the bytecode execution
helpers onto VM instead, and teach the asm generator and slow paths to
use VM directly.
2026-04-13 18:29:43 +02:00
Andreas Kling
114eeddea1 AsmIntGen: Avoid clobbering r11 in store_operand
The x86_64 asm interpreter mapped t8 to r11, but store_operand
also used r11 as its scratch register for operand loads. When a
handler stored a JS value from t8, the scratch load overwrote the
value first and wrote raw operand bits into the register file.
2026-04-10 15:12:53 +02:00
Andreas Kling
1ff61754a7 LibJS: Re-box double arithmetic results as Int32 when possible
When the asmint computes a double result for Add, Sub, Mul,
Math.floor, Math.ceil, or Math.sqrt, try to store it as Int32
if the value is a whole number in [INT32_MIN, INT32_MAX] and
not -0.0. This mirrors the JS::Value(double) constructor and
allows downstream int32 fast paths to fire.

Also add label uniquification to the DSL macro expander so the
same macro can be used multiple times in one handler without
label collisions.
2026-03-19 09:42:04 +01:00
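The re-boxing predicate described above can be modeled in Rust. This is a behavioral sketch of the check, not the generated assembly:

```rust
// The double must be a whole number in i32 range and not -0.0
// (negative zero must stay a double to preserve its semantics).
// NaN fails the trunc() self-equality test, infinities fail the
// range test.
fn reboxable_as_int32(d: f64) -> Option<i32> {
    if d.trunc() == d
        && d >= i32::MIN as f64
        && d <= i32::MAX as f64
        && !(d == 0.0 && d.is_sign_negative())
    {
        Some(d as i32)
    } else {
        None
    }
}
```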
Andreas Kling
5e403af5be LibJS: Tighten asmint ToInt32 boxing
Teach js_to_int32 to leave a clean low 32-bit result on success, then
use box_int32_clean in the ToInt32 fast path and adjacent boolean
coercions. This removes one instruction from the AArch64 fjcvtzs path
and trims the boolean boxing path without changing behavior.
2026-03-19 09:42:04 +01:00
Andreas Kling
645f481825 LibJS: Fast-path Float32Array indexed access
Add the small AsmIntGen float32 load, store, and conversion operations
needed to handle Float32Array directly in the AsmInt typed-array
GetByValue and PutByValue paths.

This covers direct indexed reads plus both int32 and double stores,
and adds regression coverage for Math.fround rounding, negative zero,
and NaN.
2026-03-19 09:42:04 +01:00
Luke Wilde
75725e283d LibJS/AsmIntGen: Emit unwind info for AArch64
This restores backtrace functionality on AArch64 when crossing a
bytecode call frame, where it would previously stop.
2026-03-16 19:30:40 -05:00
Luke Wilde
1378d37e92 LibJS/AsmIntGen: Emit unwind info for x86_64
This restores backtrace functionality on x86_64 when crossing a
bytecode call frame, where it would previously stop.
2026-03-16 19:30:40 -05:00
Andreas Kling
c3deaa4746 AsmIntGen: Panic on missing constants
Replace all unwrap_or(0) and parse().unwrap_or(0) calls in the
asmint code generator with expect()/panic! so that missing
constants or unparseable literals cause a build-time failure
instead of silently generating wrong code.
2026-03-08 23:04:55 +01:00
Andreas Kling
30d7b7db20 AsmIntGen: Fix aarch64 mul32_overflow leaving garbage in upper 32 bits
The smull instruction writes a 64-bit result to the destination
register. For negative results like 1 * -1 = -1, this means the
upper 32 bits are all 1s (sign extension of the 64-bit value).

The subsequent box_int32_clean assumed the upper 32 bits were
already zero, so it just set the NaN-boxing tag with movk. This
produced a corrupted Value where strict equality (===) would fail
even though the numeric value was correct.

Fix this by adding a mov wN, wN after the overflow check to
zero-extend the 32-bit result, matching what add32_overflow and
sub32_overflow already do by writing to W registers.
2026-03-08 23:04:55 +01:00
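The failure mode can be reproduced arithmetically. This is a Rust model of the register values involved, not the actual codegen:

```rust
// Model of what smull computes: a widening 32x32 -> 64 multiply, so
// a negative product sign-extends into the upper 32 bits.
fn smull(a: i32, b: i32) -> i64 {
    a as i64 * b as i64
}

// The fix (`mov wN, wN`) zero-extends the low 32 bits, leaving a
// clean upper half for box_int32_clean's movk.
fn zero_extend_low32(x: i64) -> u64 {
    x as u32 as u64
}
```

For `1 * -1`, the raw widened result is all-ones, so setting only the top 16 tag bits with movk would leave garbage in bits 32..47 of the boxed Value.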
Andreas Kling
d5eed2632f AsmInt: Add branch_zero32/branch_nonzero32 to the asmint DSL
These test only the low 32 bits of a register, replacing the previous
pattern of `and reg, 0xFFFFFFFF` followed by `branch_zero` or
`branch_nonzero`.

On aarch64 the old pattern emitted `mov w1, w1; cbnz x1` (2 insns),
now it's just `cbnz w1` (1 insn). Used in JumpIf, JumpTrue, JumpFalse,
and Not for the int32 truthiness fast path.
2026-03-08 23:04:55 +01:00
Andreas Kling
2db4d30e56 AsmIntGen: Stop maintaining w25 (pc) on every asmint dispatch on aarch64
x21 (instruction pointer = pb + pc) is already the primary dispatch
register. Maintaining w25 (the 32-bit pc offset) in parallel on every
dispatch_next, goto_handler, and dispatch_variable was redundant.

Compute the 32-bit pc on demand via `sub w1, w21, w26` only when
calling into C++ (slow paths), which is the cold path. This removes
one instruction from every hot dispatch sequence and every jump target.

The generated output shrinks from 4692 to 4345 lines (~347 instructions
removed), with every handler benefiting from shorter dispatch tails.
2026-03-08 23:04:55 +01:00
Andreas Kling
949454feb9 AsmIntGen: Pin {INT32,BOOLEAN,NAN_BASE}_TAG in callee-saved registers
These three tag constants (0x7FFA, 0x7FF9, 0x7FF8) exceed the 12-bit
cmp immediate range on aarch64, so every comparison required a mov+cmp
pair. Pin them in x22, x23, x24 (callee-saved, previously unused) to
turn ~160 two-instruction sequences into single cmp instructions.
2026-03-08 23:04:55 +01:00
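The encodability constraint behind this commit can be checked directly: the AArch64 cmp immediate is a 12-bit value, optionally shifted left by 12. A sketch of that test, assuming no other immediate forms:

```rust
// aarch64 `cmp` (ADDS/SUBS immediate) takes imm12, optionally LSL #12.
// The tag constants 0x7FFA/0x7FF9/0x7FF8 fit neither form, so without
// a pinned register every comparison needs a mov+cmp pair.
fn fits_cmp_imm(v: u64) -> bool {
    v <= 0xFFF || (v & 0xFFF == 0 && v <= 0xFFF << 12)
}
```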
Andreas Kling
368efef620 AsmIntGen: Support [pb, pc, field] three-operand memory access
Teach the DSL and both arch backends to handle memory operands of
the form [pb, pc, field_ref], meaning base + index + field_offset.

On aarch64, since x21 already caches pb + pc (the instruction
pointer), this emits a single `ldr dst, [x21, #offset]` instead of
the previous `mov t0, x21` + `ldr dst, [t0, #offset]` two-instruction
sequence.

On x86_64, this emits `[r14 + r13 + offset]` which is natively
supported by x86 addressing modes.

Convert all `lea t0, [pb, pc]` + `loadNN tX, [t0, field]` pairs in
the DSL to the new single-instruction form, saving one instruction
per IC access and other field loads in GetById, PutById, GetLength,
GetGlobal, SetGlobal, and CallBuiltin handlers.
2026-03-08 10:27:13 +01:00
Andreas Kling
8936cda523 AsmIntGen: Cache pb+pc in callee-saved x21 on aarch64
Pin x21 = pb + pc (the instruction pointer) as a callee-saved register
that survives C++ calls. x21 is set during dispatch and remains valid
throughout the entire handler.

This eliminates redundant `add x9, x26, x25` instructions from every
load_operand, store_operand, load_label, and dispatch_next sequence.
Also optimizes `lea dst, [pb, pc]` to `mov dst, x21`.

For dispatch_next, the next opcode is loaded via `ldrb w9, [x21, #size]`
and x21 is updated incrementally (`add x21, x21, #size`), which also
improves the dependency chain vs recomputing from x26 + x25.

dispatch_current is promoted from a DSL macro to a codegen instruction
so it can set x21 for the next handler.
2026-03-07 22:18:22 +01:00
Andreas Kling
8afa1df951 AsmIntGen: Pin canonical NaN in callee-saved d8 on aarch64
Load CANON_NAN_BITS into d8 (a callee-saved FP register) at
interpreter entry. This avoids materializing the 64-bit constant
in every canonicalize_nan cold fixup block.

Before: cold block was `movz x9, ... / movk x9, ... / b ret`
After:  cold block is just `fmov xD, d8 / b ret`

The hot path (fmov + fcmp + b.vs) is unchanged. The constant is
only needed when the result is actually NaN, which is rare, but
this still shrinks code size and avoids the multi-instruction
immediate materialization at 11 call sites.
2026-03-07 22:18:22 +01:00
Andreas Kling
e486ad2c0c AsmIntGen: Use platform-optimal codegen for NaN-boxing operations
Convert extract_tag, unbox_int32, unbox_object, box_int32, and
box_int32_clean from DSL macros into codegen instructions, allowing
each backend to emit optimal platform-specific code.

On aarch64, this produces significant improvements:

- extract_tag: single `lsr xD, xS, #48` instead of `mov` + `lsr`
  (3-operand shifts are free on ARM). Saves 1 instruction at 57
  call sites.

- unbox_object: single `and xD, xS, #0xffffffffffff` instead of
  `mov` + `shl` + `shr`. The 48-bit mask is a valid ARM64 logical
  immediate. Saves 2 instructions at 6 call sites.

- box_int32: `mov wD, wS` + `movk xD, #tag, lsl #48` instead of
  `mov` + `and 0xFFFFFFFF` + `movabs tag` + `or`. The w-register
  mov zero-extends, and movk overwrites just the top 16 bits.
  Saves 2 instructions and no longer clobbers t0 (rax).

- box_int32_clean: `movk xD, #tag, lsl #48` (1 instruction) instead
  of `mov` + `movabs tag` + `or` (saves 2 instructions, no t0
  clobber).

On x86_64, the generated code is equivalent to the old macros.
2026-03-07 22:18:22 +01:00
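The bit manipulation these instructions perform can be modeled in Rust. The tag value matches the INT32_TAG constant mentioned elsewhere in this log (0x7FFA); the function shapes are illustrative:

```rust
// Rust model of the NaN-boxing bit operations the codegen emits.
const INT32_TAG: u64 = 0x7FFA;

// `lsr xD, xS, #48` on aarch64: the tag lives in the top 16 bits.
fn extract_tag(boxed: u64) -> u64 {
    boxed >> 48
}

// `mov wD, wS` zero-extends the payload; `movk xD, #tag, lsl #48`
// overwrites just the top 16 bits.
fn box_int32(value: i32) -> u64 {
    (value as u32 as u64) | (INT32_TAG << 48)
}

// Truncating to 32 bits recovers the payload, sign and all.
fn unbox_int32(boxed: u64) -> i32 {
    boxed as u32 as i32
}
```

Round-tripping a negative value shows why the zero-extending w-register mov matters: the payload bits must not carry sign extension into the tag.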
Andreas Kling
3e7fa8b09a AsmInt: Use 32-bit NOT in BitwiseNot handler
Add a not32 DSL instruction that operates on the 32-bit sub-register,
zeroing the upper 32 bits (x86_64: not r32, aarch64: mvn w_reg).

Use it in BitwiseNot to avoid the sign-extension (unbox_int32), 64-bit
NOT, and explicit AND 0xFFFFFFFF. The 32-bit NOT produces a clean
upper half, so we can use box_int32_clean directly.

Before: movsxd + not r64 + and 0xFFFFFFFF + and 0xFFFFFFFF + or tag
After:  mov + not r32 + or tag
2026-03-07 22:18:22 +01:00
Andreas Kling
6492c88ad8 AsmIntGen: Elide redundant FP comparisons in consecutive branch_fp_*
When consecutive branch_fp_* instructions use the same operands (e.g.
branch_fp_unordered followed by branch_fp_equal), the 2nd ucomisd/fcmp
is redundant since flags are still valid from the first comparison.

Track the last FP comparison operands in HandlerState and skip the
comparison instruction when it would be identical. This is common in
the double_equality_compare macro which checks for unordered (NaN)
before testing equality.
2026-03-07 22:18:22 +01:00
Andreas Kling
472edb3448 AsmIntGen: Use mov r32 for unsigned 32-bit immediates on x86_64
Values in the range 0x80000000..0xFFFFFFFF were incorrectly emitted
as plain `mov r64, imm` which GAS encodes as a 10-byte movabs. Use
`mov r32, imm32` instead (5 bytes, implicitly zero-extends to 64
bits). This affects constants like ENVIRONMENT_COORDINATE_INVALID
(0xFFFFFFFE) which appeared 5 times in the generated assembly.
2026-03-07 22:18:22 +01:00
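The encoding decision reduces to a range check. A sketch, ignoring the sign-extended `mov r64, imm32` form that also exists for small negative constants:

```rust
// Any value that fits in 32 bits can use `mov r32, imm32` (5 bytes),
// which implicitly zero-extends to 64 bits. This covers the
// 0x80000000..0xFFFFFFFF range that a plain `mov r64, imm` would
// turn into a 10-byte movabs.
fn mov_encoding(imm: u64) -> &'static str {
    if imm <= 0xFFFF_FFFF {
        "mov r32, imm32"
    } else {
        "movabs r64, imm64"
    }
}
```

ENVIRONMENT_COORDINATE_INVALID (0xFFFFFFFE) lands in the newly handled range.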
Andreas Kling
c6fd52e317 AsmIntGen: Move NaN canonicalization to cold fixup blocks
canonicalize_nan previously emitted its full NaN fixup inline:
on x86_64, a 10-byte movabs + cmovp; on aarch64, a multi-instruction
mov sequence + fcsel. These were always on the hot path even though
NaN results from arithmetic are extremely rare.

Move the NaN fixup to a cold block emitted after the handler body.
The hot path is now just: movq/fmov + ucomisd/fcmp + jp/b.vs (a
forward branch predicted not-taken). This removes 14 bytes of
instructions from the hot path of every handler that produces
double results (Add, Sub, Mul, Div, and several builtins).

Both backends gain a HandlerState struct (shared between them) that
accumulates cold fixup blocks during code generation, emitted after
the main body.
2026-03-07 22:18:22 +01:00
Andreas Kling
5b8114a96b AsmInt: Use hardware overflow flag for int32 arithmetic
Replace the pattern of 64-bit arithmetic + sign-extend + compare
with dedicated 32-bit overflow instructions that use the hardware
overflow flag directly.

Before: add t3, t4 / unbox_int32 t5, t3 / branch_ne t3, t5, .overflow
After:  add32_overflow t3, t4, .overflow

On x86_64 this compiles to `add r32, r32; jo label` (the 32-bit
register write implicitly zeros the upper 32 bits). On aarch64,
`adds w, w, w; b.vs label` for add/sub, `smull + sxtw + cmp + b.ne`
for multiply, and `negs + b.vs` for negate.

Nine call sites updated: Add, Sub, Mul, Increment, Decrement,
PostfixIncrement, PostfixDecrement, UnaryMinus, and CallBuiltin(abs).
2026-03-07 22:18:22 +01:00
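The semantics of the overflow instructions can be modeled with checked arithmetic. A behavioral sketch, not the generated code:

```rust
// checked_add corresponds to the hardware overflow flag:
// `add r32, r32; jo label` on x86_64, `adds w,w,w; b.vs label`
// on aarch64. On overflow the handler branches to the slow path.
fn add32_overflow(a: i32, b: i32) -> Result<i32, &'static str> {
    a.checked_add(b).ok_or("overflow: take the slow path")
}
```

The 32-bit register write on x86_64 implicitly zeroing the upper half is what lets the success path box the result without a separate mask.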
Andreas Kling
9ae5445493 LibJS: Add AsmIntGen assembly interpreter code generator
AsmIntGen is a Rust tool that compiles a custom assembly DSL into
native x86_64 or aarch64 assembly (.S files). It reads struct field
offsets from a generated constants file and instruction layouts from
Bytecode.def (via the BytecodeDef crate) to emit platform-specific
code for each bytecode handler.

The DSL provides a portable instruction set with register aliases,
field access syntax, labels, conditionals, and calls. Each backend
(codegen_x86_64.rs, codegen_aarch64.rs) translates this into the
appropriate platform assembly with correct calling conventions
(SysV AMD64, AAPCS64).
2026-03-07 13:09:59 +01:00