Remove four fields that are trivially derivable from other fields
already present in the ExecutionContext:
- global_object (from realm)
- global_declarative_environment (from realm)
- identifier_table (from executable)
- property_key_table (from executable)
This shrinks ExecutionContext from 192 to 160 bytes (-17%).
The asmint's GetGlobal/SetGlobal handlers now load through the realm
pointer, taking advantage of the cached declarative environment
pointer added in the previous commit.
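For illustration, a minimal sketch of how the dropped fields are re-derived
(struct shapes and member names are hypothetical stand-ins, not the actual
LibJS declarations):

    struct Object { };
    struct IdentifierTable { };

    struct Realm {
        Object* global_object { nullptr };
    };

    struct Executable {
        IdentifierTable* identifier_table { nullptr };
    };

    struct ExecutionContext {
        Realm* realm { nullptr };
        Executable* executable { nullptr };

        // Previously stored as dedicated fields; now one extra load away.
        Object* global_object() const { return realm->global_object; }
        IdentifierTable* identifier_table() const { return executable->identifier_table; }
    };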
Realm now caches a direct pointer to the global declarative
environment record, updated when set_global_environment() is called.
This avoids an extra pointer chase through GlobalEnvironment in hot
paths like the asmint's GetGlobal/SetGlobal handlers.
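Roughly, the caching looks like this sketch (types are illustrative; only the
shape of the idea is taken from the commit):

    struct DeclarativeEnvironment { };

    struct GlobalEnvironment {
        DeclarativeEnvironment declarative_record;
    };

    class Realm {
    public:
        void set_global_environment(GlobalEnvironment& environment)
        {
            m_global_environment = &environment;
            // Cache the declarative record directly so hot paths skip the
            // extra load through GlobalEnvironment.
            m_global_declarative_environment = &environment.declarative_record;
        }

        DeclarativeEnvironment* global_declarative_environment() const
        {
            return m_global_declarative_environment;
        }

    private:
        GlobalEnvironment* m_global_environment { nullptr };
        DeclarativeEnvironment* m_global_declarative_environment { nullptr };
    };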
Replace the icu4c-based calendar implementation with one built on the
icu4x Rust crate (icu_calendar).
The icu4c API does not expose the píngqì month-assignment algorithm
used by the Chinese and Dangi lunisolar calendars. Our old code had to
approximate this by walking months via epoch millisecond arithmetic and
manually tracking leap month positions, which produced incorrect month
codes and ordinal month numbers for certain years. The icu4x calendar
crate handles píngqì natively.
With this patch, which maps the old ICU invocations onto icu4x almost 1-to-1,
we pass 100% of the Temporal test262 tests.
The end goal might be to use icu4x for all of our ICU needs. But it does
not yet provide the APIs needed for all ECMA-402 prototypes.
This adds international calendar support to our Temporal implementation,
using the Intl Era and Month Code Proposal as a guide. See:
https://tc39.es/proposal-intl-era-monthcode/
Same as commit f9fa548d43.
These are String from the outset, so this patch is almost entirely just
changing function parameter types. This will allow us to cache calendar
objects in ICU without incurring any extra allocations.
Replace all unwrap_or(0) and parse().unwrap_or(0) calls in the
asmint code generator with expect()/panic! so that missing
constants or unparseable literals cause a build-time failure
instead of silently generating wrong code.
The smull instruction writes a 64-bit result to the destination
register. For negative results like 1 * -1 = -1, this means the
upper 32 bits of the destination are all 1s (the sign extension of
the negative 32-bit result).
The subsequent box_int32_clean assumed the upper 32 bits were
already zero, so it just set the NaN-boxing tag with movk. This
produced a corrupted Value where strict equality (===) would fail
even though the numeric value was correct.
Fix this by adding a mov wN, wN after the overflow check to
zero-extend the 32-bit result, matching what add32_overflow and
sub32_overflow already do by writing to W registers.
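A worked example of the corruption in plain C++ (the tag value below is a
hypothetical stand-in for the engine's int32 NaN-boxing tag):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        constexpr uint64_t INT32_TAG = 0x7FFE; // illustrative, not the real tag

        // smull writes the full 64-bit product; for 1 * -1 that is 0xFFFFFFFFFFFFFFFF.
        uint64_t bits = uint64_t(int64_t(1) * int64_t(-1));

        // box_int32_clean only rewrites bits 48..63 (movk #tag, lsl #48),
        // so the stale 1s in bits 32..47 leak into the boxed Value.
        uint64_t corrupted = (bits & 0x0000FFFFFFFFFFFFULL) | (INT32_TAG << 48);

        // The fix zero-extends the 32-bit result first (mov wN, wN).
        uint64_t clean = (bits & 0xFFFFFFFFULL) | (INT32_TAG << 48);

        printf("corrupted: %016llx\nclean:     %016llx\n",
               (unsigned long long)corrupted, (unsigned long long)clean);
    }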
These test only the low 32 bits of a register, replacing the previous
pattern of `and reg, 0xFFFFFFFF` followed by `branch_zero` or
`branch_nonzero`.
On aarch64 the old pattern emitted `mov w1, w1; cbnz x1` (2 insns),
now it's just `cbnz w1` (1 insn). Used in JumpIf, JumpTrue, JumpFalse,
and Not for the int32 truthiness fast path.
x21 (instruction pointer = pb + pc) is already the primary dispatch
register. Maintaining w25 (the 32-bit pc offset) in parallel on every
dispatch_next, goto_handler, and dispatch_variable was redundant.
Compute the 32-bit pc on demand via `sub w1, w21, w26` only when
calling into C++ (slow paths), which is the cold path. This removes
one instruction from every hot dispatch sequence and every jump target.
The generated output shrinks from 4692 to 4345 lines (~347 instructions
removed), with every handler benefiting from shorter dispatch tails.
These three tag constants (0x7FFA, 0x7FF9, 0x7FF8) exceed the 12-bit
cmp immediate range on aarch64, so every comparison required a mov+cmp
pair. Pin them in x22, x23, x24 (callee-saved, previously unused) to
turn ~160 two-instruction sequences into single cmp instructions.
This error was found by asking an LLM to generate additional, related
test cases for the bug affecting https://volkswagen.de fixed in an
earlier commit.
An unconditional call to `copy_if_needed_to_preserve_evaluation_order`
at this site was showing up quite significantly in the JS benchmarks.
To avoid the regression, there is now a small heuristic that avoids the
unnecessary Mov instruction in the vast majority of cases. This is
likely not the best way to deal with this. But the changes in the
current patch set are focused on correctness, not performance. So I
opted for a localized, minimal-impact solution to the performance
regression.
This error was found by asking an LLM to generate additional, related
test cases for the bug affecting https://volkswagen.de fixed in an
earlier commit.
This error was found by asking an LLM to generate additional, related
test cases for the bug affecting https://volkswagen.de fixed in an
earlier commit.
`copy_if_needed_to_preserve_evaluation_order` was introduced in
c372a084a2. At that point function
arguments still needed to be copied into registers with a special
`GetArgument` instruction. Later, in
3f04d18ef7, this was changed and arguments
became their own operand type that can be accessed directly instead.
Similar to locals, arguments can also be overwritten due to evaluation
order in various scenarios. However, the function was never updated to
account for that. Rectify that here.
With this change, https://volkswagen.de no longer gets blanked shortly
after initial load and the unhandled JS exception spam on that site is
gone too.
The last time a new operand type was added, its effect on the function
changed in this commit was seemingly overlooked, introducing a bug. To
avoid such errors in the future, rewrite the code to produce a
compile-time error whenever a new operand type is added.
No functional changes yet; the actual bugfix will come in a follow-up
commit.
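A minimal sketch of the exhaustiveness pattern (the enum and names are
hypothetical; the actual mechanism in the codebase may differ):

    #include <cstdio>

    enum class OperandType { Register, Local, Argument };

    const char* describe(OperandType type)
    {
        // No default: case. With -Wswitch (or -Werror=switch), adding a new
        // OperandType enumerator makes this switch fail to compile until it
        // is updated to handle the new operand kind.
        switch (type) {
        case OperandType::Register:
            return "register";
        case OperandType::Local:
            return "local";
        case OperandType::Argument:
            return "argument";
        }
        return nullptr; // unreachable when the switch above is exhaustive
    }

    int main()
    {
        printf("%s\n", describe(OperandType::Argument));
    }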
Teach the DSL and both arch backends to handle memory operands of
the form [pb, pc, field_ref], meaning base + index + field_offset.
On aarch64, since x21 already caches pb + pc (the instruction
pointer), this emits a single `ldr dst, [x21, #offset]` instead of
the previous `mov t0, x21` + `ldr dst, [t0, #offset]` two-instruction
sequence.
On x86_64, this emits `[r14 + r13 + offset]` which is natively
supported by x86 addressing modes.
Convert all `lea t0, [pb, pc]` + `loadNN tX, [t0, field]` pairs in
the DSL to the new single-instruction form, saving one instruction
per IC access and other field loads in GetById, PutById, GetLength,
GetGlobal, SetGlobal, and CallBuiltin handlers.
Instead of storing a u32 index into a cache vector and looking up the
cache at runtime through a chain of dependent loads (load Executable*,
load vector data pointer, multiply index, add), store the actual cache
pointer as a u64 directly in the instruction stream.
A fixup pass (Executable::fixup_cache_pointers()) runs after Executable
construction in both the Rust and C++ pipelines, walking the bytecode
and replacing each index with the corresponding pointer.
The cache pointer type is encoded in Bytecode.def (e.g.
PropertyLookupCache*, GlobalVariableCache*) so the fixup switch is
auto-generated by the Python Op code generator, making it impossible
to forget updating the fixup when adding new cached instructions.
This eliminates 3-4 dependent loads on every inline cache access in
both the C++ interpreter and the assembly interpreter.
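A simplified sketch of the fixup (types and slot layout are illustrative; the
real switch is generated from Bytecode.def):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct PropertyLookupCache { /* shape, prototype, offset, ... */ };

    // At codegen time the u64-sized operand slot holds a u32 cache index;
    // once the Executable (and its cache vector) exists, rewrite the slot
    // to hold the cache's address instead.
    void fixup_cache_operand(uint8_t* operand_slot, std::vector<PropertyLookupCache>& caches)
    {
        uint32_t index = 0;
        std::memcpy(&index, operand_slot, sizeof(index));

        uint64_t pointer = reinterpret_cast<uint64_t>(&caches[index]);
        std::memcpy(operand_slot, &pointer, sizeof(pointer));
    }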
Property lookup cache entries previously used GC::Weak<T> for shape,
prototype, and prototype_chain_validity pointers. Each GC::Weak
requires a ref-counted WeakImpl allocation and an extra indirection
on every access.
Replace these with GC::RawPtr<T> and make Executable a WeakContainer
so the GC can clear stale pointers during sweep via remove_dead_cells.
For static PropertyLookupCache instances (used throughout the runtime
for well-known property lookups), introduce StaticPropertyLookupCache
which registers itself in a global list that also gets swept.
Now that inline cache entries use GC::RawPtr instead of GC::Weak,
we can compare shape/prototype pointers directly without going
through the WeakImpl indirection. This removes one dependent load
from each IC check in GetById, PutById, GetLength, GetGlobal, and
SetGlobal handlers.
SimpleIndexedPropertyStorage can only hold default-attributed data
properties. Any attempt to store a property with non-default
attributes (such as accessors) triggers conversion to
GenericIndexedPropertyStorage first. So when we've already verified
is_simple_storage, the accessor check is dead code.
Instead of calling into C++ helpers for global let/const variable
access, inline the binding lookup directly in the asm handlers.
This avoids the overhead of a C++ call for the common case.
Module environments still use the C++ helper since they require
additional lookups that aren't worth inlining.
Pin x21 = pb + pc (the instruction pointer) as a callee-saved register
that survives C++ calls. x21 is set during dispatch and remains valid
throughout the entire handler.
This eliminates redundant `add x9, x26, x25` instructions from every
load_operand, store_operand, load_label, and dispatch_next sequence.
Also optimizes `lea dst, [pb, pc]` to `mov dst, x21`.
For dispatch_next, the next opcode is loaded via `ldrb w9, [x21, #size]`
and x21 is updated incrementally (`add x21, x21, #size`), which also
improves the dependency chain vs recomputing from x26 + x25.
dispatch_current is promoted from a DSL macro to a codegen instruction
so it can set x21 for the next handler.
Load CANON_NAN_BITS into d8 (a callee-saved FP register) at
interpreter entry. This avoids materializing the 64-bit constant
in every canonicalize_nan cold fixup block.
Before: cold block was `movz x9, ... / movk x9, ... / b ret`
After: cold block is just `fmov xD, d8 / b ret`
The hot path (fmov + fcmp + b.vs) is unchanged. The constant is
only needed when the result is actually NaN, which is rare, but
this still shrinks code size and avoids the multi-instruction
immediate materialization at 11 call sites.
Convert extract_tag, unbox_int32, unbox_object, box_int32, and
box_int32_clean from DSL macros into codegen instructions, allowing
each backend to emit optimal platform-specific code.
On aarch64, this produces significant improvements:
- extract_tag: single `lsr xD, xS, #48` instead of `mov` + `lsr`
(3-operand shifts are free on ARM). Saves 1 instruction at 57
call sites.
- unbox_object: single `and xD, xS, #0xffffffffffff` instead of
`mov` + `shl` + `shr`. The 48-bit mask is a valid ARM64 logical
immediate. Saves 2 instructions at 6 call sites.
- box_int32: `mov wD, wS` + `movk xD, #tag, lsl #48` instead of
`mov` + `and 0xFFFFFFFF` + `movabs tag` + `or`. The w-register
mov zero-extends, and movk overwrites just the top 16 bits.
Saves 2 instructions and no longer clobbers t0 (rax).
- box_int32_clean: `movk xD, #tag, lsl #48` (1 instruction) instead
of `mov` + `movabs tag` + `or` (saves 2 instructions, no t0
clobber).
On x86_64, the generated code is equivalent to the old macros.
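For reference, the operations these instructions implement, written out in C++
(the tag constant is a placeholder, not the engine's actual value):

    #include <cstdint>

    constexpr uint64_t INT32_TAG = 0x7FFE; // illustrative 16-bit tag

    constexpr uint16_t extract_tag(uint64_t boxed) { return uint16_t(boxed >> 48); }

    // The low 48 bits of an object value hold the pointer; the mask is a valid
    // ARM64 logical immediate, hence the single-instruction `and`.
    constexpr uint64_t unbox_object(uint64_t boxed) { return boxed & 0x0000FFFFFFFFFFFFULL; }

    constexpr int32_t unbox_int32(uint64_t boxed) { return int32_t(uint32_t(boxed)); }

    // box_int32: zero-extend the 32-bit payload, then place the tag in bits 48..63.
    constexpr uint64_t box_int32(int32_t value)
    {
        return uint64_t(uint32_t(value)) | (INT32_TAG << 48);
    }

    // box_int32_clean: same, but the caller guarantees the upper 32 bits are already zero.
    constexpr uint64_t box_int32_clean(uint64_t zero_extended)
    {
        return zero_extended | (INT32_TAG << 48);
    }

    static_assert(extract_tag(box_int32(-1)) == INT32_TAG);
    static_assert(unbox_int32(box_int32(-1)) == -1);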
UnsignedRightShift: after shr on a zero-extended value, upper bits are
already clear.
GetByValue typed array path: load32/load8/load16/load8s/load16s all
write to 32-bit destination registers, zeroing the upper 32 bits.
Both can use box_int32_clean to skip the redundant AND 0xFFFFFFFF.
Add a not32 DSL instruction that operates on the 32-bit sub-register,
zeroing the upper 32 bits (x86_64: not r32, aarch64: mvn w_reg).
Use it in BitwiseNot to avoid the sign-extension (unbox_int32), 64-bit
NOT, and explicit AND 0xFFFFFFFF. The 32-bit NOT produces a clean
upper half, so we can use box_int32_clean directly.
Before: movsxd + not r64 + and 0xFFFFFFFF + and 0xFFFFFFFF + or tag
After: mov + not r32 + or tag
When consecutive branch_fp_* instructions use the same operands (e.g.
branch_fp_unordered followed by branch_fp_equal), the 2nd ucomisd/fcmp
is redundant since flags are still valid from the first comparison.
Track the last FP comparison operands in HandlerState and skip the
comparison instruction when it would be identical. This is common in
the double_equality_compare macro which checks for unordered (NaN)
before testing equality.
In JumpIf, JumpTrue, JumpFalse, and Not, the int32 zero-test path
copied the value to a temporary before masking: mov t3, t1; and t3,
0xFFFFFFFF; branch_zero t3. Since t1 is dead after the test, operate
on it directly: and t1, 0xFFFFFFFF; branch_zero t1. Saves one mov
instruction per handler on the int32 truthiness path.
Add box_int32_clean for sites where the upper 32 bits are already
known to be zero, skipping the redundant zero-extension. On x86_64,
32-bit register writes (add esi, edi; neg esi; etc.) implicitly
clear the upper 32 bits, making the truncation in box_int32
unnecessary.
Use box_int32_clean at 9 call sites after add32_overflow,
sub32_overflow, mul32_overflow, and neg32_overflow, saving one
instruction per site on the hot int32 arithmetic paths.
Values in the range 0x80000000..0xFFFFFFFF were incorrectly emitted
as plain `mov r64, imm` which GAS encodes as a 10-byte movabs. Use
`mov r32, imm32` instead (5 bytes, implicitly zero-extends to 64
bits). This affects constants like ENVIRONMENT_COORDINATE_INVALID
(0xFFFFFFFE) which appeared 5 times in the generated assembly.
canonicalize_nan previously emitted its full NaN fixup inline:
on x86_64, a 10-byte movabs + cmovp; on aarch64, a multi-instruction
mov sequence + fcsel. These were always on the hot path even though
NaN results from arithmetic are extremely rare.
Move the NaN fixup to a cold block emitted after the handler body.
The hot path is now just: movq/fmov + ucomisd/fcmp + jp/b.vs (a
forward branch predicted not-taken). This removes 14 bytes of
instructions from the hot path of every handler that produces
double results (Add, Sub, Mul, Div, and several builtins).
Both backends gain a HandlerState struct (shared between them) that
accumulates cold fixup blocks during code generation, emitted after
the main body.
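In C++ terms, the split behaves roughly like this (the bit pattern shown is the
conventional quiet-NaN encoding; the engine's CANON_NAN_BITS may differ):

    #include <bit>
    #include <cstdint>

    constexpr uint64_t CANON_NAN_BITS = 0x7FF8000000000000ULL; // illustrative

    uint64_t canonicalize_nan(double result)
    {
        // Hot path: one comparison; only NaN is unordered with itself.
        if (result == result) [[likely]]
            return std::bit_cast<uint64_t>(result);

        // Cold path (rare): materialize the canonical NaN bit pattern.
        return CANON_NAN_BITS;
    }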
Replace the check_is_double pattern that loaded the full 64-bit
CANON_NAN_BITS constant (10-byte movabs on x86_64) and masked the
entire value, with a cheaper approach: extract the upper 16-bit tag
and check if (tag & NAN_BASE_TAG) == NAN_BASE_TAG.
This saves instructions at every double-check site. Additionally,
add a check_tag_is_double macro for call sites where the tag has
already been extracted into a register, avoiding redundant
extract_tag operations. This is used in 11 call sites across
coerce_to_doubles, strict_equality_core, numeric_compare, Div,
UnaryPlus, UnaryMinus, and ToInt32.
Replace the pattern of 64-bit arithmetic + sign-extend + compare
with dedicated 32-bit overflow instructions that use the hardware
overflow flag directly.
Before: add t3, t4 / unbox_int32 t5, t3 / branch_ne t3, t5, .overflow
After: add32_overflow t3, t4, .overflow
On x86_64 this compiles to `add r32, r32; jo label` (the 32-bit
register write implicitly zeros the upper 32 bits). On aarch64,
`adds w, w, w; b.vs label` for add/sub, `smull + sxtw + cmp + b.ne`
for multiply, and `negs + b.vs` for negate.
Nine call sites updated: Add, Sub, Mul, Increment, Decrement,
PostfixIncrement, PostfixDecrement, UnaryMinus, and CallBuiltin(abs).
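In C++ terms, the int32 fast path now behaves like this sketch (using the
GCC/Clang overflow builtins; function names are illustrative):

    #include <cstdint>

    // Roughly what `add32_overflow dst, src, .overflow` expresses: a 32-bit add
    // that bails to the slow path when the hardware overflow flag is set.
    bool try_add_int32(int32_t lhs, int32_t rhs, int32_t& result)
    {
        return !__builtin_add_overflow(lhs, rhs, &result);
    }

    // neg32_overflow: negation only overflows for INT32_MIN.
    bool try_negate_int32(int32_t value, int32_t& result)
    {
        return !__builtin_sub_overflow(0, value, &result);
    }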
Add a new interpreter that executes bytecode via generated assembly,
written in a custom DSL (asmint.asm) that AsmIntGen compiles to
native x86_64 or aarch64 code.
The interpreter keeps the bytecode program counter and register file
pointer in machine registers for fast access, dispatching opcodes
through a jump table. Hot paths (arithmetic, comparisons, property
access on simple objects) are handled entirely in assembly, with
cold/complex operations calling into C++ helper functions defined
in AsmInterpreter.cpp.
A small build-time tool (gen_asm_offsets) uses offsetof() to emit
struct field offsets as constants consumed by the DSL, ensuring the
assembly stays in sync with C++ struct layouts.
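The offset-generation idea, sketched as a tiny build-time program (struct and
constant names here are illustrative):

    #include <cstddef>
    #include <cstdio>

    struct ExampleContext {
        void* realm;
        void* registers;
    };

    int main()
    {
        // Printed into a constants file that the DSL includes, so the generated
        // assembly can never drift from the real C++ struct layout.
        printf("EXAMPLE_CONTEXT_REALM = %zu\n", offsetof(ExampleContext, realm));
        printf("EXAMPLE_CONTEXT_REGISTERS = %zu\n", offsetof(ExampleContext, registers));
    }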
The interpreter is enabled by default on platforms that support it.
The C++ interpreter can be selected via LIBJS_USE_CPP_INTERPRETER=1.
Currently supported platforms:
- Linux/x86_64
- Linux/aarch64
- macOS/x86_64
- macOS/aarch64
AsmIntGen is a Rust tool that compiles a custom assembly DSL into
native x86_64 or aarch64 assembly (.S files). It reads struct field
offsets from a generated constants file and instruction layouts from
Bytecode.def (via the BytecodeDef crate) to emit platform-specific
code for each bytecode handler.
The DSL provides a portable instruction set with register aliases,
field access syntax, labels, conditionals, and calls. Each backend
(codegen_x86_64.rs, codegen_aarch64.rs) translates this into the
appropriate platform assembly with correct calling conventions
(SysV AMD64, AAPCS64).
Move the Bytecode.def parser, field type info, and layout computation
out of Rust/build.rs into a standalone BytecodeDef crate. This allows
both the Rust bytecode codegen (build.rs) and the upcoming AsmIntGen
tool to share a single source of truth for instruction field offsets
and sizes.
The AsmIntGen directory is excluded from the workspace since it has
its own Cargo.toml and is built separately by CMake.
Move Interpreter::get() and set() from the .cpp file into the header
as inline methods. Make handle_exception(), perform_call(),
perform_call_impl(), and the HandleExceptionResponse enum public so
they can be called by the upcoming assembly interpreter's C++ glue
code. Also add set_running_execution_context() for the same reason.
Replace individual bool bitfields in Object (m_is_extensible,
m_has_parameter_map, m_has_magical_length_property, etc.) with a
single u8 m_flags field and Flag:: constants.
This consolidates 8 scattered bitfields into one byte with explicit
bit positions, making them easy to access from generated assembly
code at a known offset. It also converts the virtual is_function()
and is_ecmascript_function_object() methods to flag-based checks,
avoiding virtual dispatch for these hot queries.
ProxyObject now explicitly clears the IsFunction flag in its
constructor when wrapping a non-callable target, instead of relying
on a virtual is_function() override.
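A minimal sketch of the flag packing (flag names and bit positions are
illustrative):

    #include <cstdint>

    class Object {
    public:
        struct Flag {
            static constexpr uint8_t IsExtensible = 1 << 0;
            static constexpr uint8_t HasParameterMap = 1 << 1;
            static constexpr uint8_t HasMagicalLengthProperty = 1 << 2;
            static constexpr uint8_t IsFunction = 1 << 3;
            // ...
        };

        bool is_extensible() const { return m_flags & Flag::IsExtensible; }
        bool is_function() const { return m_flags & Flag::IsFunction; } // no virtual dispatch

        void set_flag(uint8_t flag, bool value)
        {
            if (value)
                m_flags |= flag;
            else
                m_flags &= ~flag;
        }

    private:
        // One byte at a fixed offset, easy to load from generated assembly.
        uint8_t m_flags { 0 };
    };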
This patch replaces manual execution-context stack resizing with
vm().pop_execution_context() in the inline unwind paths. Apply this
in both exception unwinding and inline return handling so frame
teardown consistently goes through the VM’s canonical pop logic,
reducing the risk of execution-context stack desynchronization.
Replace three identical error structs (ParserError, ScopeError,
ParsedError) with a single shared ParseError type. Since all three
had the same fields (message, line, column), having separate types
only added verbose field-by-field copying at each boundary.
Now errors flow directly from parser/scope collector into
ParsedProgram without conversion.
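Roughly, the shared type (field types are illustrative):

    #include <cstddef>
    #include <string>

    struct ParseError {
        std::string message;
        size_t line { 0 };
        size_t column { 0 };
    };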
Add compile_parsed_module() to RustIntegration, which takes a
RustParsedProgram and a SourceCode (from parse_program with
ProgramType::Module) and compiles it on the main thread with GC
interaction.
Rewrite compile_module() to use the new split functions internally.
Add SourceTextModule::parse_from_pre_parsed() and
JavaScriptModuleScript::create_from_pre_parsed() to allow creating
module scripts from a pre-parsed RustParsedProgram.
This prepares the infrastructure for off-thread module parsing.
Add rust_compile_parsed_module() which takes a ParsedProgram (from
rust_parse_program with type=Module) and compiles it with GC
interaction. This extracts import/export metadata, compiles the
module body to bytecode, and extracts declaration data.
Rewrite rust_compile_module() to delegate to rust_parse_program()
followed by rust_compile_parsed_module() internally, matching the
rust_compile_script() pattern.
Create a SourceCode on the main thread (performing UTF-8 to UTF-16
conversion), then submit parse_program() to the ThreadPool for
Rust parsing on a worker thread. This unblocks the WebContent event
loop during external script loading.
Add Script::create_from_parsed() and
ClassicScript::create_from_pre_parsed() factory methods that take a
pre-parsed RustParsedProgram and a SourceCode, performing only the
GC-allocating compile step on the main thread.
Falls back to synchronous parsing when the Rust pipeline is
unavailable (LIBJS_CPP=1 or LIBJS_COMPARE_PIPELINES=1).
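A hypothetical usage sketch of the intended pattern, parsing on a worker thread
and compiling on the main thread (names and signatures below are simplified
stand-ins, not the actual RustIntegration API):

    #include <future>
    #include <string>

    struct ParsedProgram { /* AST, function table, scope data; no GC references */ };
    struct Script { /* GC-allocated result */ };

    ParsedProgram parse_program(std::u16string source) { return {}; }  // thread-safe, no GC
    Script compile_parsed_script(ParsedProgram program) { return {}; } // needs VM/GC, main thread

    int main()
    {
        std::u16string source = u"1 + 2";

        // Parsing can run on a worker thread because it performs no GC allocations...
        auto pending = std::async(std::launch::async, parse_program, source);

        // ...while the GC-allocating compile step stays on the main thread.
        Script script = compile_parsed_script(pending.get());
    }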
Expose the Rust parse/compile split to C++ callers:
- parse_program(): takes raw UTF-16 data and a ProgramType
parameter (Script or Module). No GC interaction, thread-safe.
- compile_parsed_script(): takes a pre-parsed RustParsedProgram
and a SourceCode, checks for errors, and calls
rust_compile_parsed_script(). Returns a ScriptResult.
Rewrite compile_script() to use the split path internally. The
pipeline comparison logic now gets the AST dump from the
ParsedProgram before compilation consumes it.
Add a ParsedProgram struct that holds the parsed AST, function table,
scope data, and strictness flag without any GC references. This
enables future off-thread parsing since the parse step makes zero
GC allocations.
The type is called ParsedProgram (not ParsedScript) because it will
be used for both scripts and modules. It takes a program_type
parameter (0 = Script, 1 = Module) to handle both cases.
New FFI functions:
- rust_parse_program(): lex, parse, scope analysis (no VM/GC needed)
- rust_compile_parsed_script(): codegen + GDI extraction (needs VM)
- rust_parsed_program_has_errors(): check for parse errors
- rust_parsed_program_take_errors(): report errors via callback
- rust_parsed_program_ast_dump(): lazily generate AST dump string
- rust_free_parsed_program(): free without compiling
Rewrite rust_compile_script() to call rust_parse_program() followed
by rust_compile_parsed_script() internally, preserving the existing
behavior and API.
Move regex compilation out of the parsing hot path. Both the C++ and
Rust parsers now collect raw regex pattern+flags strings during parsing
and batch-compile them after parsing completes.
This is a prerequisite for moving the Rust parser to a background
thread, since LibRegex is thread-unsafe and FFI calls during parsing
prevent parallelization.
Flag validation remains in the parser since it's trivial string
checking with no LibRegex dependency.
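A minimal sketch of the deferral (types and names are illustrative; the actual
LibRegex invocation is elided):

    #include <string>
    #include <utility>
    #include <vector>

    struct PendingRegex {
        std::string pattern;
        std::string flags;
    };

    std::vector<PendingRegex> pending_regexes;

    // During parsing: only record the literal. No LibRegex call is made here,
    // so this step is safe on a background parser thread.
    void on_regex_literal(std::string pattern, std::string flags)
    {
        pending_regexes.push_back({ std::move(pattern), std::move(flags) });
    }

    // After parsing completes, batch-compile everything that was collected.
    void compile_pending_regexes()
    {
        for (auto& regex : pending_regexes) {
            // compile_regex(regex.pattern, regex.flags); // actual LibRegex call elided
            (void)regex;
        }
    }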