Commit Graph

410 Commits

Author SHA1 Message Date
Andreas Kling
73812e12d2 LibJS: Fast path Array.prototype.indexOf on packed arrays
Skip the generic HasProperty and Get loop when indexOf operates on a
simple packed array. In that case every index below length is an own
data property, so a direct scan of the packed indexed property storage
gives the same strict-equality result without the per-element property
lookup ceremony.

Only use the fast path when the current packed storage size still
matches the length captured before fromIndex coercion, since that
coercion can run user code and mutate the receiver. Add coverage for
length and storage mutations during fromIndex coercion.
2026-04-27 08:39:37 +02:00
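The guard described in 73812e12d2 can be illustrated with a minimal sketch. It is not the LibJS code: `index_of_guarded`, the `Vec<f64>` storage, and the `coerce_from_index` callback are hypothetical stand-ins, with the callback modelling fromIndex coercion that may run user code and mutate the receiver.

```rust
/// Returns the first matching index plus a flag saying whether the fast path ran.
/// `storage` stands in for packed indexed storage; all names here are hypothetical.
fn index_of_guarded(
    storage: &mut Vec<f64>,
    needle: f64,
    coerce_from_index: impl FnOnce(&mut Vec<f64>) -> usize,
) -> (Option<usize>, bool) {
    // Capture the length before coercion, as the commit message describes.
    let captured_len = storage.len();
    let from = coerce_from_index(&mut *storage);

    if storage.len() == captured_len {
        // Fast path: storage size still matches, so scan it directly.
        // `==` on f64 never matches NaN, like strict equality.
        let found = (from..captured_len).find(|&i| storage[i] == needle);
        return (found, true);
    }

    // Stand-in for the generic path: indices that no longer exist are skipped.
    let found = (from..captured_len)
        .find(|&i| storage.get(i).map_or(false, |value| *value == needle));
    (found, false)
}
```

When the callback shrinks the vector, the size check fails and the sketch falls back to the slower bounds-checked scan.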
Andreas Kling
e65e85cb8c LibJS: Materialize arguments object for shorthand { arguments }
The parser only set `might_need_arguments_object` when an `arguments`
or `eval` Identifier went through `consume()`, but shorthand object
properties create the reference via `make_identifier()` directly. As
a result `function f() { return { arguments } }` allocated an
`arguments` local, never initialized it, and crashed at runtime when
the property was read.

Fall back to scope-driven detection: if scope analysis allocated a
non-lexical `arguments` local for the function, treat it as a real
arguments-object reference and emit `CreateArguments`. Skip the
fallback when a function declaration named `arguments` claims the
local, since that local belongs to the function, not the arguments
object.

Add a runtime test covering shorthand inside a free function and a
method, plus a regression test for `({ eval } = ...)` to confirm
destructuring assignment doesn't accidentally trigger arguments
materialization.
2026-04-27 08:04:11 +02:00
Andreas Kling
c1bc0cdfa9 LibJS: Allocate local variable indices in source order
The scope collector stored identifier_groups and variables in
HashMaps and then sorted them alphabetically before assigning local
register indices. The sorts existed only because HashMap iteration
order is non-deterministic; alphabetical was a stable choice for
comparing bytecode against the now-removed C++ port.

Switch both maps to indexmap::IndexMap so iteration follows the order
of first reference (= source order), and drop the alphabetical sorts.
Local indices now reflect declaration order, which matches what shows
up in bytecode dumps and is easier to read alongside the source.

Add a focused bytecode test using zebra/yak/aardvark to pin the new
allocation order; existing tests using let/var declarations have
their local indices renumbered to match.
2026-04-27 08:04:11 +02:00
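A rough sketch of first-reference ordering, as described in c1bc0cdfa9. The commit uses indexmap::IndexMap; to stay self-contained this version pairs a std `HashMap` with a `Vec` instead, and `allocate_locals_in_source_order` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Assigns local indices in order of first reference (a hypothetical helper).
fn allocate_locals_in_source_order<'a>(references: &[&'a str]) -> Vec<(&'a str, usize)> {
    let mut index_of: HashMap<&'a str, usize> = HashMap::new();
    let mut ordered: Vec<(&'a str, usize)> = Vec::new();
    for &name in references {
        // Only the first reference to a name allocates an index.
        if !index_of.contains_key(name) {
            let index = ordered.len();
            index_of.insert(name, index);
            ordered.push((name, index));
        }
    }
    ordered
}
```

With references to zebra, yak, zebra, aardvark, the indices come out as zebra 0, yak 1, aardvark 2, rather than the alphabetical aardvark 0, yak 1, zebra 2.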
Andreas Kling
010deec578 LibJS: Build functions_to_initialize in source order
ECMAScript hoisting keeps the LAST function declaration with a given
name. The Rust scope_collector and script GDI extraction implemented
this with a single reverse scan that pushed first-seen entries, which
left the resulting list in REVERSE source order. The C++ side then
iterated `m_functions_to_initialize.in_reverse()` to undo that.

Switch the Rust side to a two-pass forward scan that records the last
position per name and emits entries in source order, and drop the
matching `.in_reverse()` calls in Script.cpp and AbstractOperations.cpp.
Same hoisting semantics; NewFunction emission and global property
iteration order now follow the source.

The HashMap that tracks last positions is keyed on `SharedUtf16String`,
so each insert is a refcount bump on the AST's existing Rc instead of
a deep `Vec<u16>` clone.

Add bytecode tests at script and nested-function scope that exercise
multiple declarations and a duplicate name to pin the new ordering.
2026-04-27 08:04:11 +02:00
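The two-pass forward scan from 010deec578 might look roughly like the following. This is a simplified sketch over plain string names; the real code keys on `SharedUtf16String`, and `last_declarations_in_source_order` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Keeps the last declaration for each name and returns the survivors in
/// source order, as (position, name) pairs.
fn last_declarations_in_source_order<'a>(names: &[&'a str]) -> Vec<(usize, &'a str)> {
    // Pass 1: record the last position at which each name is declared.
    let mut last_position: HashMap<&'a str, usize> = HashMap::new();
    for (position, &name) in names.iter().enumerate() {
        last_position.insert(name, position);
    }

    // Pass 2: walk forward again and emit only the winning declarations.
    let mut result = Vec::new();
    for (position, &name) in names.iter().enumerate() {
        if last_position[name] == position {
            result.push((position, name));
        }
    }
    result
}
```

Because both passes run forward, the output needs no reversal on the consuming side.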
Andreas Kling
30394ece8d LibJS: Use natural source positions for parser-synthesized identifiers
The Rust parser used to copy several "rule_start"-derived positions
from the C++ implementation: every identifier inside a binding pattern
inherited the pattern's `[`/`{` position, every property identifier
after `.` inherited the period's position, every spread element
inherited the surrounding `[`/`{` position, and identifier-name
property keys inherited the object/class start position. This was
useful while comparing bytecode against the C++ port; with the C++
side gone, those quirks just hide the actual source positions in
source maps and devtools.

Drop the dedicated `binding_pattern_start` parser field and the
`ident_pos_override` parameter on `parse_property_key`, and capture
each identifier's own start position at the consume site.

Add an AST snapshot test that pins the new per-identifier positions
for object, array, nested, and parameter binding patterns.
2026-04-27 08:04:11 +02:00
Andreas Kling
cec0be6f3d LibJS: Replace in_property_key_context flag with explicit consume helper
The parser used to suppress the arguments/eval reference check via a
state flag that was set during the entire `parse_property_key` call.
That was over-broad: identifiers inside a computed property key like
`{ [arguments]: 1 }` are real references, but the flag silenced their
check too, leaving the function unmarked as needing the arguments
object. Reading the resulting property at runtime crashed.

Replace the flag with a `consume_property_key_token()` method used at
the specific consume sites for the property key token itself, so the
suppression is narrow. Inner consumes inside computed keys now go
through regular `consume()` and run the check normally.

Add a focused AST snapshot test covering plain, shorthand, computed,
binding-pattern, and method-name property-key cases.
2026-04-27 08:04:11 +02:00
Timothy Flynn
12d9aaebb3 LibJS: Remove gc from the global object
No other engine defines this function, so it is an observable difference
of our engine. This traces back to the earliest days of LibJS.

We now define `gc` in just the test-js and test262 runners.
2026-04-24 18:36:23 +02:00
Aliaksandr Kalenik
bfbc3352b5 LibJS: Extend Array.prototype.shift() fast path to holey arrays
indexed_take_first() already memmoves elements down for both Packed and
Holey storage, but the caller at ArrayPrototype::shift() only entered
the fast path for Packed arrays. Holey arrays fell through to the
spec-literal per-element loop (has_property / get / set /
delete_property_or_throw), which is substantially slower.

Add a separate Holey predicate with the additional safety checks the
spec semantics require: default_prototype_chain_intact() (so
HasProperty on a hole doesn't escape to a poisoned prototype) and
extensible() (so set() on a hole slot doesn't create a new own
property on a non-extensible object). The existing Packed predicate
is left unchanged -- packed arrays don't need these checks because
every index in [0, size) is already an own data property.

Allows us to fail at Cloudflare Turnstile much faster!
2026-04-23 21:47:21 +02:00
Andreas Kling
eb9432fcb8 LibJS: Preserve source positions in bytecode source maps
Carry full source positions through the Rust bytecode source map so
stack traces and other bytecode-backed source lookups can use them
directly.

This keeps exception-heavy paths from reconstructing line and column
information through SourceCode::range_from_offsets(), which can spend a
lot of time building SourceCode's position cache on first use.

We're trading some space for time here, but I believe it's worth it at
this stage, as this saves ~250ms of main thread time while loading
https://x.com/ on my Linux machine. :^)

Reading the stored Position out of the source map directly also exposed
two things masked by the old range_from_offsets() path: a latent
off-by-one in Lexer::new_at_offset() (its consume() bumped line_column
past the character at offset; only synthesize_binding_pattern() hit it),
and a (1,1) fallback in range_from_offsets() that fired whenever the
queried range reached EOF. Fix the lexer, then rebaseline both the
bytecode dump tests (no more spurious "1:1") and the destructuring AST
tests (binding-pattern identifiers now report their real columns).
2026-04-22 22:34:54 +02:00
Andreas Kling
51758f3022 LibJS: Make bytecode register allocator O(1)
Generator::allocate_register used to scan the free pool to find the
lowest-numbered register and then Vec::remove it, making every
allocation O(n) in the size of the pool. When loading https://x.com/
on my Linux machine, we spent ~800ms in this function alone!

This logic only existed to match the C++ register allocation ordering
while transitioning from C++ to Rust in the LibJS compiler, so now
we can simply get rid of it and make it instant. :^)

So drop the "always hand out the lowest-numbered free register" policy
and use the pool as a plain LIFO stack. Pushing and popping the back
of the Vec are both O(1), and peak register usage is unchanged since
the policy only affects which specific register gets reused, not how
aggressively.
2026-04-21 13:59:55 +02:00
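As an illustration of the LIFO policy in 51758f3022, here is a minimal sketch of a register pool used as a plain stack. `RegisterPool` and its methods are hypothetical names, not the actual Generator API.

```rust
/// A free-register pool used as a LIFO stack (hypothetical sketch).
struct RegisterPool {
    free: Vec<u32>,
    next_fresh: u32,
}

impl RegisterPool {
    fn new() -> Self {
        Self { free: Vec::new(), next_fresh: 0 }
    }

    /// O(1): reuse the most recently released register, else mint a new one.
    fn allocate(&mut self) -> u32 {
        if let Some(register) = self.free.pop() {
            return register;
        }
        let register = self.next_fresh;
        self.next_fresh += 1;
        register
    }

    /// O(1): push onto the back of the pool.
    fn release(&mut self, register: u32) {
        self.free.push(register);
    }

    /// Number of distinct registers ever handed out.
    fn peak(&self) -> u32 {
        self.next_fresh
    }
}
```

A new register is only minted when the pool is empty, so the peak count matches a lowest-numbered-first policy; only which register gets reused differs.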
Andreas Kling
e5d4c5cce8 LibJS: Check TDZ state in asm environment calls
GetCalleeAndThisFromEnvironment treated a binding as initialized when
its value slot was not <empty>. Declarative bindings do not encode TDZ
in that slot, though: uninitialized bindings keep a separate initialized
flag and their value starts as undefined.

That let the first slow-path TDZ failure populate the environment cache,
then a second call at the same site reused the cached coordinate and
turned the required ReferenceError into a TypeError from calling
undefined.

Check Binding.initialized in the asm fast path instead and cover the
cached second-hit case with a regression test.
2026-04-20 11:23:34 +02:00
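The distinction e5d4c5cce8 draws between a binding's value slot and its initialized flag can be sketched as follows. The types are hypothetical simplifications, not the LibJS `Binding` layout.

```rust
#[derive(Debug, PartialEq)]
enum CallError {
    ReferenceError,
    TypeError,
}

#[derive(Clone, Copy)]
enum Value {
    Undefined,
    Function(u32), // hypothetical function id
}

struct Binding {
    value: Value,
    initialized: bool,
}

/// Checks the TDZ flag first; the value slot alone cannot encode TDZ because
/// an uninitialized binding's value starts as undefined.
fn resolve_callee(binding: &Binding) -> Result<u32, CallError> {
    if !binding.initialized {
        return Err(CallError::ReferenceError);
    }
    match binding.value {
        Value::Function(id) => Ok(id),
        Value::Undefined => Err(CallError::TypeError),
    }
}
```

Repeated lookups of an uninitialized binding keep producing `ReferenceError`, which is the cached second-hit behaviour the regression test pins.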
Timothy Flynn
10ce847931 LibJS+LibUnicode: Use LibUnicode as appropriate for lexing JavaScript
Now that LibUnicode exports its character type APIs in Rust, we can use
them to lex identifiers and whitespace.

Fixes #8870.
2026-04-19 10:39:26 +02:00
Andreas Kling
583fa475fb LibJS: Call RawNativeFunction directly from asm Call
The asm interpreter already inlines ECMAScript calls, but builtin calls
still went through the generic C++ Call slow path even when the callee
was a plain native function pointer. That added an avoidable boundary
around hot builtin calls and kept asm from taking full advantage of the
new RawNativeFunction representation.

Teach the asm Call handler to recognize RawNativeFunction, allocate the
callee frame on the interpreter stack, copy the call-site arguments,
and jump straight to the stored C++ entry point.
NativeJavaScriptBackedFunction and other non-raw callees keep falling
through to the existing C++ slow path unchanged.
2026-04-15 15:57:48 +02:00
Andreas Kling
8a9d5ee1a1 LibJS: Separate raw and capturing native functions
NativeFunction previously stored an AK::Function for every builtin,
even when the callable was just a plain C++ entry point. That mixed
together two different representations, made simple builtins carry
capture storage they did not need, and forced the GC to treat every
native function as if it might contain captured JS values.

Introduce RawNativeFunction for plain NativeFunctionPointer callees
and keep AK::Function-backed callables on a CapturingNativeFunction
subclass. Update the straightforward native registrations in LibJS
and LibWeb to use the raw representation, while leaving exported
Wasm functions on the capturing path because they still capture
state.

Wrap UniversalGlobalScope's byte-length strategy lambda in
Function<...> explicitly so it keeps selecting the capturing
NativeFunction::create overload.
2026-04-15 15:57:48 +02:00
Timothy Flynn
4b1ecbc9df LibJS+LibUnicode: Update icu4x's calendar module to 2.2.0
First: We now pin the icu4x version to an exact number. Minor version
upgrades can result in noisy deprecation warnings and API changes which
cause tests to fail. So let's pin the known-good version exactly.

This patch updates our Rust calendar module to use the new APIs. This
initially caused some test failures due to the new Date::try_new API
(which is the recommended replacement for Date::try_new_from_codes)
having quite a limited year range of +/-9999. So we must use other
APIs (Date::try_from_fields and calendrical_calculations::gregorian)
to avoid these limits.

http://github.com/unicode-org/icu4x/blob/main/CHANGELOG.md#icu4x-22
2026-04-14 18:12:31 -04:00
Andreas Kling
517812647a LibJS: Pack asm Call shared-data metadata
Pack the asm Call fast path metadata next to the executable pointer
so the interpreter can fetch both values with one paired load. This
removes several dependent shared-data loads from the hot path.

Keep the executable pointer and packed metadata in separate registers
through `this` binding so the fast path can still use the paired-load
layout after any non-strict `this` adjustment.

Lower the packed metadata flag checks correctly on x86_64 as well.
Those bits now live above bit 31, so the generator uses bt for single-
bit high masks and covers that path with a unit test.

Add a runtime test that exercises both object and global `this` binding
through the asm Call fast path.
2026-04-14 12:37:12 +02:00
Andreas Kling
8c7c46f8ec LibJS: Inline asm interpreter JS Call fast path
Handle inline-eligible JS-to-JS Call directly in asmint.asm instead
of routing the whole operation through AsmInterpreter.cpp.

The asm handler now validates the callee, binds `this` for the
non-allocating cases, reserves the callee InterpreterStack frame,
populates the ExecutionContext header and Value tail, and enters the
callee bytecode at pc 0.

Keep the cases that need NewFunctionEnvironment() or sloppy `this`
boxing on a narrow helper that still builds an inline frame. This
preserves the existing inline-call semantics for promise-job ordering,
receiver binding, and sloppy global-this handling while keeping the
common path in assembly.

Add regression coverage for closure-capturing callees, sloppy
primitive receivers, and sloppy undefined receivers.
2026-04-14 08:14:43 +02:00
Andreas Kling
12a916d14a LibJS: Handle AsmInt returns without C++ helpers
Handle Return and End entirely in AsmInt when leaving an inline frame.
The handlers now restore the caller, update the interpreter stack
bookkeeping directly, and bump the execution generation without
bouncing through AsmInterpreter.cpp.

Add WeakRef tests that exercise both inline Return and inline End
so this path stays covered.
2026-04-14 08:14:43 +02:00
Andreas Kling
c301a21960 LibJS: Skip preserving zero-argument call callees
The callee and this-value preservation copies only matter while later
argument expressions are still being evaluated. For zero-argument calls
there is nothing left to clobber them, so we can keep the original
operand and let the interpreter load it directly.

This removes the hot Mov arg0->reg pattern from zero-argument local
calls and reduces register pressure.
2026-04-13 18:29:43 +02:00
Andreas Kling
3a08f7b95f LibJS: Drop dead entry GetLexicalEnvironment loads
Teach the Rust bytecode generator to treat the synthetic entry
GetLexicalEnvironment as a removable prologue load.

We still model reg4 as the saved entry lexical environment during
codegen, but assemble() now deletes that load when no emitted
instruction refers to the saved environment register. This keeps the
semantics of unwinding and environment restoration intact while letting
empty functions and other simple bodies start at their first real
instruction.
2026-04-13 18:29:43 +02:00
Andreas Kling
9af5508aef LibJS: Split inline frames from execution context stack
Keep JS-to-JS inline calls out of m_execution_context_stack and walk
the active stack from the running execution context instead. Base
pushes now record the previous running context so duplicate
TemporaryExecutionContext pushes and host re-entry still restore
correctly.

This keeps the fast JS-to-JS path off the vector without losing GC
root collection, stack traces, or helpers that need to inspect the
active execution context chain.
2026-04-13 18:29:43 +02:00
Andreas Kling
2ca7dfa649 LibJS: Move bytecode interpreter state to VM
The bytecode interpreter only needed the running execution context,
but still threaded a separate Interpreter object through both the C++
and asm entry points. Move that state and the bytecode execution
helpers onto VM instead, and teach the asm generator and slow paths to
use VM directly.
2026-04-13 18:29:43 +02:00
Andreas Kling
3e18136a8c LibJS: Add a String.fromCharCode builtin opcode
Specialize only the fixed unary case in the bytecode generator and let
all other argument counts keep using the generic Call instruction. This
keeps the builtin bytecode simple while still covering the common fast
path.

The asm interpreter handles int32 inputs directly, applies the ToUint16
mask in-place, and reuses the VM's cached ASCII single-character
strings when the result is 7-bit representable. Non-ASCII single code
unit results stay on the dedicated builtin path via a small helper, and
the dedicated slow path still handles the generic cases.
2026-04-12 19:15:50 +02:00
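The int32 handling in 3e18136a8c boils down to a ToUint16 mask followed by an ASCII check. A small sketch, with hypothetical function names rather than the asm interpreter's actual helpers:

```rust
/// ToUint16 for an int32 input: keep the low 16 bits.
fn to_uint16(value: i32) -> u16 {
    ((value as u32) & 0xFFFF) as u16
}

/// Whether the resulting code unit is 7-bit representable, so a cached
/// single-character ASCII string could be reused.
fn is_cached_ascii(code_unit: u16) -> bool {
    code_unit < 0x80
}
```

For example, both 65 and 0x10041 map to the code unit 0x41 ("A"), while -1 maps to 0xFFFF and takes the non-ASCII path.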
Andreas Kling
ce8f92cf6a LibJS: Reuse cached ASCII strings for substrings
Teach the PrimitiveString substring creation path to return the
VM's preallocated single-character ASCII strings instead of always
allocating a deferred Substring.

This keeps one-code-unit ASCII substrings on the same fast path as
direct string creation, including callers like charAt and indexed
string property access.
2026-04-12 19:15:50 +02:00
Andreas Kling
7bc40bd54a LibJS: Add a charAt builtin bytecode fast path
Tag String.prototype.charAt as a builtin and emit a dedicated
bytecode instruction for non-computed calls.

The asm interpreter can then stay on the fast path when the
receiver is a primitive string with resident UTF-16 data and the
selected code unit is ASCII. In that case we can return the VM's
cached empty or single-character ASCII string directly.
2026-04-12 19:15:50 +02:00
RubenKelevra
a1ae402bb9 LibJS: Make folded non-decimal prefix parsing UTF-8-safe
Folded StringToNumber() and StringToBigInt() detected non-decimal
prefixes by slicing the string at byte offset 2. On UTF-8 input this
could split at a non-character boundary and panic.

To prevent this, we replace the byte-based split with ASCII prefix
stripping and preserve rejection of empty suffixes such as "0x", "0o",
and "0b" explicitly before parsing the remaining digits.

This makes non-decimal prefix folding UTF-8-safe and preserves the
expected invalid-result behavior for empty prefixed literals.

Tests:

Add regression coverage for folded StringToNumber() and StringToBigInt()
non-decimal prefix handling to validate the UTF-8 safety fix as
'string-to-number-and-bigint-non-decimal-prefixes.js'.

These tests ensure empty suffixes like "0x", "0o", and "0b" and
other invalid prefixed forms stay invalid, while valid prefixed
literals continue to be accepted.

Since we removed a byte-index split in folded
StringToNumber()/StringToBigInt() coercion that could panic when byte
index 2 landed inside a multi-byte UTF-8 scalar, we add regression
tests for representative panic-shape inputs to ensure these coercions
now return invalid results instead of crashing as
'string-to-number-and-bigint-utf8-boundary.js'
2026-04-12 17:36:51 +02:00
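To illustrate the fix in a1ae402bb9, here is a sketch of ASCII prefix stripping with `str::strip_prefix`, which never splits inside a multi-byte character. `parse_prefixed_integer` is a hypothetical name, and returning `Option<u64>` is a simplification of the real folded StringToNumber() and StringToBigInt() results.

```rust
/// Parses "0x", "0o" and "0b" prefixed digits; returns None for anything
/// else, including empty suffixes. Values beyond u64 also yield None here.
fn parse_prefixed_integer(text: &str) -> Option<u64> {
    const PREFIXES: [(&str, u32); 6] = [
        ("0x", 16), ("0X", 16),
        ("0o", 8), ("0O", 8),
        ("0b", 2), ("0B", 2),
    ];
    for &(prefix, radix) in PREFIXES.iter() {
        // strip_prefix compares whole prefixes, so no byte-offset slicing.
        if let Some(digits) = text.strip_prefix(prefix) {
            // Reject "0x", "0o", "0b" and any non-digit (this also blocks
            // the leading '+' that from_str_radix would otherwise accept).
            if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
                return None;
            }
            return u64::from_str_radix(digits, radix).ok();
        }
    }
    None
}
```

An input such as "0é" is the panic shape the commit describes: byte offset 2 lands inside the two-byte "é", so `&text[..2]` would panic, whereas the sketch simply returns `None`.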
Shannon Booth
ba59640ab2 LibRegex: Avoid hitting backtrack limit for bounded grouped repetitions
Unrolling a bounded quantifier {min,max} into (max-min) optional Split
chains lets the backtracker explore O(2^n) paths, which quickly
exhausts the backtrack limit for large bounds.

Fix this by compiling the optional tail via a RepeatStart/RepeatCheck
counted loop when the atom is known to be non-zero-width. The loop
is safe to use without a progress check precisely because the atom
cannot match empty.

This required making atom_can_be_zero_width recursive into group bodies:
previously it conservatively returned true for all Group and
NonCapturingGroup atoms, so the non-zero-width guard could never fire
for grouped subexpressions.

The old lowering triggered "Regular expression backtrack limit exceeded"
for patterns like /'(?:\\(?:\r\n|[\s\S])|[^'\\\r\n]){0,32}'/, causing
inputs that should match normally (or return null) to throw instead.

Fixes syntax highlighting of the C++ API on https://blend2d.com
2026-04-11 18:43:48 +02:00
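The recursive zero-width check from ba59640ab2 can be sketched over a toy AST. The `Atom` enum below is a hypothetical simplification and does not mirror the LibRegex node types.

```rust
/// A toy regex AST (hypothetical; far smaller than the real one).
enum Atom {
    Char(char),
    Optional(Box<Atom>),
    Group(Vec<Atom>),       // a sequence of atoms
    Alternation(Vec<Atom>), // a choice between atoms
}

/// Recurses into group bodies instead of conservatively answering `true`.
fn can_be_zero_width(atom: &Atom) -> bool {
    match atom {
        Atom::Char(_) => false,
        Atom::Optional(_) => true,
        // A sequence matches empty only if every element can.
        Atom::Group(sequence) => sequence.iter().all(can_be_zero_width),
        // A choice matches empty if any alternative can.
        Atom::Alternation(alternatives) => alternatives.iter().any(can_be_zero_width),
    }
}
```

Only when this returns `false` for the repeated atom is a counted loop without a progress check safe, which is the guard the commit message describes.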
Andreas Kling
0969a5cd9a LibJS: Use Substring for legacy regexp statics
Keep the legacy regexp static properties backed by PrimitiveString
values instead of eagerly copied Utf16Strings. lastMatch, leftContext,
rightContext, and $1-$9 now materialize lazy Substrings from the
original match input when accessed.

Keep RegExp.input as a separate slot from the match source so manual
writes do not rewrite the last match state. Add coverage for that
behavior and for rope-backed UTF-16 inputs.
2026-04-11 00:35:36 +02:00
Andreas Kling
8b8136b480 LibJS: Use Substring in Intl.Segmenter
Keep the primitive string that segment() creates alongside the UTF-16
buffer used by LibUnicode. Segment data objects can then return lazy
Substring instances for "segment" and reuse the original
PrimitiveString for "input" instead of copying both strings.

Add a rope-backed UTF-16 segmenter test that exercises both
containing() and iterator results.
2026-04-11 00:35:36 +02:00
Andreas Kling
a9bedc5a8d LibJS: Use Substring for string slices
Route the obvious substring-producing string operations through the
new PrimitiveString substring factory. Direct indexing, at(), charAt(),
slice(), substring(), substr(), and the plain-string split path can now
return lazy JS::Substring values backed by the original string.

Add runtime coverage for rope-backed string operations so these lazy
string slices stay exercised across both ASCII and UTF-16 inputs.
2026-04-11 00:35:36 +02:00
Andreas Kling
f6f791969d LibJS: Use Substring for regexp results
Return JS::Substring objects from the builtin regexp exec and split
paths instead of eagerly copying UTF-16 slices into new strings.
Matches, captures, and split pieces can now point back at the original
input until someone asks for the string contents.

Add focused runtime coverage for UTF-16 captures and regex split
captures so these lazy slices stay exercised.
2026-04-11 00:35:36 +02:00
Andreas Kling
1182250414 LibJS: Add deferred PrimitiveString substrings
Introduce JS::Substring as a lazily materialized PrimitiveString
variant that stores an originating string plus a UTF-16 offset and
length. This makes substring creation cheap while still reifying to
a normal string when character data is requested.

Track which short strings actually live in the VM caches so lazily
resolved ropes and substrings do not evict unrelated cached strings
when they are finalized. Add focused unit tests for nested ranges,
rope-backed substrings, surrogate boundaries, and cache behavior.
2026-04-11 00:35:36 +02:00
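The idea behind 1182250414, storing an originating string plus a UTF-16 offset and length, can be sketched like this. `LazySubstring` is a hypothetical type using `Rc` for sharing; the real JS::Substring is a garbage-collected PrimitiveString variant.

```rust
use std::rc::Rc;

/// A substring that shares its source buffer until contents are requested.
struct LazySubstring {
    source: Rc<[u16]>,
    offset: usize,
    length: usize,
}

impl LazySubstring {
    /// Cheap to create: a refcount bump plus two integers.
    fn new(source: &Rc<[u16]>, offset: usize, length: usize) -> Option<Self> {
        let end = offset.checked_add(length)?;
        if end > source.len() {
            return None;
        }
        Some(Self { source: Rc::clone(source), offset, length })
    }

    /// Copies the code units out only when the string data is needed.
    fn materialize(&self) -> String {
        String::from_utf16_lossy(&self.source[self.offset..self.offset + self.length])
    }
}
```

Creating the substring does not copy any code units; the copy happens in `materialize`, which is where a range that splits a surrogate pair would show up as a replacement character in this lossy sketch.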
Andreas Kling
879ac36e45 LibJS: Cache stable for-in iteration at bytecode sites
Cache the flattened enumerable key snapshot for each `for..in` site and
reuse a `PropertyNameIterator` when the receiver shape, dictionary
generation, indexed storage kind and length, prototype chain
validity, and magical-length state still match.

Handle packed indexed receivers as well as plain named-property
objects. Teach `ObjectPropertyIteratorNext` in `asmint.asm` to return
cached property values directly and to fall back to the slow iterator
logic when any guard fails.

Treat arrays' hidden non-enumerable `length` property as a visited
name for for-in shadowing, and include the receiver's magical-length
state in the cache key so arrays and plain objects do not share
snapshots.

Add `test-js` and `test-js-bytecode` coverage for mixed numeric and
named keys, packed receiver transitions, re-entry, iterator reuse, GC
retention, array length shadowing, and same-site cache reuse.
2026-04-10 15:12:53 +02:00
Andreas Kling
4c1e2222df LibJS: Fast-path safe writes into holey array holes
Teach the asm PutByValue path to materialize in-bounds holey array
elements directly when the receiver is a normal extensible Array with
the default prototype chain and no indexed interference. This avoids
bouncing through generic property setting while preserving the lazy
holey length model.

Keep the fast path narrow so inherited setters, inherited non-writable
properties, and non-extensible arrays still fall back to the generic
semantics. Add regression coverage for those cases alongside the large
holey array stress tests.
2026-04-09 20:06:42 +02:00
Andreas Kling
da1c943161 LibJS: Make holey array lengths lazy
Treat setting a large array length as a logical length change instead of
forcing dictionary indexed storage or materializing every hole up front.
This keeps dense fills on Array(length) on the holey indexed path and
only falls back to sparse storage when later writes actually create a
large realized gap.

The asm indexed get/put fast paths assumed holey arrays always had a
materialized backing store. Guard those paths with a capacity check so
lazy holey arrays fall back safely until an index has been realized.

Add regression coverage for very large holey arrays and for densely
filling a large holey array after pre-sizing it with Array(length).
2026-04-09 20:06:42 +02:00
mikiubo
afc0f8b495 LibRegex: Use Unicode ID_Start/ID_Continue for named group names
Switch to LibUnicode’s ICU-backed functions.
Keep the explicit checks for '$', '_', U+200C, and U+200D that
ECMAScript requires on top of the Unicode properties.

Add test coverage for the newly accepted case, plus regression guards
for cases that must continue to work.
2026-04-08 07:31:54 -04:00
Shannon Booth
f27bc38aa7 Everywhere: Remove ShadowRealm support
The proposal has not seemed to progress for a while, and there is
an open issue about module imports which breaks HTML integration.
While we could probably make an AD-HOC change to fix that issue,
it is deep enough in the JS engine that I am not particularly
keen on making that change.

Until other browsers begin to make positive signals about shipping
ShadowRealms, let's remove our implementation for now.

There is still some cleanup that can be done with regard to the
HTML integration, but there are a few more items that need to be
untangled there.
2026-04-05 13:57:58 +02:00
mikiubo
f84edd8173 LibRegex: Fix legacy backreference fallback digit 8 or 9
When a multi-digit decimal escape like \81 exceeds the total capture
group count in non-Unicode mode, the parser falls back to legacy octal
reinterpretation. However, digits '8' and '9' are not valid in octal
(base 8), so passing them to parse_legacy_octal() caused an unwrap()
panic on None from char::to_digit(8).
Treat '8' and '9' as literal characters in the fallback path, matching
the behavior already present for the non-backreference.
2026-04-04 12:12:00 +02:00
Shannon Booth
adabc5cedb LibJS: Handle empty UTF-16 strings in Rust FFI
Treat zero-length UTF-16 slices from Rust as empty views at the FFI
boundary instead of assuming a non-null backing pointer.

Add a regression test which crashed before these changes. Fixes
a crash loading github.com/ladybirdbrowser/ladybird.
2026-03-31 22:33:36 +02:00
Andreas Kling
201e615aad LibRegex: Preserve set-op direction in backward /v matches
Unicode-set intersection and subtraction always lowered their
post-consumption checks as lookbehinds. That is correct while the
outer matcher runs forward, but inside lookbehind the consumed text
sits to the right of the current position, so the checks must flip
to lookahead instead. Because we always looked left, patterns like
`(?<=[[^A-Z]--[A-Z]])\P{N}` and the reported fuzz case missed
matches whenever the character before the consumed one changed the
set-operation result.

Preserve the surrounding match direction when compiling those
checks, and add coverage for reduced subtraction and intersection
cases plus the original regression.
2026-03-31 15:59:04 +02:00
Andreas Kling
e0de4ef33e LibRegex: Reject negated /v classes that contain strings
Negated unicode-set classes are only valid when every member is
single-code-point. We already rejected direct string-valued members
such as `\q{ab}` and `\p{RGI_Emoji_Flag_Sequence}` inside `[^...]`,
but nested class-set operands could still smuggle them through, so
patterns like `[^[[\p{Emoji_Keycap_Sequence}]]]` and the reported
fuzzed literal compiled instead of throwing.

Validate nested class-set expressions after parsing and reject only the
negated `/v` classes whose resulting multi-code-point strings are still
non-empty. Track the exact string members contributed by string
literals, string properties, and nested classes so intersections and
subtractions can eliminate them before the negated-class check runs.

Add constructor and literal coverage for the reduced nested-string
cases, the original regression, and valid negated set operations that
remove every string member.
2026-03-31 15:59:04 +02:00
Andreas Kling
6347827eb8 LibJS: Retry Unicode low-surrogate lastIndex positions
RegExpBuiltinExec used to snap any Unicode lastIndex that landed on a
low surrogate back to the start of the pair. That matched `/😀/u`,
but it skipped valid empty matches when the original low-surrogate
position was itself matchable, such as `/\p{Script=Cyrillic}?(?<!\D)/v`
on `"A😘"` and the longer fuzzed global case.

Try the snapped position first, then retry the original lastIndex when
the snapped match fails. Only keep that second result when it is empty
at the original low-surrogate position, so consuming /u and /v matches
still cannot split a surrogate pair. In the Rust VM, treat backward
Unicode matches that start between surrogate halves as having no
complete code point to their left, which matches V8's lookbehind
behavior for those positions.

Add reduced coverage for both low-surrogate exec cases, the original
global match count regression, and the consuming-match retry regression.
2026-03-31 15:59:04 +02:00
Andreas Kling
33f9d464de LibRegex: Preserve negated class direction in lookbehind
Compile the synthetic assertion for negated classes in the same
direction as the surrounding matcher. We were hardcoding a
lookahead for `[^...]`, so lookbehind checked the wrong side of the
current position and missed valid `/v` matches such as
`(?<=[^\p{Emoji}])2`.

Apply the same fix to unicode set classes, since they use the same
negative-lookaround-plus-`AnyChar` lowering for complements. Add
reduced `RegExp.js` coverage for both `[^\p{Emoji}]` and
`[[^\p{Emoji}]]` in lookbehind, plus the original complex `/gv`
regression.
2026-03-31 15:59:04 +02:00
Andreas Kling
c828c87408 LibRegex: Leave suffix minima for repeated simple loops
Repeated simple loops like "a+".repeat(100) compile to a chain of
greedy loop instructions. When one loop failed, the VM only knew how
to give back one character at a time unless the next instruction was a
literal char, so V8's regexp-fallback case ran into the backtrack
limit instead of finding the obvious match.

When a greedy simple loop is followed by more loops for the same
matcher, sum their minimum counts and backtrack far enough to leave the
missing suffix in one step. If that suffix is already available to the
right, still give back one character so the VM makes progress instead
of reusing the same greedy state forever.

The RegExp runtime test now covers the Chromium regexp-fallback case
through exec(), global exec(), and both replace() paths, plus bounded
same-matcher chains where the suffix minimum is partly missing or
already available.
2026-03-31 15:59:04 +02:00
Andreas Kling
f24bdb9f94 LibRegex: Honor wrapped start anchors in search hints
The VM only marked patterns as anchored when the first real instruction
was AssertStart. That missed anchors hidden behind capture setup or a
leading positive lookahead, so patterns like /(^bar)/ and /(?=^bar)\w+/
fell back to whole-subject scanning.

Teach the hint analysis to see through the non-consuming wrappers we
emit ahead of a leading ^, but still run the literal prefilters before
anchored and sticky VM attempts. Missing required literals should stay
cheap no-matches instead of running the full backtracking VM and
raising the step limit.

The RegExp runtime test now covers the Chromium ascii-regexp-subject
case on a long ASCII input and anchored, sticky, and global no-match
cases where the required literal is absent.
2026-03-31 15:59:04 +02:00
Andreas Kling
c12647fc37 LibRegex: Clamp braced quantifier bounds to 2^31 - 1
Browsers clamp braced quantifier bounds above 2^31 - 1 before
checking whether {min,max} is in order. The parser still kept values
up to u32::MAX, so patterns like {2147483648,2147483647} were
rejected even though both bounds should collapse to the same limit.

Clamp parsed braced quantifier bounds to 2^31 - 1 as they are read.
This keeps the existing acceptance of huge exact and open-ended
quantifiers and makes the constructor and regex literal paths agree
with other engines on the out-of-order edge cases.

The RegExp runtime and syntax tests now cover accepted huge
quantifiers, clamped order validation, and huge literal forms. The
reported constructor and literal cases also match other engines.
2026-03-31 15:59:04 +02:00
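The ordering of the two steps in c12647fc37, clamp first and then validate, is the whole point, so a short sketch may help. `clamp_and_check` and its error string are hypothetical, not the LibRegex parser API.

```rust
/// Upper limit for braced quantifier bounds: 2^31 - 1.
const MAX_BOUND: u64 = i32::MAX as u64;

/// Clamps both bounds as they are read, then checks that they are in order.
fn clamp_and_check(raw_min: u64, raw_max: u64) -> Result<(u64, u64), &'static str> {
    let min = raw_min.min(MAX_BOUND);
    let max = raw_max.min(MAX_BOUND);
    if min > max {
        return Err("quantifier bounds out of order");
    }
    Ok((min, max))
}
```

With this order, `{2147483648,2147483647}` collapses to equal bounds and is accepted, while a genuinely reversed pair such as `{5,3}` is still rejected.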
Andreas Kling
87b22d0c04 LibRegex: Compare set operands by exact string length
Unicode set intersection and subtraction were compiled by matching one
operand and then checking the others with lookbehind. That let a
longer string operand reject a shorter match whenever the longer
string happened to end at the same position.

Group unicode set operands by exact match length and compile each
length class separately, longest first. This keeps longest-match
semantics for unions while making intersection and subtraction compare
only strings of the same length. The new RegExp runtime cases cover
both the reported [a-z]--\q{abc} regression and the related
intersection/subtraction mismatches, and they now agree with V8.
2026-03-31 15:59:04 +02:00
Andreas Kling
50b137f527 LibJS: Reject mixed surrogate forms in RegExp names
Reject surrogate pairs in named group names unless both halves come
from the same raw form. A literal surrogate half was being
normalized into \uXXXX before LibRegex parsed the pattern, which let
mixed literal and escaped forms sneak through.

Validate surrogate handling on the UTF-16 pattern before
normalization, but only treat \k<...> as a named backreference when
the parser would do that too. Legacy regexes without named groups
still use \k as an identity escape, so their literal text must not be
rejected by the pre-scan.

Add runtime and syntax tests for the mixed forms, the valid literal,
fixed-width, and braced escape cases, and the legacy \k literals.
2026-03-31 15:59:04 +02:00
Undefine
b9e5c0e98c Meta: Replace all use of LADYBIRD_PROJECT_ROOT with LADYBIRD_SOURCE_DIR
Those two are equivalent so no need to have two different variables.
2026-03-29 13:59:11 -06:00
Andreas Kling
1f413da8e8 LibRegex: Anchor sticky matches at lastIndex
Sticky regular expressions were still using the generic forward-search
paths inside LibRegex and only enforcing the lastIndex check back in
LibJS after a match had already been found. That made tokenizer-style
sticky patterns spend most of their time scanning for later matches
that would be thrown away.

Route sticky exec() and test() through an anchored VM entry point that
runs exactly once at the requested start position while keeping the
existing literal-hint pruning. Add focused test-js coverage for sticky
literals, alternations, classes, quantifiers, and WebIDL-style token
patterns.
2026-03-29 16:06:57 +02:00