eliott/ladybird

mirror of https://github.com/LadybirdBrowser/ladybird synced 2026-05-03 04:52:06 +02:00

Commit Graph

Author

SHA1

Message

Date

Andreas Kling

66fb0a8394

LibRegex/Rust: Add the ECMA-262 regex engine

Add LibRegex's new Rust ECMAScript regular expression engine.

Replace the old parser's direct pattern-to-bytecode pipeline with a
split architecture: parse patterns into a lossless AST first, then
lower that AST into bytecode for a dedicated backtracking VM. Keep the
syntax tree as the place for validation, analysis, and optimization
instead of teaching every transformation to rewrite partially built
bytecode.
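
The split described above can be illustrated with a minimal sketch. All names here (`Node`, `Op`, `lower`) are hypothetical and are not taken from the crate; the sketch only shows the general shape of lowering a small AST into instructions for a backtracking VM, with jump targets patched after both branches have been emitted.

```rust
// Hypothetical types for illustration; not the crate's actual AST or bytecode.

/// A tiny regex AST: literal code units, concatenation, and alternation.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Char(u16),
    Concat(Vec<Node>),
    Alt(Box<Node>, Box<Node>),
}

/// Instructions for a toy backtracking VM.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    /// Consume one code unit equal to the operand.
    Char(u16),
    /// Try `first`; if that path fails, resume at `second`.
    Split { first: usize, second: usize },
    /// Continue at the given instruction index.
    Jump(usize),
    /// Report a successful match.
    Match,
}

fn lower_into(node: &Node, out: &mut Vec<Op>) {
    match node {
        Node::Char(c) => out.push(Op::Char(*c)),
        Node::Concat(items) => {
            for item in items {
                lower_into(item, out);
            }
        }
        Node::Alt(left, right) => {
            // Emit placeholders first, then patch them once the
            // positions of both branches are known.
            let split_at = out.len();
            out.push(Op::Split { first: 0, second: 0 });
            lower_into(left, out);
            let jump_at = out.len();
            out.push(Op::Jump(0));
            let second = out.len();
            lower_into(right, out);
            let end = out.len();
            out[split_at] = Op::Split { first: split_at + 1, second };
            out[jump_at] = Op::Jump(end);
        }
    }
}

/// Lower a whole pattern and terminate it with `Match`.
pub fn lower(ast: &Node) -> Vec<Op> {
    let mut out = Vec::new();
    lower_into(ast, &mut out);
    out.push(Op::Match);
    out
}
```

Because the tree is complete before any instruction is emitted, validation or rewriting passes can run on `Node` values and never need to touch partially built bytecode.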

Specialize this backend for the job LibJS actually needs. The old C++
engine shared one generic parser and matcher stack across ECMA-262 and
POSIX modes and supported both byte-string and UTF-16 inputs. The new
engine focuses on ECMA-262 semantics on WTF-16 data, which lets it
model lone surrogates and other JavaScript-specific behavior directly
instead of carrying POSIX and multi-encoding constraints through the
whole implementation.
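
As a rough illustration of what working on WTF-16 data means, the sketch below decodes a slice of 16-bit code units into code points while keeping unpaired surrogates as values of their own. The function name `code_points` is hypothetical; only `std::char::decode_utf16` and `DecodeUtf16Error::unpaired_surrogate` come from the Rust standard library.

```rust
/// Decode WTF-16 code units into code points.
/// Well-formed surrogate pairs become one supplementary code point;
/// lone surrogates are kept as-is instead of being rejected or replaced.
pub fn code_points(input: &[u16]) -> Vec<u32> {
    std::char::decode_utf16(input.iter().copied())
        .map(|unit| match unit {
            Ok(c) => c as u32,
            // A lone surrogate is not a valid `char`, so carry it as a raw u32.
            Err(e) => u32::from(e.unpaired_surrogate()),
        })
        .collect()
}
```

A strict UTF-16 decoder such as `String::from_utf16` returns an error for the same input, which is why JavaScript strings cannot be modeled as plain Rust `String` values without losing information.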

Fill in the ECMAScript features needed to replace the old engine for
real web workloads: Unicode properties and sets, lookahead and
lookbehind, named groups and backreferences, modifier groups, string
properties, large quantifiers, lone surrogates, and the parser and VM
corner cases those features exercise.

Reshape the runtime around compile-time pattern hints and a hotter VM
loop. Pre-resolve Unicode properties, derive first-character,
character-class, and simple-scan filters, extract safe trailing
literals for anchored patterns, add literal and literal-alternation
fast paths, and keep reusable scratch storage for registers,
backtracking state, and modifier stacks. Teach `find_all` to stay
inside one VM so global searches stop paying setup costs on every
match.
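
A much simplified sketch of two of these ideas follows: a first-character filter that skips positions which cannot start a match, and a `find_all` that runs as one loop over the input instead of restarting a search per match. `LiteralScanner` and its methods are hypothetical names, and the sketch assumes a pure literal pattern; it does not model the VM or its scratch buffers.

```rust
/// Hypothetical literal-only matcher over UTF-16 code units.
pub struct LiteralScanner {
    needle: Vec<u16>,
    /// Compile-time hint: the only code unit a match can start with.
    first: u16,
}

impl LiteralScanner {
    /// Returns `None` for an empty pattern, which has no first code unit.
    pub fn new(pattern: &str) -> Option<Self> {
        let needle: Vec<u16> = pattern.encode_utf16().collect();
        let first = *needle.first()?;
        Some(Self { needle, first })
    }

    /// Collect all non-overlapping matches as (start, end) offsets.
    pub fn find_all(&self, haystack: &[u16]) -> Vec<(usize, usize)> {
        let mut matches = Vec::new();
        let mut pos = 0;
        while pos + self.needle.len() <= haystack.len() {
            // First-character filter: reject cheaply before comparing the rest.
            if haystack[pos] != self.first {
                pos += 1;
                continue;
            }
            if haystack[pos..].starts_with(&self.needle) {
                matches.push((pos, pos + self.needle.len()));
                // Continue after the match without any per-match setup.
                pos += self.needle.len();
            } else {
                pos += 1;
            }
        }
        matches
    }
}
```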

Make those shortcuts semantics-aware instead of merely fast. In Unicode
mode, do not use literal fast paths for lone surrogates, since
ECMA-262 does not let `/\ud83d/u` match inside a surrogate pair.
Likewise, only derive end-anchor suffix hints when the suffix lies on
every path to `Match`, so lookarounds and disjunctions cannot skip into
a shared tail and produce false negatives.
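
The surrogate rule can be made concrete with a small sketch that contrasts a naive code-unit search with a code-point-aware one. Both function names are hypothetical, and the second function assumes the search starts at index 0, so every position it visits is a code point boundary.

```rust
fn is_high_surrogate(unit: u16) -> bool {
    (0xD800..=0xDBFF).contains(&unit)
}

fn is_low_surrogate(unit: u16) -> bool {
    (0xDC00..=0xDFFF).contains(&unit)
}

/// What a naive literal fast path would do: compare raw code units.
pub fn find_code_unit(input: &[u16], target: u16) -> Option<usize> {
    input.iter().position(|&unit| unit == target)
}

/// Unicode-mode search: step over well-formed surrogate pairs as one
/// code point, so a surrogate `target` only matches when it stands alone.
pub fn find_lone_code_unit(input: &[u16], target: u16) -> Option<usize> {
    let mut i = 0;
    while i < input.len() {
        let paired = is_high_surrogate(input[i])
            && i + 1 < input.len()
            && is_low_surrogate(input[i + 1]);
        if paired {
            // Both halves belong to one supplementary code point; skip them.
            i += 2;
            continue;
        }
        if input[i] == target {
            return Some(i);
        }
        i += 1;
    }
    None
}
```

On the two code units of U+1F600 (`0xD83D 0xDE00`), the naive search reports a hit at index 0 while the code-point-aware search reports none, which is the kind of discrepancy a fast path has to avoid in Unicode mode.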

This commit lands the Rust crate, the C++ wrapper, the build
integration, and the initial LibJS-side plumbing needed to exercise
the new engine under real RegExp callers before removing the legacy
backend.

2026-03-27 17:32:19 +01:00

mikiubo

535d2476a7

LibRegex: Implement proper lookbehind via new StepBack opcodes

This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.

These opcodes replace the previous GoBack-based approach and enable
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.

Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.
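
The general idea behind evaluating a variable-length lookbehind can be sketched as follows: remember the current position, step back by an increasing distance, and accept only if the body matches from there and ends exactly at the remembered position. This is an illustration under assumptions, not a reproduction of the four opcodes, whose exact semantics are not described here; `lookbehind_holds` and `digits_between` are hypothetical names.

```rust
/// Hypothetical lookbehind body: one or more ASCII digits that span
/// exactly `input[start..end]`.
pub fn digits_between(input: &[u16], start: usize, end: usize) -> bool {
    start < end
        && input[start..end]
            .iter()
            .all(|&unit| (0x30..=0x39).contains(&unit))
}

/// Evaluate a positive lookbehind at `pos` without consuming input.
/// `body(input, start, end)` must report whether the body matches
/// starting at `start` and ending exactly at `end`.
/// Assumes `pos <= input.len()`.
pub fn lookbehind_holds(
    input: &[u16],
    pos: usize,
    body: impl Fn(&[u16], usize, usize) -> bool,
) -> bool {
    let saved = pos;
    // The step-back distance cannot exceed the number of units before `pos`.
    for step in 0..=saved {
        if body(input, saved - step, saved) {
            return true;
        }
    }
    false
}
```

With the input `ab12c` and a body equivalent to `\d+`, the assertion holds at offsets 3 and 4 (after `1` and after `12`) and fails at offsets 0 and 2, even though the body's length is not known in advance.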

Partially fixes issue #3459.

2026-01-11 23:24:49 +01:00

2 Commits