Delete the old C++ ECMA-262 parser, optimizer, and matcher now that all
in-tree users compile and execute through `ECMAScriptRegex`.
Stop building the legacy engine, remove its source files and the
POSIX-only fuzzers that depended on it, and update the remaining
LibRegex tests to target the Rust-backed facade instead of the deleted
implementation. Clean up the last includes, comments, and helper paths
that only existed to support the old backend.
After this commit LibRegex has a single ECMAScript engine in-tree,
eliminating duplicated maintenance and unifying future regex work.
Backreferences can match the empty string when the referenced group
didn't participate in the match, so we shouldn't add their length to the
match_length_minimum, as it makes us skip valid matches.
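A minimal sketch of the idea, assuming a simplified node list rather than
the real bytecode. `Literal`, `Backreference`, and
`minimum_match_length()` are hypothetical names used only for
illustration:

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical, simplified pattern nodes; not LibRegex's actual bytecode.
struct Literal {
    std::string text;
};
struct Backreference {
    std::size_t group;
};
using Node = std::variant<Literal, Backreference>;

// Lower bound on the number of code units any match must consume.
std::size_t minimum_match_length(std::vector<Node> const& nodes)
{
    std::size_t minimum = 0;
    for (auto const& node : nodes) {
        if (auto const* literal = std::get_if<Literal>(&node))
            minimum += literal->text.size();
        // A backreference to a group that did not participate matches the
        // empty string, so it contributes nothing to the lower bound.
    }
    return minimum;
}
```

With this rule, a literal followed by a backreference only requires the
literal's length, so short but valid inputs are no longer skipped.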
The optimizer wasn't considering case-insensitive mode when checking for
overlap between the repeated element and what follows, so patterns like
`/a*A\d/i` failed to match because 'a' and 'A' weren't seen as
overlapping. We now compare them case-insensitively when the `i` flag
is set.
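The comparison can be sketched roughly as follows. This is ASCII-only and
`characters_overlap()` is a hypothetical helper; real ECMAScript case
folding follows Unicode rules:

```cpp
#include <cctype>

// ASCII-only sketch of an overlap check between two single characters.
bool characters_overlap(char lhs, char rhs, bool insensitive)
{
    if (!insensitive)
        return lhs == rhs;
    // Fold both sides to lowercase before comparing.
    auto fold = [](char c) { return std::tolower(static_cast<unsigned char>(c)); };
    return fold(lhs) == fold(rhs);
}
```

Under the `i` flag, 'a' and 'A' are reported as overlapping, which stops
the optimizer from treating `a*` as unable to consume what `A` needs.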
`range_contains()` checked if an lhs_range was contained within the
query range, rather than checking for overlap. This caused patterns
like `/A*[A-Z]/` to fail matching "A" because the optimizer didn't
detect that 'A' overlaps with [A-Z]. And `char_class_contains()` only
checked if two character classes were identical, not if they overlapped.
So patterns like `/\d*\w/` failed to match "1" because \d and \w were
not recognized as overlapping, even though all digits are word
characters.
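The difference between containment and overlap for inclusive ranges can
be sketched like this. `CodePointRange`, `range_is_contained()`, and
`ranges_overlap()` are hypothetical names, not the functions touched by
this commit:

```cpp
#include <cstdint>

// Inclusive range of code points; a hypothetical stand-in type.
struct CodePointRange {
    std::uint32_t from;
    std::uint32_t to;
};

// The old behaviour: true only if lhs lies entirely inside query.
bool range_is_contained(CodePointRange lhs, CodePointRange query)
{
    return query.from <= lhs.from && lhs.to <= query.to;
}

// What the check needs: true if the two ranges share any code point.
bool ranges_overlap(CodePointRange lhs, CodePointRange query)
{
    return lhs.from <= query.to && query.from <= lhs.to;
}
```

For `[A-Z]` against the single character 'A', containment is false while
overlap is true, which is the case the old check missed.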
This commit implements the regexp-modifiers proposal, which allows the
`i`, `m`, and `s` flags to be changed within a group using the
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
Step 2.b of the RepeatMatcher states that once minimum repetitions
are satisfied, empty matches should not be considered for further
repetitions. This was not being enforced for optional quantifiers
like `?`, so we had extra capture group matches.
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.
These opcodes replace the previous GoBack-based approach and enable
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.
Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.
Partially fix issue: #3459
Previously, the parser restored its state based on character position.
This caused the lexer to re-tokenize from the middle of multi-character
tokens like escape sequences, and led to incorrect parse failures for
patterns like `[\[\]]`. We would backtrack to before the first `\[`
token, then re-lex the `[` as a separate token instead of part of the
`\[` escape.
Now we save and restore the actual token object along with the lexer
index, so we keep correct token state when backtracking.
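A rough sketch of saving the token together with the lexer index. The
`Lexer`, `Parser`, and `ParserState` types below are simplified,
hypothetical stand-ins and only recognise two-character escapes:

```cpp
#include <cstddef>
#include <string_view>

enum class TokenType { EscapeSequence, Char, Eof };

struct Token {
    TokenType type;
    std::string_view text;
};

// Toy lexer: a backslash plus the next character forms one token.
class Lexer {
public:
    explicit Lexer(std::string_view source) : m_source(source) {}

    Token next()
    {
        if (m_index >= m_source.size())
            return { TokenType::Eof, {} };
        if (m_source[m_index] == '\\' && m_index + 1 < m_source.size()) {
            Token token { TokenType::EscapeSequence, m_source.substr(m_index, 2) };
            m_index += 2;
            return token;
        }
        Token token { TokenType::Char, m_source.substr(m_index, 1) };
        m_index += 1;
        return token;
    }

    std::size_t index() const { return m_index; }
    void set_index(std::size_t index) { m_index = index; }

private:
    std::string_view m_source;
    std::size_t m_index { 0 };
};

// Saved state holds the current token itself, not just a position.
struct ParserState {
    Token current;
    std::size_t lexer_index;
};

class Parser {
public:
    explicit Parser(std::string_view source)
        : m_lexer(source)
        , m_current(m_lexer.next())
    {
    }

    Token const& current() const { return m_current; }
    void consume() { m_current = m_lexer.next(); }

    ParserState save() const { return { m_current, m_lexer.index() }; }
    void restore(ParserState const& state)
    {
        m_current = state.current;
        m_lexer.set_index(state.lexer_index);
    }

private:
    Lexer m_lexer;
    Token m_current;
};
```

Because the token is restored as a whole, backtracking to `\[` never
re-lexes its second character as a separate `[` token.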
We were incorrectly checking for a negated character class when string
properties appeared in nested classes. Now we track negation state in
the parser and correctly reject invalid string properties in negated
classes.
Patterns like /[^\S]/ should match whitespace characters, but previously
would fail to match. The position would advance twice: once during the
character class comparison, and again at the end when temporary_inverse
was reset. This caused matches to be skipped incorrectly.
Now we advance at the end only if position hasn't already changed during
the loop.
Fixes parsing of regex quantifiers with extremely large numeric values.
Previously, very large quantifiers would fail to parse, but Chrome and
Firefox both clamp such large values to 2^31-1 instead of rejecting
them. So now we do the same.
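The clamping can be sketched as a saturating digit parse.
`parse_clamped_quantifier()` is a hypothetical helper that assumes its
input contains only ASCII digits:

```cpp
#include <cstdint>
#include <limits>
#include <string_view>

// Parses a run of ASCII digits, saturating at 2^31 - 1 instead of failing.
std::uint32_t parse_clamped_quantifier(std::string_view digits)
{
    constexpr std::uint64_t maximum = std::numeric_limits<std::int32_t>::max();
    std::uint64_t value = 0;
    for (char digit : digits) {
        value = value * 10 + static_cast<std::uint64_t>(digit - '0');
        // Stop early so the accumulator itself can never overflow.
        if (value > maximum)
            return static_cast<std::uint32_t>(maximum);
    }
    return static_cast<std::uint32_t>(value);
}
```

Returning as soon as the limit is exceeded keeps the accumulator within
64 bits no matter how many digits the pattern contains.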
Fixes handling of backreferences when the referenced capture group is
undefined or hasn't participated in the match.
CharacterCompareType::NamedReference is added to distinguish numbered
(\1) from named (\k<name>) backreferences. Numbered backreferences use
exact group lookup. Named backreferences search for participating
groups among duplicates.
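The two lookup strategies can be sketched as below. `CaptureGroup`,
`resolve_numbered()`, and `resolve_named()` are hypothetical and do not
mirror the matcher's real data structures:

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical capture group record.
struct CaptureGroup {
    std::string name;                 // empty for unnamed groups
    std::optional<std::string> match; // nullopt if the group did not participate
};

// Numbered backreference: exact lookup by 1-based group number.
std::string resolve_numbered(std::vector<CaptureGroup> const& groups, std::size_t number)
{
    if (number == 0 || number > groups.size())
        return {};
    return groups[number - 1].match.value_or("");
}

// Named backreference: the first participating group among duplicates.
std::string resolve_named(std::vector<CaptureGroup> const& groups, std::string const& name)
{
    for (auto const& group : groups) {
        if (group.name == name && group.match.has_value())
            return *group.match;
    }
    return {};
}
```

In both cases a group that did not participate resolves to the empty
string, so the backreference matches without consuming input.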
Not accounting for opcode size when calculating incoming jump edges
meant that we were merging nodes where we otherwise shouldn't have been,
for example /.*a|.*b/.
Finishes what 7f6b70fafb started.
Having one part use length and another use code unit length led to
crashes; the added test ensures we don't mess that up again.
This prevents empty matches from overwriting non-empty captures in
quantified alternations. Fixes patterns like (a|a?)+ where the optional
branch would incorrectly overwrite meaningful captures with empty
strings.
A typo in our use of ClassSetReservedDoublePunctuator was
resulting in a parse error for the regex:
([^\\:]+?)
With the 'v' flag set.
Co-Authored-By: Ali Mohammad Pur <mpfard@serenityos.org>
As LibRegex was not specified in TEST_DIRECTORIES, the existing
Tests/LibRegex subdirectory was not actually included during
configuration. The RegexLibC test has also not been needed
since the migration away from Serenity's LibC, so that test
has been removed entirely. The Regex.cpp test is also renamed
to TestRegex.cpp to match the naming convention of most test
targets.