Backreferences can match the empty string when the referenced group
didn't participate in the match, so we shouldn't add their length to the
match_length_minimum, as it makes us skip valid matches.
In non-Unicode mode, incomplete escape sequences like `\x0` or `\u00`
should be parsed as literal characters. `read_digits_as_string` consumed
hex digits but did not restore the parser position when fewer digits
than required were found, and `consume_escaped_code_point` did not
update `current_token` after falling back to literal 'u'.
When `\0` is followed by digits, we backtrack to parse it as a legacy
octal escape. We need to backtrack 2 characters, so
`parse_legacy_octal_escape` sees the leading `0` and can parse sequences
correctly.
This commit implements the regexp-modifiers proposal. It allows us to
use modification of i,m,s flags within groups using
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.
These opcodes replace the previous GoBack-based approach and enables
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.
Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.
Partially fix issue: #3459
Previously, we used restoration based on character position in parser.
This caused the lexer to re-tokenize from the middle of multi-character
tokens like escape sequences, and led to incorrect parse failures for
patterns like `[\[\]]`. We would backtrack to before the first `\[`
token, then re-lex the `[` as a separate token instead of part of the
`\[` escape.
Now we save and restore the actual token object along with the lexer
index, so we keep correct token state when backtracking.
We were incorrectly checking for negated character class when string
properties appeared in nested classes. Now we track negation state in
the parser and correctly reject invalid string properties in negated
classes.
Fixes parsing of regex quantifiers with extremely large numeric values.
Previously, very large quantifiers would fail to parse, but Chrome and
Firefox both clamp such large values to 2^31-1 instead of rejecting
them. So now we do the same.
This commit implements support for forward references to named capture
groups. We now allow patterns like \k<name>(?<name>x) and
self-references like (?<name>\k<name>x).
Previously, named capture groups in RegExp results did not always follow
their source order, and unmatched groups were omitted. According to the
spec, all named capture groups must appear in the result object in the
order they are defined, even if they did not participate in the match.
This commit makes sure we follow this requirement.
Fixes handling of backreferences when the referenced capture group is
undefined or hasn't participated in the match.
CharacterCompareType::NamedReference is added to distinguish numbered
(\1) from named (\k<name>) backreferences. Numbered backreferences use
exact group lookup. Named backreferences search for participating
groups among duplicates.
We had typo'd using ClassSetReservedDoublePunctuator which was
resulting in a parse error for the regex:
([^\\:]+?)
With the 'v' flag set.
Co-Authored-By: Ali Mohammad Pur <mpfard@serenityos.org>
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float
However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.
We can also use fast_float for integer parsing.
Take record of the named capture group prior to parsing the group's
body. This requires removal of the recorded minimum length of the named
capture group directly, and now needs to be looked up via the group
minimu lengths table.
To allow storing unicode ranges compactly; this is not utilised at the
moment, but changing this later would've been significantly more
difficult.
Also fixes a few debug logs.