Rework our hash functions a bit for significant better performance:
* Rename int_hash to u32_hash to mirror u64_hash.
* Make pair_int_hash call u64_hash instead of multiple u32_hash()es.
* Implement MurmurHash3's fmix32 and fmix64 for u32_hash and u64_hash.
On my machine, this speeds up u32_hash by 20%, u64_hash by ~290%, and
pair_int_hash by ~260%.
We lose the property that an input of 0 results in something that is not
0. I've experimented with an offset to both hash functions, but it
resulted in a measurable performance degradation for u64_hash. If
there's a good use case for 0 not to result in 0, we can always add in
that offset as a countermeasure in the future.
Previously, case-insensitive regex matching used ASCII-only case
conversion (to_ascii_lowercase) even for Unicode characters.
Now we implement Canonicalize abstract operation, so we can case-fold
Unicode characters properly during case-insensitive matching.
Avoid crashing in RegexDebug when saved_positions or step_backs
are empty.
These cases are already handled correctly by the bytecode execution,
but the debug output assumed non-empty vectors.
Print a placeholder instead when no entries are present.
This fixes#7502.
This commit implements the regexp-modifiers proposal. It allows us to
use modification of i,m,s flags within groups using
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
Step 2.b of the RepeatMatcher states that once minimum repetitions
are satisfied, empty matches should not be considered for further
repetitions. This was not being enforced for optional quantifiers
like `?`, so we had extra capture group matches.
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.
These opcodes replace the previous GoBack-based approach and enables
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.
Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.
Partially fix issue: #3459
ca2f0141f6 removed only the execution side
of this, which made it skip some optimisations for pure string searches.
This commit implements it properly for utf16 strings instead.
Previously, we used restoration based on character position in parser.
This caused the lexer to re-tokenize from the middle of multi-character
tokens like escape sequences, and led to incorrect parse failures for
patterns like `[\[\]]`. We would backtrack to before the first `\[`
token, then re-lex the `[` as a separate token instead of part of the
`\[` escape.
Now we save and restore the actual token object along with the lexer
index, so we keep correct token state when backtracking.
We were incorrectly checking for negated character class when string
properties appeared in nested classes. Now we track negation state in
the parser and correctly reject invalid string properties in negated
classes.
Patterns like /[^\S]/ should match whitespace characters, but previously
would fail to match. The position would advance twice: once during the
character class comparison, and again at the end when temporary_inverse
was reset. This caused matches to be skipped incorrectly.
Now we advance at the end only if position hasn't already changed during
the loop.
Fixes parsing of regex quantifiers with extremely large numeric values.
Previously, very large quantifiers would fail to parse, but Chrome and
Firefox both clamp such large values to 2^31-1 instead of rejecting
them. So now we do the same.
This avoids repeatedly checking if the bytecode has been flattened
(which is always the case by the time we're executing).
1.05x speedup on Octane/regexp.js
a290034a81 passed an empty vector to this,
which caused nodes that appeared multiple times to reset the trie
metadata...which broke the optimisation.
This patchset makes the function take a 'provide missing metadata'
function instead, and only invokes it when the node is missing rather
than unconditionally setting the metadata on all nodes.
Previously, both string_position and view_index used code unit offsets
regardless of mode. Now in unicode mode, these variables track code
point positions while string_position_in_code_units is properly
updated to reflect code unit offsets.
This commit implements support for forward references to named capture
groups. We now allow patterns like \k<name>(?<name>x) and
self-references like (?<name>\k<name>x).
Previously, named capture groups in RegExp results did not always follow
their source order, and unmatched groups were omitted. According to the
spec, all named capture groups must appear in the result object in the
order they are defined, even if they did not participate in the match.
This commit makes sure we follow this requirement.
Fixes handling of backreferences when the referenced capture group is
undefined or hasn't participated in the match.
CharacterCompareType::NamedReference is added to distinguish numbered
(\1) from named (\k<name>) backreferences. Numbered backreferences use
exact group lookup. Named backreferences search for participating
groups among duplicates.
When building with REGEX_DEBUG or ENABLE_ALL_THE_DEBUG_MACROS there are
two issues with linking of bin/TestRegex
- Libraries/LibRegex/RegexDebug.h:76 with undefined reference
regex::OpCode_Compare::variable_arguments_to_byte_string(
AK::Optional<regex::MatchInput const&>) const
- Libraries/LibRegex/RegexByteCode.h:672 with undefined reference
regex::OpCode::name(regex::OpCodeId)
Add REGEX_API on regex::OpCode and regex::OptCode_Compare to allow
access to the classes in bin/TestRegex
Not accounting for opcode size when calculating incoming jump edges
meant that we were merging nodes where we otherwise shouldn't have been,
for example /.*a|.*b/.