Commit Graph

160 Commits

Author SHA1 Message Date
Ben Wiederhake
7fb7025d69 LibRegex: Remove unused header in Regex 2026-02-23 12:15:23 +01:00
Jelle Raaijmakers
1745926fc6 AK+Everywhere: Use MurmurHash3 for int/u64 hashing
Rework our hash functions a bit for significantly better performance:

* Rename int_hash to u32_hash to mirror u64_hash.
* Make pair_int_hash call u64_hash instead of multiple u32_hash()es.
* Implement MurmurHash3's fmix32 and fmix64 for u32_hash and u64_hash.

On my machine, this speeds up u32_hash by 20%, u64_hash by ~290%, and
pair_int_hash by ~260%.

We lose the property that an input of 0 results in something that is not
0. I've experimented with an offset to both hash functions, but it
resulted in a measurable performance degradation for u64_hash. If
there's a good use case for 0 not to result in 0, we can always add in
that offset as a countermeasure in the future.
2026-02-20 22:47:24 +01:00
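For reference, the fmix32 and fmix64 finalizers named in 1745926fc6 can be sketched as below. The constants and shifts are those of the published MurmurHash3 finalizers. The function names and `pair_hash_sketch` are illustrative assumptions, not the AK signatures. The sketch also shows the property the message mentions: an input of 0 maps to 0.

```cpp
#include <cassert>
#include <cstdint>

// MurmurHash3 32-bit finalizer (fmix32).
constexpr uint32_t fmix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bU;
    h ^= h >> 13;
    h *= 0xc2b2ae35U;
    h ^= h >> 16;
    return h;
}

// MurmurHash3 64-bit finalizer (fmix64).
constexpr uint64_t fmix64(uint64_t k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

// Hypothetical pair hash: pack both 32-bit halves into one 64-bit word and
// mix once, instead of mixing each half separately. How AK combines the two
// values is an assumption here.
constexpr uint64_t pair_hash_sketch(uint32_t a, uint32_t b)
{
    return fmix64((static_cast<uint64_t>(a) << 32) | b);
}

// Every step is a xor-shift or a multiplication, so zero stays zero.
void zero_maps_to_zero_example()
{
    assert(fmix32(0) == 0);
    assert(fmix64(0) == 0);
}
```

Both finalizers are bijections, so distinct inputs always give distinct outputs. That is also why no nonzero input can collide with 0.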
aplefull
aeec2c804c LibRegex: Implement Unicode case-insensitive matching
Previously, case-insensitive regex matching used ASCII-only case
conversion (to_ascii_lowercase) even for Unicode characters.

Now we implement the Canonicalize abstract operation, so we can case-fold
Unicode characters properly during case-insensitive matching.
2026-02-16 07:51:00 -05:00
Ali Mohammad Pur
01be1ed583 LibRegex: Mark OpCode_classes with REGEX_API 2026-02-07 14:09:56 +01:00
Ali Mohammad Pur
6aba31ba13 LibRegex: Add some FileCheck-like tests to ensure opts don't break 2026-02-07 14:09:56 +01:00
Ali Mohammad Pur
fedf0f78ca LibRegex: Reject RSeekTo crossing the current-to-EOL boundary 2026-02-07 14:09:56 +01:00
Ali Mohammad Pur
f4d4bd9ed1 LibRegex: Ignore 'FailIfEmpty' in dot-star loop detection 2026-02-07 14:09:56 +01:00
mikiubo
5aaf08c7cf LibRegex: Make RegexDebug resilient to empty state vectors
Avoid crashing in RegexDebug when saved_positions or step_backs
are empty.
These cases are already handled correctly by the bytecode execution,
but the debug output assumed non-empty vectors.

Print a placeholder instead when no entries are present.
This fixes #7502.
2026-01-21 14:20:08 +01:00
aplefull
e4572aa9d7 LibRegex: Add support for regex modifiers
This commit implements the regexp-modifiers proposal. It allows the
i, m, and s flags to be changed within a group using
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
2026-01-16 15:00:00 +01:00
aplefull
6ce312e22f LibRegex: Prevent empty matches in optional quantifiers
Step 2.b of the RepeatMatcher states that once minimum repetitions
are satisfied, empty matches should not be considered for further
repetitions. This was not being enforced for optional quantifiers
like `?`, so we had extra capture group matches.
2026-01-16 01:11:24 +01:00
mikiubo
535d2476a7 LibRegex: Implement proper lookbehind via new StepBack opcodes
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.

These opcodes replace the previous GoBack-based approach and enable
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.

Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.

Partially fixes #3459.
2026-01-11 23:24:49 +01:00
Jelle Raaijmakers
ae20ecf857 AK+Everywhere: Add Vector::contains(predicate) and use it
No functional changes.
2026-01-08 15:27:30 +00:00
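The shape of a predicate-based `contains` (ae20ecf857) can be sketched with the standard library as a stand-in. `contains_matching` is a hypothetical free function over `std::vector`, not the AK::Vector member, whose exact signature is not shown in this log.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for a predicate-based contains(): true if any element satisfies
// the predicate. Uses std::vector and std::any_of, not AK.
template<typename T, typename Predicate>
bool contains_matching(std::vector<T> const& values, Predicate predicate)
{
    return std::any_of(values.begin(), values.end(), predicate);
}

void contains_example()
{
    std::vector<int> values { 1, 2, 3 };
    assert(contains_matching(values, [](int value) { return value > 2; }));
    assert(!contains_matching(values, [](int value) { return value > 3; }));
}
```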
Ali Mohammad Pur
2677338f43 LibRegex: Process RSeekTo candidates in the correct order 2026-01-07 00:14:02 +01:00
Ali Mohammad Pur
9668927dfc LibRegex: Don't generate duplicate results for /.*/ patterns
Since the code pattern may span multiple blocks, this can generate
duplicate results; keep the last one to avoid corrupting the bytecode.
2026-01-06 19:09:27 +01:00
Ali Mohammad Pur
363f1f6568 LibRegex: Correctly calculate ForkIf target offset in tree alternatives 2026-01-06 19:09:27 +01:00
Ali Mohammad Pur
41ce1023b8 LibRegex: Add default initialisers to ParserResult to make gcc happy 2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
fbd898fb54 LibRegex: Use nicer rewrite APIs where possible
Co-Authored-By: Hendiadyoin1 <leon.a@serenityos.org>
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
c1535ef65b LibRegex: Skip multi-op compare overhead when not necessary 2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
637d47ba30 LibRegex: Add an optimisation for replacing /.*x/ with a seek op
This will avoid some catastrophic backtracking by just skipping to 'x'.
2026-01-05 18:22:11 +01:00
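The idea behind 637d47ba30 can be illustrated on the pattern /.*x/ taken in isolation: a greedy `.*` followed by a literal ends just after the last occurrence of that literal on the current line, so one reverse search can replace consume-then-backtrack. The sketch below is a simplified model, not the RSeekTo opcode. The function name is hypothetical, only '\n' and '\r' are treated as line terminators, and the literal is assumed not to be a line terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the end offset of a greedy /.*x/ match that begins at `start`,
// or nullopt if the current line contains no `needle`.
std::optional<size_t> match_dot_star_then(std::string_view input, size_t start, char needle)
{
    if (start > input.size())
        return std::nullopt;
    // '.' does not cross line terminators (simplified to '\n' and '\r').
    size_t line_end = input.find_first_of("\n\r", start);
    if (line_end == std::string_view::npos)
        line_end = input.size();
    std::string_view line = input.substr(start, line_end - start);
    // Greedy semantics: prefer the last candidate on the line.
    size_t last = line.rfind(needle);
    if (last == std::string_view::npos)
        return std::nullopt;
    return start + last + 1;
}

void seek_example()
{
    auto end = match_dot_star_then("aaxbbxcc\nx", 0, 'x');
    assert(end.has_value() && *end == 6);
    assert(!match_dot_star_then("ab\nx", 0, 'x').has_value());
}
```

When more pattern follows the literal, earlier occurrences remain fallback candidates, which this sketch does not model.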
Ali Mohammad Pur
77d982d6fe LibRegex: Restore the pure substring search optimisation for u16view
ca2f0141f6 removed only the execution side
of this, which made it skip some optimisations for pure string searches.
This commit implements it properly for utf16 strings instead.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
e2c6918cdb LibRegex: Fuse consecutive single-char Compares into a String Compare
This avoids huge instruction decoding and dispatch overhead, ~40x
performance improvement for /(^|x)ppp/.
2026-01-05 18:22:11 +01:00
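A minimal sketch of the fusion described in e2c6918cdb, using a hypothetical `Instruction` model rather than the real bytecode layout: runs of single-character compares collapse into one string compare, so the matcher decodes and dispatches one instruction instead of several. A real pass would also have to respect jump targets between the fused instructions. This sketch ignores them.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical instruction model; the real LibRegex bytecode differs.
struct Instruction {
    enum class Kind { CompareChar, CompareString, Other } kind;
    std::string text; // one character for CompareChar, several for CompareString
};

// Merges each CompareChar into a directly preceding compare instruction.
std::vector<Instruction> fuse_char_compares(std::vector<Instruction> const& input)
{
    std::vector<Instruction> output;
    for (auto const& instruction : input) {
        bool is_char = instruction.kind == Instruction::Kind::CompareChar;
        bool can_extend = !output.empty()
            && output.back().kind != Instruction::Kind::Other;
        if (is_char && can_extend) {
            output.back().kind = Instruction::Kind::CompareString;
            output.back().text += instruction.text;
        } else {
            output.push_back(instruction);
        }
    }
    return output;
}

void fuse_example()
{
    using Kind = Instruction::Kind;
    std::vector<Instruction> program {
        { Kind::Other, "" }, { Kind::CompareChar, "p" },
        { Kind::CompareChar, "p" }, { Kind::CompareChar, "p" },
    };
    auto fused = fuse_char_compares(program);
    assert(fused.size() == 2);
    assert(fused[1].kind == Kind::CompareString && fused[1].text == "ppp");
}
```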
Ali Mohammad Pur
9d49fafdbf LibRegex: Add an optimisation to skip forks that cannot produce a match
...and implement it for 'start of line' checks.
This makes patterns like /(^|x)ppp/ fork-free at runtime, ~30% perf
improvement for that pattern.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
0acac7f02b LibRegex: Split basic blocks at jump targets too 2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
3f35d84785 LibRegex+LibJS: Flatten the bytecode buffer before regex execution
This makes it so we don't have to unnecessarily check for having a
flattened buffer; significant performance increase.
2026-01-05 18:22:11 +01:00
aplefull
3e391bdb2d LibRegex: Use token-state restoration in character class parsing
Previously, we used restoration based on character position in the parser.
This caused the lexer to re-tokenize from the middle of multi-character
tokens like escape sequences, and led to incorrect parse failures for
patterns like `[\[\]]`. We would backtrack to before the first `\[`
token, then re-lex the `[` as a separate token instead of part of the
`\[` escape.

Now we save and restore the actual token object along with the lexer
index, so we keep correct token state when backtracking.
2025-12-23 11:04:16 +01:00
aplefull
ff06a4a9e5 LibRegex: Fix negated class validation for nested string properties
We were incorrectly checking for a negated character class when string
properties appeared in nested classes. Now we track negation state in
the parser and correctly reject invalid string properties in negated
classes.
2025-12-23 11:04:16 +01:00
aplefull
f3a32a0b1a LibRegex: Use code unit offset for starting range checks 2025-12-23 11:04:16 +01:00
aplefull
1b570fcd61 LibRegex: Correct negated character class escapes behavior
Patterns like /[^\S]/ should match whitespace characters, but previously
would fail to match. The position would advance twice: once during the
character class comparison, and again at the end when temporary_inverse
was reset. This caused matches to be skipped incorrectly.

Now we advance at the end only if position hasn't already changed during
the loop.
2025-12-23 11:04:16 +01:00
aplefull
52a3c19c0a LibRegex: Clamp large quantifier values instead of rejecting them
Fixes parsing of regex quantifiers with extremely large numeric values.
Previously, very large quantifiers would fail to parse, but Chrome and
Firefox both clamp such large values to 2^31-1 instead of rejecting
them. So now we do the same.
2025-12-23 11:04:16 +01:00
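The clamping behaviour described in 52a3c19c0a can be sketched as a saturating digit parser. The function and constant names are hypothetical. Only the limit of 2^31-1 comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

constexpr uint32_t max_quantifier = 2147483647U; // 2^31 - 1

// Parses a run of ASCII digits, saturating at max_quantifier instead of
// failing on overflow. Parsing stops at the first non-digit.
constexpr uint32_t parse_clamped_quantifier(std::string_view digits)
{
    uint64_t value = 0;
    for (char c : digits) {
        if (c < '0' || c > '9')
            break;
        value = value * 10 + static_cast<uint64_t>(c - '0');
        if (value > max_quantifier)
            value = max_quantifier; // saturate; later digits cannot overflow u64
    }
    return static_cast<uint32_t>(value);
}

void clamp_example()
{
    assert(parse_clamped_quantifier("3") == 3);
    assert(parse_clamped_quantifier("99999999999999999999") == max_quantifier);
}
```

Accumulating in 64 bits and clamping after every digit keeps the intermediate value small, so arbitrarily long digit runs are safe.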
Andreas Kling
7d7886afea LibJS: Don't assume flattened bytecode when dumping OpCode_Compare
Fixes #7129
2025-12-13 16:40:19 -06:00
Andreas Kling
f7ea47145c LibRegex: Only call OpCode::size() once per matcher iteration 2025-12-13 13:51:12 -06:00
Andreas Kling
67b20017dc LibRegex: Cache pointer to flattened bytecode data in OpCode_Compare
This avoids repeatedly checking if the bytecode has been flattened
(which is always the case by the time we're executing).

1.05x speedup on Octane/regexp.js
2025-12-13 13:51:12 -06:00
Andreas Kling
82fe962d96 LibJS: Don't rerun regexp optimizer every time a regexp literal is used 2025-12-12 11:43:35 -06:00
aplefull
934817d45e LibRegex: Add missing StringSet cases 2025-11-27 14:02:04 +01:00
Tim Ledbetter
1abc91ccc6 LibRegex: Put debug mode code block behind a flag
This block should be optimized out anyway, but putting the whole thing
behind a flag makes the intent clearer.
2025-11-26 14:33:59 +00:00
Tim Ledbetter
4c491b8920 LibRegex: Remove unused code from RegexStringView 2025-11-26 14:33:59 +00:00
Tim Ledbetter
061b457bac LibRegex: Use unchecked_empend() where possible 2025-11-26 14:33:59 +00:00
aplefull
eed4dd3745 LibRegex: Add support for string literals in character classes 2025-11-26 11:34:38 +01:00
aplefull
a49c39de32 LibRegex: Support matching unicode multi-character sequences 2025-11-26 11:34:38 +01:00
Ali Mohammad Pur
d5d37abfa5 AK+LibRegex: Only set node metadata on Trie::ensure_child if missing
a290034a81 passed an empty vector to this,
which caused nodes that appeared multiple times to reset the trie
metadata...which broke the optimisation.

This patchset makes the function take a 'provide missing metadata'
function instead, and only invokes it when the node is missing rather
than unconditionally setting the metadata on all nodes.
2025-11-21 02:46:33 +01:00
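The change in d5d37abfa5 can be illustrated with a toy trie. This is an assumption-laden sketch: `TrieNode`, its `std::vector<int>` metadata, and the `std::map` storage are all hypothetical stand-ins for AK::Trie. The point is that the metadata provider runs only when the child is created, so revisiting a node no longer resets what was stored there.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Toy trie node; not the AK::Trie interface.
struct TrieNode {
    std::vector<int> metadata;
    std::map<char, std::unique_ptr<TrieNode>> children;

    // Invokes provide_metadata only if the child does not exist yet.
    template<typename ProvideMetadata>
    TrieNode& ensure_child(char key, ProvideMetadata provide_metadata)
    {
        auto it = children.find(key);
        if (it == children.end()) {
            auto node = std::make_unique<TrieNode>();
            node->metadata = provide_metadata();
            it = children.emplace(key, std::move(node)).first;
        }
        return *it->second;
    }
};

void ensure_child_example()
{
    TrieNode root;
    auto empty = [] { return std::vector<int> {}; };
    root.ensure_child('a', empty).metadata.push_back(7);
    // A second visit keeps the existing metadata instead of resetting it.
    assert(root.ensure_child('a', empty).metadata.size() == 1);
}
```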
Ali Mohammad Pur
a290034a81 LibRegex: Start alternation opt nodes with an empty vector
...instead of checking every time whether there's a vector there.
Fixes #6755.
2025-11-08 11:51:27 +01:00
Ali Mohammad Pur
57ef949b61 LibRegex: Account for nested 'or' compare ops
Closes #6647.
2025-11-01 17:49:57 +01:00
aplefull
8c9c2ee289 LibRegex: Track local compares in nested classes 2025-11-01 14:38:08 +01:00
aplefull
5632a52531 LibRegex: Properly track code units in u-v modes
Previously, both string_position and view_index used code unit offsets
regardless of mode. Now in unicode mode, these variables track code
point positions while string_position_in_code_units is properly
updated to reflect code unit offsets.
2025-10-24 21:23:06 +02:00
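The distinction drawn in 5632a52531 between code point positions and code unit offsets can be sketched for UTF-16 as follows. The helper name is hypothetical, and the real matcher presumably tracks both values incrementally rather than recomputing one from the other.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Converts a code point index into a UTF-16 code unit offset. A valid
// surrogate pair counts as one code point but occupies two code units.
// Lone surrogates count as one code point each.
constexpr size_t code_unit_offset_of(std::u16string_view text, size_t code_point_index)
{
    size_t offset = 0;
    for (size_t seen = 0; seen < code_point_index && offset < text.size(); ++seen) {
        char16_t unit = text[offset];
        bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        bool has_low = offset + 1 < text.size()
            && text[offset + 1] >= 0xDC00 && text[offset + 1] <= 0xDFFF;
        offset += (is_high && has_low) ? 2 : 1;
    }
    return offset;
}

void offsets_example()
{
    // 'a', U+1F600 (one code point, two code units), 'b'
    std::u16string_view text = u"a\U0001F600b";
    assert(text.size() == 4);
    assert(code_unit_offset_of(text, 2) == 3);
}
```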
aplefull
7ce4abe330 LibRegex+LibUnicode: Add unicode string properties 2025-10-24 13:24:55 -04:00
aplefull
4b989b8efd LibRegex: Add support for forward references to named capture groups
This commit implements support for forward references to named capture
groups. We now allow patterns like \k<name>(?<name>x) and
self-references like (?<name>\k<name>x).
2025-10-16 16:37:54 +02:00
aplefull
25a47ceb1b LibRegex+LibJS: Include all named capture groups in source order
Previously, named capture groups in RegExp results did not always follow
their source order, and unmatched groups were omitted. According to the
spec, all named capture groups must appear in the result object in the
order they are defined, even if they did not participate in the match.
This commit makes sure we follow this requirement.
2025-10-16 16:37:54 +02:00
aplefull
c4eef822de LibRegex: Fix backreferences to undefined capture groups
Fixes handling of backreferences when the referenced capture group is
undefined or hasn't participated in the match.
CharacterCompareType::NamedReference is added to distinguish numbered
(\1) from named (\k<name>) backreferences. Numbered backreferences use
exact group lookup. Named backreferences search for participating
groups among duplicates.
2025-10-16 16:37:54 +02:00
Rocco Corsi
3d1d055e27 LibRegex: Export OpCode/OpCode_Compare for REGEX_DEBUG builds
When building with REGEX_DEBUG or ENABLE_ALL_THE_DEBUG_MACROS there are
two issues with linking of bin/TestRegex

 - Libraries/LibRegex/RegexDebug.h:76 with undefined reference
       regex::OpCode_Compare::variable_arguments_to_byte_string(
           AK::Optional<regex::MatchInput const&>) const

 - Libraries/LibRegex/RegexByteCode.h:672 with undefined reference
       regex::OpCode::name(regex::OpCodeId)

Add REGEX_API on regex::OpCode and regex::OpCode_Compare to allow
access to the classes in bin/TestRegex
2025-09-18 11:02:13 +02:00
Callum Law
8ada4b7fdc LibRegex: Account for opcode size when calculating incoming jump edges
Not accounting for opcode size when calculating incoming jump edges
meant that we were merging nodes where we otherwise shouldn't have been,
for example /.*a|.*b/.
2025-07-28 17:06:58 +02:00