Commit Graph

127 Commits

Author SHA1 Message Date
aplefull
934817d45e LibRegex: Add missing StringSet cases 2025-11-27 14:02:04 +01:00
Tim Ledbetter
1abc91ccc6 LibRegex: Put debug mode code block behind a flag
This block should be optimized out anyway, but putting the whole thing
behind a flag makes the intent clearer.
2025-11-26 14:33:59 +00:00
Tim Ledbetter
4c491b8920 LibRegex: Remove unused code from RegexStringView 2025-11-26 14:33:59 +00:00
Tim Ledbetter
061b457bac LibRegex: Use unchecked_empend() where possible 2025-11-26 14:33:59 +00:00
aplefull
eed4dd3745 LibRegex: Add support for string literals in character classes 2025-11-26 11:34:38 +01:00
aplefull
a49c39de32 LibRegex: Support matching unicode multi-character sequences 2025-11-26 11:34:38 +01:00
Ali Mohammad Pur
d5d37abfa5 AK+LibRegex: Only set node metadata on Trie::ensure_child if missing
a290034a81 passed an empty vector to this,
which caused nodes that appeared multiple times to reset the trie
metadata...which broke the optimisation.

This patchset makes the function take a 'provide missing metadata'
function instead, and only invokes it when the node is missing rather
than unconditionally setting the metadata on all nodes.
2025-11-21 02:46:33 +01:00
Ali Mohammad Pur
a290034a81 LibRegex: Start alternation opt nodes with an empty vector
...instead of checking every time whether there's a vector there.
Fixes #6755.
2025-11-08 11:51:27 +01:00
Ali Mohammad Pur
57ef949b61 LibRegex: Account for nested 'or' compare ops
Closes #6647.
2025-11-01 17:49:57 +01:00
aplefull
8c9c2ee289 LibRegex: Track local compares in nested classes 2025-11-01 14:38:08 +01:00
aplefull
5632a52531 LibRegex: Properly track code units in u-v modes
Previously, both string_position and view_index used code unit offsets
regardless of mode. Now in unicode mode, these variables track code
point positions while string_position_in_code_units is properly
updated to reflect code unit offsets.
2025-10-24 21:23:06 +02:00
aplefull
7ce4abe330 LibRegex+LibUnicode: Add unicode string properties 2025-10-24 13:24:55 -04:00
aplefull
4b989b8efd LibRegex: Add support for forward references to named capture groups
This commit implements support for forward references to named capture
groups. We now allow patterns like \k<name>(?<name>x) and
self-references like (?<name>\k<name>x).
2025-10-16 16:37:54 +02:00
aplefull
25a47ceb1b LibRegex+LibJS: Include all named capture groups in source order
Previously, named capture groups in RegExp results did not always follow
their source order, and unmatched groups were omitted. According to the
spec, all named capture groups must appear in the result object in the
order they are defined, even if they did not participate in the match.
This commit makes sure we follow this requirement.
2025-10-16 16:37:54 +02:00
aplefull
c4eef822de LibRegex: Fix backreferences to undefined capture groups
Fixes handling of backreferences when the referenced capture group is
undefined or hasn't participated in the match.
CharacterCompareType::NamedReference is added to distinguish numbered
(\1) from named (\k<name>) backreferences. Numbered backreferences use
exact group lookup. Named backreferences search for participating
groups among duplicates.
2025-10-16 16:37:54 +02:00
Rocco Corsi
3d1d055e27 LibRegex: Export OpCode/OpCode_Compare for REGEX_DEBUG builds
When building with REGEX_DEBUG or ENABLE_ALL_THE_DEBUG_MACROS there are
two issues with linking of bin/TestRegex

 - Libraries/LibRegex/RegexDebug.h:76 with undefined reference
       regex::OpCode_Compare::variable_arguments_to_byte_string(
           AK::Optional<regex::MatchInput const&>) const

 - Libraries/LibRegex/RegexByteCode.h:672 with undefined reference
       regex::OpCode::name(regex::OpCodeId)

Add REGEX_API on regex::OpCode and regex::OptCode_Compare to allow
access to the classes in bin/TestRegex
2025-09-18 11:02:13 +02:00
Callum Law
8ada4b7fdc LibRegex: Account for opcode size when calculating incoming jump edges
Not accounting for opcode size when calculating incoming jump edges
meant that we were merging nodes where we otherwise shouldn't have been,
for example /.*a|.*b/.
2025-07-28 17:06:58 +02:00
Jelle Raaijmakers
73967ee90c Everywhere: Use HashMap::update() where applicable 2025-07-25 16:22:06 +02:00
Ali Mohammad Pur
c7ad6cd508 LibRegex: Use code unit length in more places that apply
Finishes what 7f6b70fafb started.
Having one part use length and another code unit length lead to crashes,
the added test ensures we don't mess that up again.
2025-07-24 23:09:01 +02:00
aplefull
e2f8f5a350 LibRegex: Fix capture groups in quantified alternations
This prevents empty matches from overwriting non-empty captures in
quantified alternations. Fixes patterns like (a|a?)+ where the optional
branch would incorrectly overwrite meaningful captures with empty
strings.
2025-07-24 10:40:16 +02:00
Jelle Raaijmakers
3db7d802db LibRegex: Early return in Parser::try_skip()
No functional changes.
2025-07-22 09:10:32 -04:00
Jelle Raaijmakers
7f6b70fafb LibRegex: Use code unit length in Matcher<Parser>::match()
We were calling into `view.length()`, which potentially returned the
code _point_ length for Utf16Views. Make sure we use the code unit
length instead, since we're only indexing into code units.
2025-07-22 01:23:52 +02:00
Jelle Raaijmakers
930ac9898f LibRegex: Simplify ternary condition in RegexByteCode
No functional changes.
2025-07-22 01:23:52 +02:00
Timothy Flynn
81fc8ab8cc LibRegex: Rename a couple of RegexStringView methods for clarity
`operator[]` -> `code_point_at`
`code_unit_at` -> `unicode_aware_code_point_at`

`unicode_aware_code_point_at` returns either a code point or a code unit
depending on the Unicode flag.
2025-07-21 23:44:18 +02:00
Timothy Flynn
2dfcc4c307 LibRegex: Compare code units (not code points) in non-Unicode char range 2025-07-21 23:44:18 +02:00
Timothy Flynn
9582895759 AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String 2025-07-18 12:45:38 -04:00
Ali Mohammad Pur
5b45223d5f LibRegex: Account for uppercase characters in insensitive patterns 2025-07-12 11:26:23 +02:00
Shannon Booth
bd6581fe22 LibRegex: Correctly use ClassSetReservedPunctuator in ClassSetCharacter
We had typo'd using ClassSetReservedDoublePunctuator which was
resulting in a parse error for the regex:

([^\\:]+?)

With the 'v' flag set.

Co-Authored-By: Ali Mohammad Pur <mpfard@serenityos.org>
2025-07-10 11:41:02 +02:00
ayeteadoe
25f5936dee CMake: Rename serenity_* helper functions/macros to ladybird_* 2025-07-03 23:19:41 +02:00
Timothy Flynn
62d9a84b8d AK+Everywhere: Replace custom number parsers with fast_float
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float

However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.

We can also use fast_float for integer parsing.
2025-07-03 09:51:56 -04:00
Timothy Flynn
9fc3e72db2 AK+Everywhere: Allow lonely UTF-16 surrogates by default
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
2025-07-03 09:51:56 -04:00
Timothy Flynn
86b1c78c1a AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.

This also adds a UDL to construct a Utf16View from a string literal:

    auto string = u"hello"sv;

This let's us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
2025-07-03 09:51:56 -04:00
Ali Mohammad Pur
b0e471228d LibRegex: Avoid use-after-return of MatchState in 'is_an_eligible_jump'
The opcode may have last been accessed by
block_satisfies_atomic_rewrite_precondition, which would set it to a
state that no longer exists.
Set the state to the correct one unconditionally to ensure we're looking
at the right value.
Fixes #5145.
2025-06-24 18:43:01 +02:00
Ali Mohammad Pur
2947ae7d6e LibRegex: Move required bytecode.flatten() outside optimization function
Not running the optimization passes should not leave the bytecode in a
broken state. Fixes #5146.
2025-06-24 18:43:01 +02:00
aplefull
486602e796 LibRegex: Fix handling of + quantifier with zero-width matches
Small change that allows quantifiers using Fork* forms (e.g., +) to
succeed after one match, even if that match has zero width.
2025-06-02 15:52:26 +02:00
Ali Mohammad Pur
cfc241f61d LibRegex: Make the trie rewrite optimisation maintain the alt order
This is required by the spec.
2025-05-21 14:28:45 +02:00
Ali Mohammad Pur
2eccd68ba5 LibRegex: Document the append_alternative optimisation a bit 2025-05-21 14:28:45 +02:00
Timothy Flynn
7280ed6312 Meta: Enforce newlines around namespaces
This has come up several times during code review, so let's just enforce
it using a new clang-format 20 option.
2025-05-14 02:01:59 -06:00
ayeteadoe
a3754a7bf1 LibRegex: Annotate classes with export macro for hidden visibility
This fix demos the gradual opt-in migration process libraries can
take to switch to explicit symbol exports via the FOO_API macros.
2025-05-12 03:22:23 -06:00
Andrew Kaster
3dd2fbd041 LibRegex: Move StringTable ctor/dtor out of line
This also moves the next_serial class static into a file scope static.
The public class static was causing visibility issues with certain Linux
builds when hidden visibility was enabled. However, the current design
makes more sense anyway :^).
2025-05-12 03:22:23 -06:00
ayeteadoe
0253342c1a LibRegex: Use NO_UNIQUE_ADDRESS in RegexMatch for Windows support
Clang's `x86_64-pc-windows-msvc` target requires
`[[msvc::no_unique_address]]`, which is properly set in the
`NO_UNIQUE_ADDRESS` macro in `AK/Platform.h`. Without this, building
on Windows fails due to `-Wunknown-attributes`.
2025-05-12 03:22:23 -06:00
Ali Mohammad Pur
022cd1adca LibRegex: Use the right offset when patching jumps through fork-trees
Fixes #4474.
2025-04-27 12:16:15 +02:00
Ali Mohammad Pur
fca1d33fec LibRegex: Correctly calculate the target for Repeat in table alts
Fixes a bunch of websites breaking because we now verify jump offsets by
trying to remove 0-offset jumps.
This has been broken for a good while, it was just rare to see Repeat
inside alternatives that lended themselves well to tree alts.
2025-04-24 01:17:27 -06:00
Ali Mohammad Pur
4b9abdb963 LibRegex: Remove useless jumps (Jump* +0) before running opts
This leads to some more significant performance increases on the simple
/<script|<style|<link/ regex in speedometer (~2x)
2025-04-23 22:57:49 +02:00
Ali Mohammad Pur
ec0836c9ea LibRegex: Don't blindly treat multi-target tree jumps as a single jump
The tree generation was broken, we just didn't notice it because it was
very rarely being picked for more complex bytecodes.
2025-04-23 22:57:49 +02:00
Ali Mohammad Pur
09eb28ee1d LibRegex: Better estimate the cost of laying out alts as a chain
Previously we were counting the total number of *nodes* in the tree for
the chain cost, which greatly underestimated its cost when large
bytecode entries were present,
This commit switches to estimating it using the total bytecode *size*,
which is a closer value to the true cost than the tree node count.

This corresponds to a ~4x perf improvement on /<script|<style|<link/ in
speedometer.
2025-04-23 22:57:49 +02:00
Ali Mohammad Pur
eea81738cd AK+Everywhere: Recognise that surrogates in utf16 aren't all that common
For the slight cost of counting code points when converting between
encodings and a teeny bit of memory, this commit adds a fast path for
all-happy utf-16 substrings and code point operations.

This seems to be a significant chunk of time spent in many regex
benchmarks.
2025-04-23 07:56:02 -06:00
Ali Mohammad Pur
3b4a184f1a LibRegex: Avoid hashing the state hashes again
We already had a really nice hash that had a single issue, this commit
fixes that and makes it *the* hash for the hash table, so we avoid
double-hashing and making a long chain.
This is an easy 10% perf gain.
2025-04-18 17:09:27 +02:00
Ali Mohammad Pur
446a453719 LibRegex: Pull out the first compare to avoid unnecessary execution
This adds a fast-path to drop view indices we know will not match
immediately without going through the regex VM.
2025-04-18 17:09:27 +02:00
Ali Mohammad Pur
76f5dce3db LibRegex: Flatten capture group list in MatchState
This makes copying the capture group COWVector significantly cheaper,
as we no longer have to run any constructors for it - just memcpy.
2025-04-18 17:09:27 +02:00