Commit Graph

35 Commits

Author SHA1 Message Date
Jelle Raaijmakers
04719e6491 LibRegex: Use Fibonacci hashing for regex matches
Compared to the old boost-style combine, this reduces collisions by 36%
after folding for ranges SP=0..300, IP=0..750 and reps=0..50.
2026-02-24 13:24:58 +01:00
aplefull
aeec2c804c LibRegex: Implement Unicode case-insensitive matching
Previously, case-insensitive regex matching used ASCII-only case
conversion (to_ascii_lowercase) even for Unicode characters.

Now we implement Canonicalize abstract operation, so we can case-fold
Unicode characters properly during case-insensitive matching.
2026-02-16 07:51:00 -05:00
Ali Mohammad Pur
fedf0f78ca LibRegex: Reject RSeekTo crossing the current-to-EOL boundary 2026-02-07 14:09:56 +01:00
aplefull
e4572aa9d7 LibRegex: Add support for regex modifiers
This commit implements the regexp-modifiers proposal. It allows us to
use modification of i,m,s flags within groups using
`(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.
2026-01-16 15:00:00 +01:00
mikiubo
535d2476a7 LibRegex: Implement proper lookbehind via new StepBack opcodes
This introduces a new mechanism for evaluating lookbehind assertions by
adding four new bytecode opcodes: SetStepBack, IncStepBack,
CheckStepBack, and CheckSavedPosition.

These opcodes replace the previous GoBack-based approach and enables
correct handling of variable-length lookbehind patterns,
where the match length cannot be known statically.

Track lookbehind greediness in the parser and propagate it to bytecode
generation. Allow controlled backtracking in lookbehind bodies while
avoiding incorrect captures during step-back execution.

Partially fix issue: #3459
2026-01-11 23:24:49 +01:00
Ali Mohammad Pur
637d47ba30 LibRegex: Add an optimisation for replacing /.*x/ with a seek op
This will avoid some catastrophic backtracking by just skipping to 'x'.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
77d982d6fe LibRegex: Restore the pure substring search optimisation for u16view
ca2f0141f6 removed only the execution side
of this, which made it skip some optimisations for pure string searches.
This commit implements it properly for utf16 strings instead.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
e2c6918cdb LibRegex: Fuse consecutive single-char Compares into a String Compare
This avoids huge instruction decoding and dispatch overhead, ~40x
performance improvement for /(^|x)ppp/.
2026-01-05 18:22:11 +01:00
Ali Mohammad Pur
9d49fafdbf LibRegex: Add an optimisation to skip forks that cannot produce a match
...and implement it for 'start of line' checks.
This makes patterns like /(^|x)ppp/ fork-free at runtime, ~30% perf
improvement for that pattern.
2026-01-05 18:22:11 +01:00
Tim Ledbetter
4c491b8920 LibRegex: Remove unused code from RegexStringView 2025-11-26 14:33:59 +00:00
Timothy Flynn
81fc8ab8cc LibRegex: Rename a couple of RegexStringView methods for clarity
`operator[]` -> `code_point_at`
`code_unit_at` -> `unicode_aware_code_point_at`

`unicode_aware_code_point_at` returns either a code point or a code unit
depending on the Unicode flag.
2025-07-21 23:44:18 +02:00
Timothy Flynn
9582895759 AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String 2025-07-18 12:45:38 -04:00
Timothy Flynn
9fc3e72db2 AK+Everywhere: Allow lonely UTF-16 surrogates by default
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
2025-07-03 09:51:56 -04:00
Timothy Flynn
86b1c78c1a AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.

This also adds a UDL to construct a Utf16View from a string literal:

    auto string = u"hello"sv;

This let's us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
2025-07-03 09:51:56 -04:00
ayeteadoe
0253342c1a LibRegex: Use NO_UNIQUE_ADDRESS in RegexMatch for Windows support
Clang's `x86_64-pc-windows-msvc` target requires
`[[msvc::no_unique_address]]`, which is properly set in the
`NO_UNIQUE_ADDRESS` macro in `AK/Platform.h`. Without this, building
on Windows fails due to `-Wunknown-attributes`.
2025-05-12 03:22:23 -06:00
Ali Mohammad Pur
eea81738cd AK+Everywhere: Recognise that surrogates in utf16 aren't all that common
For the slight cost of counting code points when converting between
encodings and a teeny bit of memory, this commit adds a fast path for
all-happy utf-16 substrings and code point operations.

This seems to be a significant chunk of time spent in many regex
benchmarks.
2025-04-23 07:56:02 -06:00
Ali Mohammad Pur
3b4a184f1a LibRegex: Avoid hashing the state hashes again
We already had a really nice hash that had a single issue, this commit
fixes that and makes it *the* hash for the hash table, so we avoid
double-hashing and making a long chain.
This is an easy 10% perf gain.
2025-04-18 17:09:27 +02:00
Ali Mohammad Pur
76f5dce3db LibRegex: Flatten capture group list in MatchState
This makes copying the capture group COWVector significantly cheaper,
as we no longer have to run any constructors for it - just memcpy.
2025-04-18 17:09:27 +02:00
Andreas Kling
ca2f0141f6 LibRegex: Remove unused "simple substring search" optimization
This is not relevant for LibJS since it only works when the input is
UTF-8, and LibJS always provides UTF-16.
2025-04-16 10:04:50 +02:00
Andreas Kling
96f1f15ad6 LibRegex: Remove unused Utf8View/Utf32View support in RegexStringView 2025-04-16 10:04:50 +02:00
Andreas Kling
c1c3b01a6c LibRegex: Allow Vector<Match> to use trivial memcpy
Now that Match has no more members that need destruction, we can allow
Vector to memcpy them around.
2025-04-14 17:40:13 +02:00
Andreas Kling
5308d77600 LibRegex: Don't use Optional<T> inside regex::Match
This prevented Match from being trivially copyable, which we want it to
be for fast Vector copying.
2025-04-14 17:40:13 +02:00
Andreas Kling
54edf29f1b LibRegex: Make Match::capture_group_name an index into the string table
This removes another Match member that required destruction. The "API"
for accessing the strings is definitely a bit awkward. We'll think of
something nicer eventually.
2025-04-14 17:40:13 +02:00
Andreas Kling
9d47cc54f8 LibRegex: Remove unused regex::Match::string and unused constructor
This shrinks regex::Match by 8 bytes and removes a member that needs
destruction.
2025-04-14 17:40:13 +02:00
Andreas Kling
6b6d3b32a4 LibRegex: Remove the StringCopyMatches mode
This mode made a lot of incorrect assumptions about string lifetimes,
and instead of fixing it, let's just remove it and tweak the few unit
tests that used it.
2025-03-24 22:27:17 +00:00
Andreas Kling
46a5710238 LibJS: Use FlyString in PropertyKey instead of DeprecatedFlyString
This required dealing with *substantial* fallout.
2025-03-24 22:27:17 +00:00
Ali Mohammad Pur
cce000d57c LibRegex: Don't repeat the same fork again
If some state has already been tried, skip over it as it would never
lead to a match regardless.
This fixes performance/memory issues in cases like
/(a+)+b/.exec("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
or
/(a|a?)+b/...

Fixes #2622.
2025-01-17 10:13:51 +01:00
Pavel Shliak
6f81b80114 Everywhere: Include HashMap only where it's actually used 2024-12-09 12:31:16 +01:00
Timothy Flynn
93712b24bf Everywhere: Hoist the Libraries folder to the top-level 2024-11-10 12:50:45 +01:00
Andreas Kling
13d7c09125 Libraries: Move to Userland/Libraries/ 2021-01-12 12:17:46 +01:00
AnotherTest
8ba273a2f3 LibJS: Hook up Regex<ECMA262> to RegExpObject and implement `test()'
This makes RegExpObject compile and store a Regex<ECMA262>, adds
all flag-related properties, and implements `RegExpPrototype.test()`
(complete with 'lastIndex' support) :^)
It should be noted that this only implements `test()' using the builtin
`exec()'.
2020-11-27 21:32:41 +01:00
AnotherTest
dbef2b1ee9 LibRegex: Implement an ECMA262-compatible parser
This also adds support for lookarounds and individually-negated
comparisons.
The only unimplemented part of the parser spec is the unicode stuff.
2020-11-27 21:32:41 +01:00
Emanuel Sprung
6add8b9c05 LibRegex: Remove backup file, remove BOM in RegexParser.cpp, run clang-format 2020-11-27 21:32:41 +01:00
Emanuel Sprung
4a630d4b63 LibRegex: Add RegexStringView wrapper to support utf8 and utf32 views 2020-11-27 21:32:41 +01:00
Emanuel Sprung
55450055d8 LibRegex: Add a regular expression library
This commit is a mix of several commits, squashed into one because the
commits before 'Move regex to own Library and fix all the broken stuff'
were not fixable in any elegant way.
The commits are listed below for "historical" purposes:

- AK: Add options/flags and Errors for regular expressions

Flags can be provided for any possible flavour by adding a new scoped enum.
Handling of flags is done by templated Options class and the overloaded
'|' and '&' operators.

- AK: Add Lexer for regular expressions

The lexer parses the input and extracts tokens needed to parse a regular
expression.

- AK: Add regex Parser and PosixExtendedParser

This patchset adds a abstract parser class that can be derived to implement
different parsers. A parser produces bytecode to be executed within the
regex matcher.

- AK: Add regex matcher

This patchset adds an regex matcher based on the principles of the T-REX VM.
The bytecode pruduced by the respective Parser is put into the matcher and
the VM will recursively execute the bytecode according to the available OpCodes.
Possible improvement: the recursion could be replaced by multi threading capabilities.

To match a Regular expression, e.g. for the Posix standard regular expression matcher
use the following API:

```
Pattern<PosixExtendedParser> pattern("^.*$");
auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle

EXPECT(result.count == 1);
EXPECT(result.matches.at(0).view.starts_with("Well"));
EXPECT(result.matches.at(0).view.end() == "!");

result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line

EXPECT(result.count == 2);
EXPECT(result.matches.at(0).view == "Well, hello friends!");
EXPECT(result.matches.at(1).view == "Hello World!");

EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources.
```

- AK: Rework regex to work with opcodes objects

This patchsets reworks the matcher to work on a more structured base.
For that an abstract OpCode class and derived classes for the specific
OpCodes have been added. The respective opcode logic is contained in
each respective execute() method.

- AK: Add benchmark for regex

- AK: Some optimization in regex for runtime and memory

- LibRegex: Move regex to own Library and fix all the broken stuff

Now regex works again and grep utility is also in place for testing.
This commit also fixes the use of regex.h in C by making `regex_t`
an opaque (-ish) type, which makes its behaviour consistent between
C and C++ compilers.
Previously, <regex.h> would've blown C compilers up, and even if it
didn't, would've caused a leak in C code, and not in C++ code (due to
the existence of `OwnPtr` inside the struct).

To make this whole ordeal easier to deal with (for now), this pulls the
definitions of `reg*()` into LibRegex.

pros:
- The circular dependency between LibC and LibRegex is broken
- Eaiser to test (without accidentally pulling in the host's libc!)

cons:
- Using any of the regex.h functions will require the user to link -lregex
- The symbols will be missing from libc, which will be a big surprise
  down the line (especially with shared libs).

Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-11-27 21:32:41 +01:00