ladybird

mirror of https://github.com/LadybirdBrowser/ladybird synced 2026-04-25 17:25:08 +02:00

Author	SHA1	Message	Date
Aliaksandr Kalenik	7685a5e14a	LibRegex: Inline \w and \d ranges in legacy positive char classes Previously, a character class containing any builtin (\d, \w, \s) forced the compiler down the slow "complex class" path, which emits a disjunction of alternatives and backtracks at runtime. For non-unicode, non-unicode-sets, non-negated classes, \w and \d can be inlined as their raw ASCII code-point ranges. The resulting class stays on the fast path and compiles into a single sorted CharClass instruction. The unicode/unicode_sets and negation guards are required for correctness: with the /u + /i flags, \w gains non-ASCII members via case folding (e.g. U+017F, U+212A), and negated classes have a separate, smarter compilation path.	2026-04-21 16:36:38 +02:00
Andreas Kling	e4008393d3	LibRegex: Extract broader required-literal prefilters Analyze parsed patterns for literal substrings that every match must contain and feed the longest one into the existing required-literal hint. This lets the VM reject obvious misses for optional-prefix extractors, literal-prefix parsers, and common-substring alternations without adding more shape-specific fast paths. Use ASCII case-insensitive probes when the required literal comes from a legacy `/i` pattern, and skip modifier groups that change ignore-case so the hint does not over-constrain scoped flag behavior. Keep the analysis within the same 64-code-unit budget that the VM hint already uses so large fixed quantifiers and long literal alternations do not blow up compile time. Add TestRegex coverage for assignment extractors, ASCII ignore- case prefilters, common-substring alternations, and the compile- time regression shapes. TestRegex passes, the long-pattern js probes behave correctly, and regex benchmark spot checks stay fast.	2026-03-29 16:06:57 +02:00
Andreas Kling	0f46413baa	LibRegex: Speed up `\bliteral\b` fast paths Extract whole-pattern `\bliteral\b` cases outside Unicode modes and handle them with literal search plus ASCII word- boundary checks. This keeps the optimization narrow while still covering the token-style patterns that show up often in our regex workloads. Keep Unicode word-boundary patterns on the VM path, since `\b` semantics change there. Add tests for plain matching, ASCII ignore-case matching, and the Unicode fallback case.	2026-03-29 16:06:57 +02:00
Andreas Kling	5b8af39167	LibRegex: Speed up ASCII ignore-case literal alternations Extend the literal alternation fast path to cover ASCII-only `/i` patterns outside Unicode modes. Search candidate start positions with a folded first-code-unit set, then verify the alternatives in source order to preserve ECMAScript semantics. Keep Unicode ignore-case alternations on the VM path, since they can match non-ASCII code points through canonicalization. Add tests for mixed-prefix alternatives, source-order selection, and the Unicode fallback behavior.	2026-03-29 16:06:57 +02:00
Andreas Kling	72a3dce507	LibRegex: Speed up ASCII ignore-case literal search Whole-pattern ignore-case literals already bypass the VM, but the search path still checked every start position linearly with full case-fold comparisons. For ASCII literals we can cheaply scan for the first code unit and only verify the rest at candidate positions. This keeps the existing non-ASCII fallback intact while speeding up common benchmark shapes such as `/puebzr/i`, `/zfvr/gi`, and `/##yv22##/gi`. Add tests covering both alphabetic and punctuation-led ASCII literals under ignore-case matching. TestRegex passes. In Build/release js microbenchmarks, representative cases dropped from about 47ms to 38ms, 95ms to 63ms, and 121ms to 79ms.	2026-03-29 16:06:57 +02:00
Andreas Kling	3efe8043f7	LibRegex: Optimize `(^\|literal)` split prefixes Patterns like `(?:^\|;)\s*foo=...` can only start matching at input start or at occurrences of the separator, but the generic start-position loop still entered the VM at each byte and paid the leading split/backtrack cost on every miss. Teach the start-position analysis to recognize this `(^\|literal)` shape and jump straight to those candidate positions. Keep the optimization narrow: wider literal sets would need a single-pass scanner, and rescanning once per literal would make miss-heavy alternations quadratic. Add a LibRegex test for the cookie-style prefix. TestRegex still passes, and a release js benchmark exercising this shape remains fast.	2026-03-29 16:06:57 +02:00
Andreas Kling	d6b8a5c872	LibRegex: Restore broad ECMA-262 test coverage Port a broad slice of the old TestRegex.cpp coverage to the new ECMAScriptRegex API so we keep exercising the Rust engine with the same kinds of parser and matcher edge cases that used to live in the legacy C++ suite. Restore checks for parser acceptance and rejection, lookbehind captures, quantified alternations, zero-width backreferences, undefined backreferences, optional groups with empty matches, modifier groups, Unicode properties and Unicode sets, empty-match loops, and long fork chains. Leave out the old Posix tests and bytecode-dump assertions that were specific to the removed engine internals. One old boolean expectation for duplicate named groups also did not translate cleanly to the new search-style API, so keep the surrounding duplicate-name coverage in capture-focused tests instead. Verify the port with ./Build/sanitizers/bin/TestRegex.	2026-03-27 17:32:19 +01:00
Andreas Kling	dc2e9bbe91	LibRegex: Avoid widening ASCII regex input Teach the Rust matcher to execute directly on ASCII-backed input. Make the VM and literal fast paths generic over an input trait so we can monomorphize separate ASCII and WTF-16 execution paths without duplicating the regex semantics. Add ASCII-specific FFI entry points and have the C++ bridge dispatch to them whenever Utf16View carries ASCII storage. This removes the per-match widening step from the hot path for exec(), test(), and find_all(), which is exactly where LibJS often hands us pure ASCII strings in 8-bit form. Keep the compiled representation and reported capture offsets in UTF-16 code units so the observable JavaScript behavior stays unchanged.	2026-03-27 17:32:19 +01:00
Andreas Kling	d7bf9d3898	LibRegex: Remove the legacy C++ ECMA-262 engine Delete the old C++ ECMA-262 parser, optimizer, and matcher now that all in-tree users compile and execute through `ECMAScriptRegex`. Stop building the legacy engine, remove its source files and the POSIX-only fuzzers that depended on it, and update the remaining LibRegex tests to target the Rust-backed facade instead of the deleted implementation. Clean up the last includes, comments, and helper paths that only existed to support the old backend. After this commit LibRegex has a single ECMAScript engine in-tree, eliminating duplicated maintenance and unifying future regex work.	2026-03-27 17:32:19 +01:00
Ali Mohammad Pur	62348f1214	LibRegex: Replace virtual-inheritance dispatch with a switch interpreter	2026-03-20 16:10:25 -05:00
aplefull	6f1b7c8d50	LibRegex: Track optional capture groups for match_length_minimum Backreferences can match the empty string when the referenced group didn't participate in the match, so we shouldn't add their length to the match_length_minimum, as it makes us skip valid matches.	2026-02-26 13:50:11 +01:00
aplefull	53a98f26d4	LibRegex: Exclude lookahead assertions from match_length_minimum Lookaheads are zero-width assertions and should not affect the minimum match length.	2026-02-26 13:50:11 +01:00
aplefull	db76c1e27c	LibRegex: Account for case-insensitive matching in optimizer Optimizer wasn't considering case-insensitive mode when checking for overlap between the repeated element and what follows. So patterns like `/a*A\d/i` failed to match because 'a' and 'A' weren't seen as overlapping. We compare them in a case-insensitive way now, when i flag is set.	2026-02-26 13:50:11 +01:00
aplefull	292c0cc486	LibRegex: Detect overlapping character classes and ranges in optimizer `range_contains()` checked if an lhs_range was contained within the query range, rather than checking for overlap. This caused patterns like `/A[A-Z]/` to fail matching "A" because the optimizer didn't detect that 'A' overlaps with [A-Z]. And `char_class_contains()` only checked if two character classes were identical, not if they overlapped. So patterns like `/\d\w/` failed to match "1" because \d and \w were not recognized as overlapping, even though all digits are word characters.	2026-02-26 13:50:11 +01:00
Ali Mohammad Pur	6aba31ba13	LibRegex: Add some FileCheck-like tests to ensure opts don't break	2026-02-07 14:09:56 +01:00
Ali Mohammad Pur	fedf0f78ca	LibRegex: Reject RSeekTo crossing the current-to-EOL boundary	2026-02-07 14:09:56 +01:00
aplefull	e4572aa9d7	LibRegex: Add support for regex modifiers This commit implements the regexp-modifiers proposal. It allows us to use modification of i,m,s flags within groups using `(?flags:subpattern)` and `(?flags-flags:subpattern)` syntax.	2026-01-16 15:00:00 +01:00
aplefull	6ce312e22f	LibRegex: Prevent empty matches in optional quantifiers Step 2.b of the RepeatMatcher states that once minimum repetitions are satisfied, empty matches should not be considered for further repetitions. This was not being enforced for optional quantifiers like `?`, so we had extra capture group matches.	2026-01-16 01:11:24 +01:00
mikiubo	535d2476a7	LibRegex: Implement proper lookbehind via new StepBack opcodes This introduces a new mechanism for evaluating lookbehind assertions by adding four new bytecode opcodes: SetStepBack, IncStepBack, CheckStepBack, and CheckSavedPosition. These opcodes replace the previous GoBack-based approach and enables correct handling of variable-length lookbehind patterns, where the match length cannot be known statically. Track lookbehind greediness in the parser and propagate it to bytecode generation. Allow controlled backtracking in lookbehind bodies while avoiding incorrect captures during step-back execution. Partially fix issue: #3459	2026-01-11 23:24:49 +01:00
Ali Mohammad Pur	2677338f43	LibRegex: Process RSeekTo candidates in the correct order	2026-01-07 00:14:02 +01:00
Ali Mohammad Pur	9668927dfc	LibRegex: Don't generate duplicate results for /.*/ patterns Since the code pattern may span multiple blocks, this can generate duplicate results; keep the last one to avoid corrupting the bytecode.	2026-01-06 19:09:27 +01:00
Ali Mohammad Pur	363f1f6568	LibRegex: Correctly calculate ForkIf target offset in tree alternatives	2026-01-06 19:09:27 +01:00
Ali Mohammad Pur	3f35d84785	LibRegex+LibJS: Flatten the bytecode buffer before regex execution This makes it so we don't have to unnecessarily check for having a flattened buffer; significant performance increase.	2026-01-05 18:22:11 +01:00
aplefull	3e391bdb2d	LibRegex: Use token-state restoration in character class parsing Previously, we used restoration based on character position in parser. This caused the lexer to re-tokenize from the middle of multi-character tokens like escape sequences, and led to incorrect parse failures for patterns like `[\[\]]`. We would backtrack to before the first `\[` token, then re-lex the `[` as a separate token instead of part of the `\[` escape. Now we save and restore the actual token object along with the lexer index, so we keep correct token state when backtracking.	2025-12-23 11:04:16 +01:00
aplefull	ff06a4a9e5	LibRegex: Fix negated class validation for nested string properties We were incorrectly checking for negated character class when string properties appeared in nested classes. Now we track negation state in the parser and correctly reject invalid string properties in negated classes.	2025-12-23 11:04:16 +01:00
aplefull	1b570fcd61	LibRegex: Correct negated character class escapes behavior Patterns like /[^\S]/ should match whitespace characters, but previously would fail to match. The position would advance twice: once during the character class comparison, and again at the end when temporary_inverse was reset. This caused matches to be skipped incorrectly. Now we advance at the end only if position hasn't already changed during the loop.	2025-12-23 11:04:16 +01:00
aplefull	52a3c19c0a	LibRegex: Clamp large quantifier values instead of rejecting them Fixes parsing of regex quantifiers with extremely large numeric values. Previously, very large quantifiers would fail to parse, but Chrome and Firefox both clamp such large values to 2^31-1 instead of rejecting them. So now we do the same.	2025-12-23 11:04:16 +01:00
Ali Mohammad Pur	57ef949b61	LibRegex: Account for nested 'or' compare ops Closes #6647.	2025-11-01 17:49:57 +01:00
aplefull	8c9c2ee289	LibRegex: Track local compares in nested classes	2025-11-01 14:38:08 +01:00
aplefull	c4eef822de	LibRegex: Fix backreferences to undefined capture groups Fixes handling of backreferences when the referenced capture group is undefined or hasn't participated in the match. CharacterCompareType::NamedReference is added to distinguish numbered (\1) from named (\k<name>) backreferences. Numbered backreferences use exact group lookup. Named backreferences search for participating groups among duplicates.	2025-10-16 16:37:54 +02:00
Callum Law	8ada4b7fdc	LibRegex: Account for opcode size when calculating incoming jump edges Not accounting for opcode size when calculating incoming jump edges meant that we were merging nodes where we otherwise shouldn't have been, for example /.a\|.b/.	2025-07-28 17:06:58 +02:00
Ali Mohammad Pur	c7ad6cd508	LibRegex: Use code unit length in more places that apply Finishes what `7f6b70fafb` started. Having one part use length and another code unit length lead to crashes, the added test ensures we don't mess that up again.	2025-07-24 23:09:01 +02:00
aplefull	e2f8f5a350	LibRegex: Fix capture groups in quantified alternations This prevents empty matches from overwriting non-empty captures in quantified alternations. Fixes patterns like (a\|a?)+ where the optional branch would incorrectly overwrite meaningful captures with empty strings.	2025-07-24 10:40:16 +02:00
Timothy Flynn	2dfcc4c307	LibRegex: Compare code units (not code points) in non-Unicode char range	2025-07-21 23:44:18 +02:00
Timothy Flynn	9582895759	AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String	2025-07-18 12:45:38 -04:00
Ali Mohammad Pur	5b45223d5f	LibRegex: Account for uppercase characters in insensitive patterns	2025-07-12 11:26:23 +02:00
Shannon Booth	bd6581fe22	LibRegex: Correctly use ClassSetReservedPunctuator in ClassSetCharacter We had typo'd using ClassSetReservedDoublePunctuator which was resulting in a parse error for the regex: ([^\\:]+?) With the 'v' flag set. Co-Authored-By: Ali Mohammad Pur <mpfard@serenityos.org>	2025-07-10 11:41:02 +02:00
ayeteadoe	25f5936dee	CMake: Rename serenity_* helper functions/macros to ladybird_*	2025-07-03 23:19:41 +02:00
aplefull	486602e796	LibRegex: Fix handling of + quantifier with zero-width matches Small change that allows quantifiers using Fork* forms (e.g., +) to succeed after one match, even if that match has zero width.	2025-06-02 15:52:26 +02:00
Ali Mohammad Pur	cfc241f61d	LibRegex: Make the trie rewrite optimisation maintain the alt order This is required by the spec.	2025-05-21 14:28:45 +02:00
ayeteadoe	11bca38f91	CMake: Build LibRegex tests in Tests/LibRegex not Meta/Lagom As LibRegex was not specified in TEST_DIRECTORIES, the existing Tests/LibRegex subdirectory was not actually included during configuration. Also the RegexLibC test has not been needed since migration away from Serenitys LibC was done, so that test has been fully removed. I also renamed the Regex.cpp test to TestRegex.cpp to match the naming convention of most test targets.	2025-05-14 02:05:12 -06:00
Ali Mohammad Pur	022cd1adca	LibRegex: Use the right offset when patching jumps through fork-trees Fixes #4474.	2025-04-27 12:16:15 +02:00
Ali Mohammad Pur	fca1d33fec	LibRegex: Correctly calculate the target for Repeat in table alts Fixes a bunch of websites breaking because we now verify jump offsets by trying to remove 0-offset jumps. This has been broken for a good while, it was just rare to see Repeat inside alternatives that lended themselves well to tree alts.	2025-04-24 01:17:27 -06:00
Ali Mohammad Pur	76f5dce3db	LibRegex: Flatten capture group list in MatchState This makes copying the capture group COWVector significantly cheaper, as we no longer have to run any constructors for it - just memcpy.	2025-04-18 17:09:27 +02:00
Andreas Kling	96f1f15ad6	LibRegex: Remove unused Utf8View/Utf32View support in RegexStringView	2025-04-16 10:04:50 +02:00
Andreas Kling	5308d77600	LibRegex: Don't use Optional<T> inside regex::Match This prevented Match from being trivially copyable, which we want it to be for fast Vector copying.	2025-04-14 17:40:13 +02:00
Andreas Kling	54edf29f1b	LibRegex: Make Match::capture_group_name an index into the string table This removes another Match member that required destruction. The "API" for accessing the strings is definitely a bit awkward. We'll think of something nicer eventually.	2025-04-14 17:40:13 +02:00
Ali Mohammad Pur	69050da929	LibRegex: Merge inverse string table mappings separately	2025-04-06 20:21:16 +02:00
Ali Mohammad Pur	299b9ca572	LibRegex: Check backreference index before looking it up If a backref happens after it's cleared, the slot may be cleared already.	2025-04-06 20:21:16 +02:00
Andreas Kling	6b6d3b32a4	LibRegex: Remove the StringCopyMatches mode This mode made a lot of incorrect assumptions about string lifetimes, and instead of fixing it, let's just remove it and tweak the few unit tests that used it.	2025-03-24 22:27:17 +00:00

1 2 3 4

176 Commits