LibJS: Do not directly append RegExp pattern code points during parse

There apparently is a bit of a disconnect between the spec asking us to construct the pattern using code points and LibRegex not being able to swallow those. Whenever we had multi-byte code points in the pattern and tried to match that in unicode mode, we would fail.

Change the parser to encode all non-ASCII code units.

Fixes 2 test262 cases in `language/literals/regexp`.
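
As a rough illustration of what "encode all non-ASCII code units" can mean, here is a minimal standalone sketch. It is not the LibJS code: `escape_non_ascii_code_units` is a hypothetical helper, it uses the standard library rather than AK types, and the `\uXXXX` output form is an assumption, since the diff below is cut off before the actual escaping logic. It also ignores the backslash tracking that the real loop does via `previous_code_unit_was_backslash`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not the LibJS implementation: copies ASCII code units
// through unchanged and rewrites every other UTF-16 code unit as a \uXXXX
// escape (assumed escape form), so the result contains only ASCII bytes.
std::string escape_non_ascii_code_units(std::u16string const& pattern)
{
    std::string result;
    for (char16_t code_unit : pattern) {
        if (code_unit <= 0x7f) {
            result.push_back(static_cast<char>(code_unit));
            continue;
        }

        // Backslash, 'u', four hex digits, and the terminating NUL.
        char buffer[7];
        std::snprintf(buffer, sizeof(buffer), "\\u%04x", static_cast<unsigned>(code_unit));
        result.append(buffer);
    }
    return result;
}
```

Under this sketch, a code point outside the Basic Multilingual Plane comes out as two consecutive escapes, one per surrogate. ECMAScript regular expressions in unicode mode read such a lead/trail escape pair as a single code point, which is presumably why escaping per code unit is enough even when the `u` or `v` flag is set.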
Author: https://github.com/gmta
Commit: https://github.com/LadybirdBrowser/ladybird/commit/5d19aacce7a
Pull-request: https://github.com/LadybirdBrowser/ladybird/pull/5548
Reviewed-by: https://github.com/alimpfard ✅
Reviewed-by: https://github.com/shannonbooth
2026-05-01 20:17:13 +02:00 · 2025-07-21 14:58:39 +02:00 · 2025-07-21 23:25:00 +00:00
parent 7f6b70fafb
commit 5d19aacce7
2 changed files with 18 additions and 11 deletions
--- a/Libraries/LibJS/Runtime/RegExpObject.cpp
+++ b/Libraries/LibJS/Runtime/RegExpObject.cpp
@@ -96,19 +96,10 @@ ErrorOr<String, ParseRegexPatternError> parse_regex_pattern(StringView pattern,
    auto utf16_pattern = Utf16String::from_utf8(pattern);
    StringBuilder builder;

-    // If the Unicode flag is set, append each code point to the pattern. Otherwise, append each
-    // code unit. But unlike the spec, multi-byte code units must be escaped for LibRegex to parse.
+    // FIXME: We need to escape multi-byte code units for LibRegex to parse since the lexer there doesn't handle unicode.
    auto previous_code_unit_was_backslash = false;
-    for (size_t i = 0; i < utf16_pattern.length_in_code_units();) {
-        if (unicode || unicode_sets) {
-            auto code_point = code_point_at(utf16_pattern, i);
-            builder.append_code_point(code_point.code_point);
-            i += code_point.code_unit_count;
-            continue;
-        }
-
+    for (size_t i = 0; i < utf16_pattern.length_in_code_units(); ++i) {
        u16 code_unit = utf16_pattern.code_unit_at(i);
-        ++i;

        if (code_unit > 0x7f) {
            // Incorrectly escaping this code unit will result in a wildly different regex than intended