Add `ECMAScriptRegex`, LibRegex's C++ facade for ECMAScript regexes.
The facade owns compilation, execution, captures, named groups, and
error translation for the Rust backend, which lets callers stop
depending on the legacy parser and matcher types directly. Use it in the
remaining non-LibJS callers: URLPattern, HTML input pattern handling,
and the places in LibHTTP that only needed token validation.
Where a full regex engine was unnecessary, replace those call sites with
direct character checks. Also update focused LibURL, LibHTTP, and WPT
coverage for the migrated callers and corrected surrogate handling.
This aligns our behaviour closer to other browsers, which
_mostly_ consider file scheme URLs as opaque. For test
purposes, allow overriding this behaviour with a commandline
flag.
Previously the authority state was parsed character-by-character in a
loop, appending each character to a buffer. When an '@' character was
encountered, the entire buffer would be re-processed to extract and
percent-encode the username and password portions.
This created O(n^2) behavior for URLs with multiple '@' characters, as
each '@' would trigger reprocessing of all previously buffered content.
This commit changes the authority state parser to process the authority
section in chunks rather than character-by-character:
1. Find the next delimiter ('@', '/', '?', '#', or '\' for special URLs)
2. Process the entire chunk up to that delimiter at once
3. Directly extract username/password from the chunk without buffering
With an additional change of switching to:
iterator_at_byte_offset_without_validation (which does not have any
loops), this reduces the time complexity to O(n) and the included
test found by fuzzing to actually complete parsing :^)
The same-origin domain check always returned true if only the scheme
matched. This was because of missing steps to check for the origin's
domain, which didn't exist. Add this concept to URL::Origin, even
though we do not use it at this stage in document.domain setter.
Co-Authored-By: Luke Wilde <luke@ladybird.org>
Instead of '*', which is a nonsensical value. Similar to what we do
for determining the public suffix, if no match could be made via the
PSL algorithm, then take everything after the second dot as
the registrable domain.
This prevents us from considering e.g. b.b.example and
a.example as the same site.
For the AO defined in the URL specification, in the case the
domain does not match against the PSL, we should be returning
the TLD. This fixes a crash for a bunch of WPT tests using the
Document.domain setter when the test is being served by WPT
locally.
We should be doing similar logic in registrable_domain, but that
unfortunately runs into some other issues, so just leave a FIXME
for now.
No code now relies on using URL's valid state.
A URL can still be _technically_ invalid through use of the URL
constructor or by directly changing URL fields.
However, all URLs should be constructed through the URL parser,
and we should ideally be getting rid of the default constructor
at some stage.
Also, any code which is manually setting URL fields need to be
aware that this is full of pitfalls since there are many different
forms of canonicalization which is bypassed by not going through
the URL parser.
We were using the public suffix of the URL's host as its registrable
domain. But the registrable domain is actually the public suffix plus
one additional label.
URL::basic_parse has a subtle bug where the resulting URL is not set
to valid when StateOveride is provided and the URL parser early returns
a valid URL.
This has not surfaced as a problem so far, as the only users of the
state override API provide an already valid URL buffer and also ignore
the result of basic parsing with a state override.
However, this bug surfaces implementing the URL pattern spec, which as
part of URL canonicalization:
* Provides a dummy URL record
* Basic URL parses that URL with state override
* Checks the result of the URL parser to validate the URL
While we could set URL validity on every early return of the URL parser
during state override, it has been a long standing FIXME around the code
to try and remove the awkward validity state of the URL class. So this
commit makes the first stage of this change by migrating the basic
parser API to return Optional, which also happens to make this subtle
issue not a problem any more.
A couple of reasons:
- Origin's Host (when in the tuple state) can't be null
- There's an "empty host" concept in the spec which is NOT the same as a
null Host, and that was confusing me.
There were some extra steps in there which produced wrong results for
relative file URLs.
Fixes 7 test cases in: https://wpt.live/url/url-constructor.any.html
We also need to adjust the test results in TestURL. The behaviour tested
does not match how URL is specified to work as an absolute relative is
given.
Web specs do not return through javascript percent decoded URL path
components - but we were doing this in a number of places due to the
default behaviour of URL::serialize_path.
Since percent encoded URL paths may not contain valid UTF-8 - this was
resulting in us crashing in these places.
For example - on an HTMLAnchorElement when retrieving the pathname for
the URL of:
http://ladybird.org/foo%C2%91%91
To fix this make the URL class only return the percent encoded
serialized path, matching the URL spec. When the decoded path is
required instead explicitly call URL::percent_decode.
This fixes a crash running WPT URL tests for the anchor element on:
https://wpt.live/url/a-element.html
This is a fetching AO and is only used by LibWeb in the context of fetch
tasks. Move it to LibWeb with other fetch methods.
The main reason for this is that it requires the use of other LibWeb AOs
such as the forgiving Base64 decoder and MIME sniffing. These AOs aren't
available within LibURL.
This URL library ends up being a relatively fundamental base library of
the system, as LibCore depends on LibURL.
This change has two main benefits:
* Moving AK back more towards being an agnostic library that can
be used between the kernel and userspace. URL has never really fit
that description - and is not used in the kernel.
* URL _should_ depend on LibUnicode, as it needs punnycode support.
However, it's not really possible to do this inside of AK as it can't
depend on any external library. This change brings us a little closer
to being able to do that, but unfortunately we aren't there quite
yet, as the code generators depend on LibCore.