String::from_utf8_with_replacement_character is equivalent to
https://encoding.spec.whatwg.org/#utf-8-decode from the encoding spec,
so we can simply call through to it.
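In sketch form, the decoder now just delegates (a minimal sketch assuming
the usual `Decoder::to_utf8` shape; exact signatures may differ):

```cpp
ErrorOr<String> UTF8Decoder::to_utf8(StringView input)
{
    // https://encoding.spec.whatwg.org/#utf-8-decode
    return String::from_utf8_with_replacement_character(input);
}
```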
(cherry picked from commit 0b864bef6040fa66f6719bf06898e310d4c5c02f)
Implements the corresponding encoders and selects the appropriate one
when encoding URL search params. If an encoder for the given encoding
cannot be found, we fall back to UTF-8.
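The selection logic, sketched under assumed LibTextCodec names
(`encoder_for` returning an empty optional when the label has no
encoder):

```cpp
// Pick the encoder for the requested encoding; fall back to UTF-8
// when no encoder exists for it.
auto encoder = TextCodec::encoder_for(encoding);
if (!encoder.has_value())
    encoder = TextCodec::encoder_for("utf-8"sv);
```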
(cherry picked from commit 72d0e3284b604c4c1373fb019250cdf5bd492300)
This way, we still perform UTF-8 validation, but don't go through the
slow generic code path that rebuilds the decoded string one code point
at a time.
This was a bottleneck when loading a canned copy of reddit.com, which
ended up being ~120 MiB in size.
- Time spent decoding UTF-8 before this change: 1192 ms
- Time spent decoding UTF-8 after this change: 154 ms
That's still a long time, but 7.7x faster is nothing to sneeze at! :^)
Note that if the input fails UTF-8 validation, we still fall back to
the slow path and insert replacement characters per the WHATWG Encoding
spec: https://encoding.spec.whatwg.org/#utf-8-decode
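A sketch of the two paths (the slow-path helper name is hypothetical):

```cpp
// Fast path: the input is already valid UTF-8, so copy the bytes in bulk.
if (Utf8View(input).validate())
    return String::from_utf8(input);
// Slow path: rebuild the string one code point at a time, emitting
// U+FFFD for each invalid sequence, per the Encoding spec.
return decode_with_replacement_characters(input);
```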
(cherry picked from commit 1a46d8df5fc81eb2c320d5c8a5597285d3d8fb3a)
The Encoding specification maps ISO-8859-1 to windows-1252 and expects
the windows-1252 translation table to be used, which differs from
ISO-8859-1 in the byte range 0x80-0x9F.
Other contexts expect to get the actual ISO-8859-1 encoding, with 1-to-1
mapping to U+0000-U+00FF, when requesting it.
`decoder_for_exact_name` is introduced, which skips the alias-to-encoding-name
mapping performed by `get_standardized_encoding`.
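Illustratively (return types assumed):

```cpp
// Through the alias table, "iso-8859-1" resolves to windows-1252, so
// bytes 0x80-0x9F decode to the Windows punctuation code points.
auto web_decoder = TextCodec::decoder_for("iso-8859-1"sv);
// Bypassing get_standardized_encoding yields true Latin-1, where each
// byte N maps 1-to-1 to code point U+00NN.
auto exact_decoder = TextCodec::decoder_for_exact_name("ISO-8859-1"sv);
```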
(cherry picked from commit 6b2c4599017f512279cb26c0d3c48aa5a9453007)
UTF8Decoder was already converting invalid data into replacement
characters as it decoded, so we know for sure we have valid UTF-8 by
the time conversion is finished.
This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
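The pattern looks roughly like this (the `process` callback shape is
assumed):

```cpp
StringBuilder builder;
TRY(decoder.process(input, [&](u32 code_point) {
    // Invalid input has already been replaced with U+FFFD by the decoder.
    return builder.try_append_code_point(code_point);
}));
// The builder now holds known-valid UTF-8, so skip the second
// validation pass that to_string() would perform.
return builder.to_string_without_validation();
```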
The UTF-8 decoder currently crashes if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all other decoders
and replace invalid sequences with U+FFFD. This is required by the web.
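Illustrative expected behavior (API shape assumed):

```cpp
auto decoder = TextCodec::decoder_for("utf-8"sv);
// 0xC0 is an invalid lead byte and 0x80 a stray continuation byte;
// each now decodes to U+FFFD instead of crashing.
auto string = TRY(decoder->to_utf8("\xC0\x80"sv));
// string == "\uFFFD\uFFFD"
```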
Using `char` causes bytes equal to or greater than 0x80 to be treated as
negative values, producing incorrect results when they are implicitly
cast to `u32`.
For example, `atob` in LibWeb uses this decoder to convert non-ASCII
values to UTF-8; since non-ASCII values are >= 0x80, it produced
incorrect results in cases such as:
```js
Uint8Array.from(atob("u660"), c => c.charCodeAt(0));
```
This used to produce [253, 253, 253] instead of [187, 174, 180]: each
byte ended up decoded as U+FFFD (65533), which `Uint8Array.from`
truncates modulo 256 to 253.
Required by Cloudflare's IUAM challenges.
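A sketch of the fix (loop shape assumed, not the exact decoder code):

```cpp
// Iterate the input as u8 so bytes >= 0x80 stay positive when widened
// to u32. Previously this was effectively `for (char byte : ...)`.
for (u8 byte : input.bytes())
    TRY(on_code_point(static_cast<u32>(byte)));
// With a signed char, 0xBB sign-extends to 0xFFFFFFBB, an invalid code
// point that ends up replaced with U+FFFD downstream.
```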