get_standardized_encoding() implements a web spec, and it shouldn't
return a value for PDFDocEncoding.
Change the calling code to use decoder_for_exact_name(), which can
still find the decoder by the name "PDFDocEncoding".
(Web content always goes through decoder_for(), which calls
decoder_for_exact_name() only if get_standardized_encoding() returns
a value, so the web can no longer access "PDFDocEncoding", and it's
not meant to.)
See also #24594, especially the fourth commit.
Change the utf-8 and utf-16be code in Document.cpp to use
decoder_for_exact_name() for consistency as well. This part is a no-op.
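For context, a rough sketch of the two lookup paths; the signatures and shapes below are assumptions for illustration, not the exact LibTextCodec declarations:

```cpp
// Rough sketch of the two lookup paths (signatures are assumptions).
Optional<Decoder&> decoder_for_exact_name(StringView name);       // raw table lookup; knows "PDFDocEncoding"
Optional<StringView> get_standardized_encoding(StringView label); // Encoding spec label mapping; no "PDFDocEncoding"

Optional<Decoder&> decoder_for(StringView label)
{
    // Web content always comes through here, so labels the Encoding spec
    // doesn't define (like "PDFDocEncoding") never reach the exact-name lookup.
    if (auto encoding = get_standardized_encoding(label); encoding.has_value())
        return decoder_for_exact_name(*encoding);
    return {};
}

// Non-web callers now ask for the decoder by its exact name instead:
auto pdf_decoder = decoder_for_exact_name("PDFDocEncoding"sv);
```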
String::from_utf8_with_replacement_character is equivalent to
https://encoding.spec.whatwg.org/#utf-8-decode from the encoding spec,
so we can simply call through to it.
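A minimal sketch of what that looks like, assuming the decoder exposes a whole-buffer to_utf8() and propagates errors via ErrorOr:

```cpp
// Minimal sketch (signature assumed): delegate straight to the AK helper,
// which implements https://encoding.spec.whatwg.org/#utf-8-decode
ErrorOr<String> UTF8Decoder::to_utf8(StringView input)
{
    return String::from_utf8_with_replacement_character(input);
}
```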
(cherry picked from commit 0b864bef6040fa66f6719bf06898e310d4c5c02f)
This way, we still perform UTF-8 validation, but don't go through the
slow generic code path that rebuilds the decoded string one code point
at a time.
This was a bottleneck when loading a canned copy of reddit.com, which
ended up being ~120 MiB large.
- Time spent decoding UTF-8 before this change: 1192 ms
- Time spent decoding UTF-8 after this change: 154 ms
That's still a long time, but 7.7x faster is nothing to sneeze at! :^)
Note that if the input fails UTF-8 validation, we still fall back to
the slow path and insert replacement characters per the WHATWG Encoding
spec: https://encoding.spec.whatwg.org/#utf-8-decode
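The shape of the change is roughly this; the helper names are illustrative, not the exact ones in the tree:

```cpp
// Illustrative sketch of the fast/slow split (names are assumptions).
ErrorOr<String> decode_utf8(StringView input)
{
    if (Utf8View(input).validate()) {
        // Fast path: the input is already valid UTF-8, so take the bytes
        // wholesale instead of re-appending one code point at a time.
        return String::from_utf8(input);
    }
    // Slow path: rebuild the string code point by code point, substituting
    // U+FFFD for invalid sequences per the Encoding spec.
    return decode_utf8_with_replacement_characters(input); // hypothetical helper
}
```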
(cherry picked from commit 1a46d8df5fc81eb2c320d5c8a5597285d3d8fb3a)
The Encoding specification maps ISO-8859-1 to windows-1252 and expects
the windows-1252 translation table to be used, which differs from
ISO-8859-1 for 0x80-0x9F.
Other contexts expect to get the actual ISO-8859-1 encoding, with 1-to-1
mapping to U+0000-U+00FF, when requesting it.
`decoder_for_exact_name` is introduced, which skips the mapping from
aliases to the encoding name done by `get_standardized_encoding`.
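Byte 0x93 is a handy example of the difference in the 0x80-0x9F range; the calls below are illustrative only:

```cpp
// Illustrative only: the same label resolves to two different decoders.
auto web_decoder   = TextCodec::decoder_for("ISO-8859-1"sv);            // Encoding spec mapping -> windows-1252
auto exact_decoder = TextCodec::decoder_for_exact_name("ISO-8859-1"sv); // the actual ISO-8859-1 table

// For byte 0x93:
//   windows-1252 decodes it to U+201C (LEFT DOUBLE QUOTATION MARK)
//   ISO-8859-1 maps it 1-to-1 to U+0093 (a C1 control character)
```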
(cherry picked from commit 6b2c4599017f512279cb26c0d3c48aa5a9453007)
UTF8Decoder was already converting invalid data into replacement
characters as it went, so we know for sure we have valid UTF-8 by the
time conversion is finished.
This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
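Conceptually, the tail of the decoder now looks like this (a sketch; only the to_string_without_validation() call is the point, the surrounding names are stand-ins):

```cpp
// Sketch: every appended code point is already valid (invalid input was
// turned into U+FFFD earlier), so re-validating the result is wasted work.
Vector<u32> decoded_code_points { 'H', 0xFFFD, '!' }; // stand-in for the decoder's output
StringBuilder builder;
for (auto code_point : decoded_code_points)
    builder.append_code_point(code_point);

// Skips the UTF-8 validation pass that to_string() would otherwise perform.
auto string = builder.to_string_without_validation();
```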
The UTF-8 decoder currently crashes if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all the other
decoders and replace invalid input with U+FFFD. This is required by the
web.
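A hedged example of the intended behavior; the lookup and interface shown here are assumptions:

```cpp
// Sketch of the expected behavior (API shape assumed): an invalid byte is
// replaced with U+FFFD instead of crashing the decoder.
auto decoder = TextCodec::decoder_for("utf-8"sv);
auto decoded = decoder.value().to_utf8("\xFF\x61"sv); // 0xFF can never start a valid UTF-8 sequence
// Expected: U+FFFD followed by 'a', rather than an assertion failure.
```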
Using char causes bytes equal to or greater than 0x80 to be treated as
negative values, which produces incorrect results when they are
implicitly cast to u32.
For example, `atob` in LibWeb uses this decoder to convert non-ASCII
values to UTF-8, but non-ASCII values are >= 0x80, which produced
incorrect results in cases like:
```js
Uint8Array.from(atob("u660"), c => c.charCodeAt(0));
```
This used to produce [253, 253, 253] instead of [187, 174, 180].
Required by Cloudflare's IUAM challenges.
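The underlying pitfall, as a small standalone sketch (assuming a platform where plain char is signed):

```cpp
#include <AK/Types.h> // for u8 / u32

// Sketch of the sign-extension bug: with plain (signed) char, byte 0xBB
// becomes a huge value when widened to u32; with u8 it stays 0xBB.
char signed_byte = static_cast<char>(0xBB); // -69 where char is signed
u32 bad = static_cast<u32>(signed_byte);    // 0xFFFFFFBB -- not a valid code point
u8 unsigned_byte = 0xBB;
u32 good = unsigned_byte;                   // 0x000000BB -> U+00BB as intended
```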
There were two problems:
1. They didn't handle surrogates
2. They used signed chars, leading to e.g. 0x00e4 being treated as 0xffe4
Also add a basic test that catches both issues.
There's some code duplication with Utf16CodePointIterator::operator*(),
but let's get things working first.
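For reference, a minimal sketch of the surrogate-pair combination the decoders needed, using unsigned types throughout (the variable names are illustrative):

```cpp
// Illustrative sketch: decode one code point from UTF-16BE bytes.
// U+1F600 encodes as the surrogate pair 0xD83D 0xDE00, i.e. bytes D8 3D DE 00.
u8 bytes[] = { 0xD8, 0x3D, 0xDE, 0x00 };
u16 high = (static_cast<u16>(bytes[0]) << 8) | bytes[1]; // 0xD83D (u8, not char, so no sign extension)
u16 low  = (static_cast<u16>(bytes[2]) << 8) | bytes[3]; // 0xDE00

u32 code_point;
if (high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF) {
    // Combine the pair: 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00) = 0x1F600
    code_point = 0x10000 + ((static_cast<u32>(high) - 0xD800) << 10) + (low - 0xDC00);
} else {
    // An ordinary BMP unit; a lone surrogate would become U+FFFD per the spec.
    code_point = high;
}
```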