Commit Graph

87 Commits

Author SHA1 Message Date
Shannon Booth
f29482b038 LibTextCodec: Implement UTF8Decoder::to_utf8 using AK::String
String::from_utf8_with_replacement_character is equivalent to
https://encoding.spec.whatwg.org/#utf-8-decode from the encoding spec,
so we can simply call through to it.

(cherry picked from commit 0b864bef6040fa66f6719bf06898e310d4c5c02f)
2024-10-16 23:56:40 -04:00
0x4261756D
61f6c6c966 LibTextCodec: Add SingleByteEncoders
They are similar to their already existing decoder counterparts.

(cherry picked from commit 96de4ef7e00f44b3f7913db221625940da7f561a)
2024-10-16 10:25:48 -04:00
BenJilks
ab96bf642b LibTextCodec: Implement iso-2022-jp encoder
Implements the `iso-2022-jp` encoder, as specified by
https://encoding.spec.whatwg.org/#iso-2022-jp-encoder

(cherry picked from commit 0ca5675d59dbcb52cedea56729de26b41074024a)
2024-10-15 22:54:51 -04:00
BenJilks
9a4156482d LibTextCodec: Implement shift_jis encoder
Implements the `shift_jis` encoder, as specified by
https://encoding.spec.whatwg.org/#shift_jis-encoder

(cherry picked from commit 08a8d67a5bfa9690eb61977f75b2ed3dc73aa341)
2024-10-15 22:54:51 -04:00
BenJilks
1739838868 LibTextCodec: Implement gb18030 and gbk encoders
Implements the `gb18030` and `gbk` encoders, as specified by
https://encoding.spec.whatwg.org/#gb18030-encoder
https://encoding.spec.whatwg.org/#gbk-encoder

(cherry picked from commit d80575a4101ab0fbc22ff2b714c74530a965cd5c)
2024-10-15 22:54:51 -04:00
BenJilks
4cd250dc2d LibTextCodec: Implement big5 encoder
Implements the `big5` encoder, as specified by
https://encoding.spec.whatwg.org/#big5-encoder

(cherry picked from commit 34c8c559c112796af0f99b48a7b88cb26633a764)
2024-10-15 22:54:51 -04:00
BenJilks
399dc388d6 LibTextCodec: Implement euc-kr encoder
Implements the `euc-kr` encoder, as specified by
https://encoding.spec.whatwg.org/#euc-kr-encoder

(cherry picked from commit 826292536c0e6f82e7173a98f1f3b24216d82fec)
2024-10-15 22:54:51 -04:00
BenJilks
4167d1214a LibTextCodec+LibURL: Implement utf-8 and euc-jp encoders
Implements the corresponding encoders, selects the appropriate one when
encoding URL search params. If an encoder for the given encoding could
not be found, fallback to utf-8.

(cherry picked from commit 72d0e3284b604c4c1373fb019250cdf5bd492300)
2024-10-15 22:54:51 -04:00
Andreas Kling
7021616873 LibTextCodec: Use String::from_utf8() when decoding UTF-8 to UTF-8
This way, we still perform UTF-8 validation, but don't go through the
slow generic code path that rebuilds the decoded string one code point
at a time.

This was a bottleneck when loading a canned copy of reddit.com, which
ended up being ~120 MiB large.

- Time spent decoding UTF-8 before this change: 1192 ms
- Time spent decoding UTF-8 after this change:  154 ms

That's still a long time, but 7.7x faster is nothing to sneeze at! :^)

Note that if the input fails UTF-8 validation, we still fall back to
the slow path and insert replacement characters per the WHATWG Encoding
spec: https://encoding.spec.whatwg.org/#utf-8-decode

(cherry picked from commit 1a46d8df5fc81eb2c320d5c8a5597285d3d8fb3a)
2024-07-27 15:13:16 -04:00
Simon Wanner
92600b3171 LibTextCodec: Use generated lookup tables for all single byte decoders
(cherry picked from commit 0ab4722cee11d27da79bb05c1d53693f39938cf6)
2024-06-27 00:27:56 +02:00
Simon Wanner
c90963161e LibTextCodec: Fix ISO-8859-1 vs. windows-1252 handling in web contexts
The Encoding specification maps ISO-8859-1 to windows-1252 and expects
the windows-1252 translation table to be used, which differs from
ISO-8859-1 for 0x80-0x9F.

Other contexts expect to get the actual ISO-8859-1 encoding, with 1-to-1
mapping to U+0000-U+00FF, when requesting it.

`decoder_for_exact_name` is introduced, which skips the mapping from
aliases to the encoding name done by `get_standardized_encoding`.

(cherry picked from commit 6b2c4599017f512279cb26c0d3c48aa5a9453007)
2024-06-27 00:27:56 +02:00
Simon Wanner
dc2e64c5b0 LibTextCodec: Fix some incorrect encoding aliases
(cherry picked from commit 46d5cf0443a7061bd7a8a9e495c3f3ee54614f1e)
2024-06-27 00:27:56 +02:00
Simon Wanner
328552a269 LibTextCodec: Bring TextCodec::get_standardized_encoding closer to spec
(cherry picked from commit 09f2d79cb10f84dfaedea61264d8a9d91bdfa17c)
2024-06-27 00:27:56 +02:00
Simon Wanner
11bb216912 LibTextCodec: Add replacement decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
7f3b457e62 LibTextCodec: Add EUC-KR decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
ded6512ca8 LibTextCodec: Add Shift_JIS decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
06f7c393b2 LibTextCodec: Add ISO-2022-JP decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
45f0ae52be LibTextCodec: Add EUC-JP decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
9943bb1d8e LibTextCodec: Add Big5 decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
2ce61fe6ea LibTextCodec: Add GBK/GB18030 decoder
Includes changes from GB-18030-2022, which are not yet included in the
Encoding Specification, but WebKit, Blink and WPT are already updated.
2024-05-31 07:56:26 +02:00
Simon Wanner
9ed52504ab LibTextCodec: Delegate to process() in default validate() implementation 2024-05-31 07:56:26 +02:00
Simon Wanner
88c2586f25 LibTextCodec: Remove unused decoder classes 2024-05-31 07:56:26 +02:00
Simon Wanner
b79815c5a5 LibTextCodec: Add x-mac-cyrillic decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
07a9435da5 LibTextCodec: Add windows-1258 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
275b89720b LibTextCodec: Add windows-1257 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
c76308c7e6 LibTextCodec: Add windows-1256 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
eb9ed10573 LibTextCodec: Add windows-1253 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
2d35687db0 LibTextCodec: Add windows-874 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
1b6878b6ca LibTextCodec: Add KOI8-U decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
1fd3a6f48c LibTextCodec: Add ISO-8859-16 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
3e882f26db LibTextCodec: Sort checks in decoder_for mostly alphabetically
Keeps checks for common encodings (Latin1 & UTF-*) at the top.
2024-05-27 20:50:50 +02:00
Simon Wanner
56241df604 LibTextCodec: Add ISO-8859-14 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
4188e328ac LibTextCodec: Add ISO-8859-13 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
cc640f4363 LibTextCodec: Add ISO-8859-10 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
d73220837e LibTextCodec: Add ISO-8859-8(-I) decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
24028e353e LibTextCodec: Add ISO-8859-7 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
01c3b8091a LibTextCodec: Add ISO-8859-6 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
763d904ad5 LibTextCodec: Add ISO-8859-5 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
c6b17320db LibTextCodec: Add ISO-8859-4 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
6c84edaaa2 LibTextCodec: Add ISO-8859-3 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
fc783199f1 LibTextCodec: Add IBM866 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
96b3c35358 LibTextCodec: Implement table based decoders as SingleByteDecoder
Instead of copy-pasting the implementation, let's use a single class.
This "Single Byte Decoder" concept even exists in the Encoding Spec :^)
2024-05-27 20:50:50 +02:00
Michal Grich
7a6d84d036 LibTextCodec: Add Windows-1250 text decoder
This commit is adding Windows-1250 decoding based on unicode.org
mapping table.
2024-04-23 16:26:16 +02:00
Andreas Kling
3c039903fb LibTextCodec+AK: Don't validate UTF-8 strings twice
UTF8Decoder was already converting invalid data into replacement
characters while converting, so we know for sure we have valid UTF-8
by the time conversion is finished.

This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
2023-12-30 13:49:50 +01:00
Nico Weber
8f47acee6a LibTextCodec: Add PDFDocEncoding decoder 2023-11-22 09:08:06 -07:00
Idan Horowitz
079c96376c LibTextCodec: Support validating encoded inputs 2023-11-17 16:02:36 +01:00
Luke Wilde
eaa4048870 LibTextCodec: Add "get output encoding" from the Encoding specification 2023-06-19 06:12:26 +02:00
Timothy Flynn
00fa23237a LibTextCodec: Change UTF-8's decoder to replace invalid code points
The UTF-8 decoder will currently crash if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all other decoders
to replace invalid code points with U+FFFD. This is required by the web.
2023-05-12 05:47:36 +02:00
Andreas Kling
a504ac3e2a Everywhere: Rename equals_ignoring_case => equals_ignoring_ascii_case
Let's make it clear that these functions deal with ASCII case only.
2023-03-10 13:15:44 +01:00
Luke Wilde
e864444fe3 LibTextCodec/Latin1: Iterate over input string with u8 instead of char
Using char causes bytes equal to or over 0x80 to be treated as a
negative value and produce incorrect results when implicitly casting to
u32.

For example, `atob` in LibWeb uses this decoder to convert non-ASCII
values to UTF-8, but non-ASCII values are >= 0x80 and thus produces
incorrect results in such cases:
```js
Uint8Array.from(atob("u660"), c => c.charCodeAt(0));
```
This used to produce [253, 253, 253] instead of [187, 174, 180].

Required by Cloudflare's IUAM challenges.
2023-02-28 08:46:06 +00:00