Commit Graph

24 Commits

Author SHA1 Message Date
Martin Chrástek
c382e5d254 LibTextCodec: Update GB18030 for GB18030-2022 and import WPT tests
Update the GB18030 encoder to spec-compliantly handle old PUA code
points via a direct byte lookup table (spec step 5). Bake the 18
GB18030-2022 code point updates into indexes.json and remove the
now-unnecessary patching logic from the code generator. Drop the
redundant hardcoded switch in the decoder's range function, as the
range formula already produces correct values.

Import WPT tests for gb18030 decoder, gb18030 encoder, and gbk
encoder, and register the worker variant in TestConfig.ini.
2026-05-09 11:44:42 +02:00
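
Note on c382e5d254: the "range formula" mentioned above builds on the four-byte pointer arithmetic from the Encoding Standard's gb18030 decoder. A minimal C++ sketch of that arithmetic follows. It is an illustration only, not the LibTextCodec code: the function name gb18030_four_byte_pointer is hypothetical, and the subsequent lookup from pointer to code point through the ranges index is left out.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper, not part of LibTextCodec.
// Computes the "index gb18030 ranges" pointer for a four-byte sequence,
// following the arithmetic in the Encoding Standard's gb18030 decoder.
std::optional<std::uint32_t> gb18030_four_byte_pointer(
    std::uint8_t first, std::uint8_t second, std::uint8_t third, std::uint8_t fourth)
{
    auto in_range = [](std::uint8_t byte, std::uint8_t low, std::uint8_t high) {
        return byte >= low && byte <= high;
    };

    // Bytes 1 and 3 must be 0x81..0xFE, bytes 2 and 4 must be 0x30..0x39.
    if (!in_range(first, 0x81, 0xFE) || !in_range(second, 0x30, 0x39)
        || !in_range(third, 0x81, 0xFE) || !in_range(fourth, 0x30, 0x39))
        return std::nullopt;

    return (first - 0x81u) * (10u * 126u * 10u)
        + (second - 0x30u) * (10u * 126u)
        + (third - 0x81u) * 10u
        + (fourth - 0x30u);
}
```

For example, the sequence 0x90 0x30 0x81 0x30 yields pointer 189000, which is where the supplementary-plane range starts in the standard's ranges index.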
Martin Chrástek
9267d2d408 LibTextCodec: Fix ISO-2022-JP encoder escape seq on unencodable error
When the encoder encounters an unencodable code point while in jis0208
state, the spec says to emit ESC ( B (0x1B 0x28 0x42) to switch to
ASCII mode before returning an error. The encoder was incorrectly
emitting ESC ( J (0x1B 0x28 0x4A) which selects Roman mode instead.

This caused form submission using ISO-2022-JP to produce incorrect
escape sequences when replacing unencodable characters with numeric
character references.

Also imports the WPT iso2022jp-encode-form-errors-stateful test.
2026-05-07 17:46:31 +02:00
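
Note on 9267d2d408: the behavior described above can be sketched in C++ as follows. This is a rough illustration under assumptions, not the actual encoder: the enum and function names are hypothetical, and only the "switch back to ASCII before reporting the error" step is shown.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names; this is not LibTextCodec's encoder.
enum class Iso2022JpEncoderState { Ascii, Roman, Jis0208 };

// ESC ( B selects ASCII. For comparison, ESC ( J (0x1B 0x28 0x4A) selects
// Roman, which is the sequence the commit message says was emitted by mistake.
constexpr std::uint8_t escape_to_ascii[] = { 0x1B, 0x28, 0x42 };

// Called when a code point has no jis0208 mapping. If the encoder is in the
// jis0208 state, it first emits ESC ( B and returns to the ASCII state, so the
// error (and any numeric character reference written for it) lands in ASCII mode.
void prepare_for_unencodable_error(Iso2022JpEncoderState& state, std::vector<std::uint8_t>& output)
{
    if (state != Iso2022JpEncoderState::Jis0208)
        return;

    for (std::uint8_t byte : escape_to_ascii)
        output.push_back(byte);
    state = Iso2022JpEncoderState::Ascii;
}
```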
Aliaksandr Kalenik
9375499e52 LibTextCodec: Add streaming decoder
Introduce a StreamingDecoder wrapper that lets callers feed bytes to a
Decoder one chunk at a time. It buffers any incomplete trailing byte
sequence at the end of a chunk and prepends it to the next chunk, so a
multi-byte code point split across a chunk boundary is decoded correctly
once the next chunk arrives.

To support that, add an incomplete_tail_length() virtual on Decoder
returning the number of trailing bytes that form an incomplete sequence
per the Encoding Standard's decoder handler byte ranges, with overrides
for UTF-8, UTF-16BE, UTF-16LE, GB18030, Big5, EUC-JP, ISO-2022-JP,
Shift_JIS, and EUC-KR. The default implementation returns 0, which keeps
single-byte legacy decoders correct.

This is the foundation for the upcoming incremental HTML parser, which
needs to decode network response bodies as they arrive.
2026-04-29 04:12:44 +02:00
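
Note on 9375499e52: the buffering scheme described above can be sketched for the UTF-8 case as follows. This is a simplified C++ illustration, not the StreamingDecoder implementation: the names utf8_incomplete_tail_length and ChunkBuffer are hypothetical, and the check only looks at lead-byte ranges rather than the full per-byte ranges the Encoding Standard's decoder uses.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: number of trailing bytes that start a UTF-8 sequence
// which the input ends before completing. Returns 0 if the input ends on a
// sequence boundary (or on bytes that can never become valid).
std::size_t utf8_incomplete_tail_length(std::string_view bytes)
{
    std::size_t const size = bytes.size();
    std::size_t const max_lookback = size < 3 ? size : 3;

    for (std::size_t back = 1; back <= max_lookback; ++back) {
        auto const byte = static_cast<unsigned char>(bytes[size - back]);
        if ((byte & 0xC0) == 0x80)
            continue; // Continuation byte: keep looking for its lead byte.

        std::size_t needed = 0; // 0 means "not a multi-byte lead byte".
        if (byte >= 0xC2 && byte <= 0xDF)
            needed = 2;
        else if (byte >= 0xE0 && byte <= 0xEF)
            needed = 3;
        else if (byte >= 0xF0 && byte <= 0xF4)
            needed = 4;

        return needed > back ? back : 0;
    }
    return 0;
}

// Hypothetical wrapper: holds back an incomplete tail and prepends it to the
// next chunk, returning only bytes that are safe to hand to a decoder.
class ChunkBuffer {
public:
    std::string feed(std::string_view chunk)
    {
        m_pending.append(chunk.data(), chunk.size());
        std::size_t const tail = utf8_incomplete_tail_length(m_pending);
        std::string ready = m_pending.substr(0, m_pending.size() - tail);
        m_pending.erase(0, ready.size());
        return ready;
    }

private:
    std::string m_pending;
};
```

Feeding "a\xE2\x82" followed by "\xAC" returns "a" from the first call and the complete three-byte sequence for U+20AC from the second.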
R-Goc
ae5f28fb40 LibTextEncoder/LibURL: Cleanup includes
Cleans up LibURL/Parser.h to use the forwarding header from
LibTextEncoder.
2026-02-26 18:31:57 +01:00
Timothy Flynn
0fd80a8f99 LibTextCodec+LibWeb: Move isomorphic coders to LibTextCodec
This will be used outside of LibWeb.
2025-11-27 14:57:29 +01:00
ayeteadoe
e497303e94 LibTextCodec: Enable EXPLICIT_SYMBOL_EXPORT
2025-08-23 16:04:36 -06:00
Gingeh
f098bd029c LibTextCodec: Replace unmatched utf16 surrogates
2025-07-05 09:58:57 -04:00
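
Note on f098bd029c: the commit has no body, so the following is only one plausible reading of its title, written as a C++ sketch. The function name is hypothetical, and the choice of U+FFFD as the replacement is an assumption based on common Unicode practice rather than on the commit itself.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not LibTextCodec's code. Copies UTF-16 code units and
// replaces every surrogate that is not part of a high+low pair with U+FFFD.
std::u16string replace_unmatched_surrogates(std::u16string_view input)
{
    constexpr char16_t replacement_character = 0xFFFD;
    auto is_high = [](char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; };
    auto is_low = [](char16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; };

    std::u16string output;
    output.reserve(input.size());

    for (std::size_t i = 0; i < input.size(); ++i) {
        char16_t const unit = input[i];
        if (is_high(unit) && i + 1 < input.size() && is_low(input[i + 1])) {
            // Well-formed pair: keep both code units.
            output.push_back(unit);
            output.push_back(input[++i]);
        } else if (is_high(unit) || is_low(unit)) {
            // Lone high or lone low surrogate.
            output.push_back(replacement_character);
        } else {
            output.push_back(unit);
        }
    }
    return output;
}
```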
ayeteadoe
25f5936dee CMake: Rename serenity_* helper functions/macros to ladybird_*
2025-07-03 23:19:41 +02:00
Timothy Flynn
7280ed6312 Meta: Enforce newlines around namespaces
This has come up several times during code review, so let's just enforce
it using a new clang-format 20 option.
2025-05-14 02:01:59 -06:00
Andreas Kling
0e9480b944 AK+LibTextCodec: Stop using Utf16View endianness override
This is preparation for removing the endianness override, since it was
only used by a single client: LibTextCodec.

While here, add helpers and make use of simdutf for fast conversion.
2025-04-16 10:04:50 +02:00
Timothy Flynn
93712b24bf Everywhere: Hoist the Libraries folder to the top-level
2024-11-10 12:50:45 +01:00
Andreas Kling
13d7c09125 Libraries: Move to Userland/Libraries/
2021-01-12 12:17:46 +01:00
Lukasz Maciejewski
7e5199a394 LibTextCodec: Fix minor errors in Latin2 decoder
2020-12-28 23:31:12 +01:00
Łukasz Maciejewski
518ba73dcb LibTextCodec: Add Latin2 text decoder (#4579)
2020-12-27 22:44:38 +01:00
Andreas Kling
024059b49b LibTextCodec: Normalize incoming encodings in decoder_for()
Instead of asserting when you call TextCodec::decoder_for() with a
non-standard encoding name, let's be nice and see if we can't find a
decoder for the standardized version of the encoding name.
2020-12-13 18:20:50 +01:00
Luke
f3d2053bff LibTextCodec: Add a function to convert encodings to standardized names
https://encoding.spec.whatwg.org/#names-and-labels
2020-11-14 10:14:03 +01:00
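
Note on 024059b49b and f3d2053bff: mapping a label to a standardized encoding name, as the two entries above describe, can be sketched in C++ as follows. The function name and the table are hypothetical. The handful of rows shown come from the Encoding Standard's names-and-labels table linked above, not from LibTextCodec, whose own mapping at the time may have differed.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper, not LibTextCodec's API. Trims ASCII whitespace,
// lowercases ASCII letters, then looks the label up in a (very partial) table.
std::optional<std::string_view> standardized_encoding_name(std::string_view label)
{
    auto is_ascii_whitespace = [](char c) {
        return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    };
    while (!label.empty() && is_ascii_whitespace(label.front()))
        label.remove_prefix(1);
    while (!label.empty() && is_ascii_whitespace(label.back()))
        label.remove_suffix(1);

    std::string lowered(label);
    for (char& c : lowered) {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }

    struct Entry {
        std::string_view label;
        std::string_view name;
    };
    // Partial excerpt for illustration; the real table is much longer.
    static constexpr Entry table[] = {
        { "utf-8", "UTF-8" },
        { "utf8", "UTF-8" },
        { "unicode-1-1-utf-8", "UTF-8" },
        { "iso-8859-2", "ISO-8859-2" },
        { "latin2", "ISO-8859-2" },
        { "l2", "ISO-8859-2" },
    };

    for (auto const& entry : table) {
        if (entry.label == std::string_view(lowered))
            return entry.name;
    }
    return std::nullopt;
}
```

A caller in the spirit of decoder_for() would first run the incoming name through such a function and only then pick a decoder, instead of asserting on anything that is not already in canonical form.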
Ben Wiederhake
69a0502f80 LibTextCodec: Mark compilation-unit-only functions as static
This enables a nice warning in case a function becomes dead code.
2020-08-12 20:40:59 +02:00
Nico Weber
ce95628b7f Unicode: Try s/codepoint/code_point/g again
This time, without trailing 's'. Ran:

    git grep -l 'codepoint' | xargs sed -ie 's/codepoint/code_point/g'
2020-08-05 22:33:42 +02:00
Nico Weber
19ac1f6368 Revert "Unicode: s/codepoint/code_point/g"
This reverts commit ea9ac3155d.
It replaced "codepoint" with "code_points", not "code_point".
2020-08-05 22:33:42 +02:00
Andreas Kling
ea9ac3155d Unicode: s/codepoint/code_point/g
Unicode calls them "code points" so let's follow their style.
2020-08-03 19:06:41 +02:00
Nico Weber
01522b8d71 LibTextCodec: Simplify Latin1Decoder::to_utf8
No intended behavior change.
2020-07-22 19:16:00 +02:00
Andreas Kling
893a9ff5b0 LibTextCodec: Improve Latin-1 decoder so it decodes everything
I can now see Swedish letters when opening Google in the browser. :^)
2020-05-27 19:52:18 +02:00
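
Note on 893a9ff5b0: decoding "everything" in ISO-8859-1 is straightforward because each byte value is its own code point, so bytes 0x80..0xFF simply become two-byte UTF-8 sequences. A minimal C++ sketch follows, assuming plain ISO-8859-1 (not windows-1252); the function name is hypothetical and this is not the Latin1Decoder code.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not LibTextCodec's Latin1Decoder.
// Converts ISO-8859-1 bytes to UTF-8: byte value N maps to code point U+00NN.
std::string latin1_to_utf8(std::string_view input)
{
    std::string output;
    output.reserve(input.size() * 2);

    for (char c : input) {
        auto const byte = static_cast<unsigned char>(c);
        if (byte < 0x80) {
            // ASCII range is identical in UTF-8.
            output.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF encode as 110000xx 10xxxxxx.
            output.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            output.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return output;
}
```

For instance, the byte 0xE5 (the Swedish letter "å") becomes the UTF-8 bytes 0xC3 0xA5.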
Sergey Bugaev
450a2a0f9c Build: Switch to CMake :^)
Closes https://github.com/SerenityOS/serenity/issues/2080
2020-05-14 20:15:18 +02:00
Andreas Kling
e09b83c60c LibTextCodec: Start fleshing out a simple text codec library
We're starting with a very basic decoding API and only ISO-8859-1 and
UTF-8 decoding (and UTF-8 decoding is really a no-op since String is
expected to be UTF-8.)
2020-05-03 23:01:58 +02:00