Introduce a StreamingDecoder wrapper that lets callers feed bytes to a
Decoder one chunk at a time. It buffers any incomplete trailing byte
sequence at the end of a chunk and prepends it to the next chunk, so a
multi-byte code point split across a chunk boundary is decoded correctly
once the next chunk arrives.
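A minimal sketch of the buffering idea, not the actual StreamingDecoder:
ChunkBuffer, feed(), and pending() are hypothetical names, and the real
decoding step is left out. The tail-length callback stands in for whatever
the underlying decoder reports as incomplete.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the wrapper: it only splits input into
// "safe to decode now" and "hold back for the next chunk".
class ChunkBuffer {
public:
    // tail_length reports how many trailing bytes of its argument form an
    // incomplete sequence.
    explicit ChunkBuffer(std::function<std::size_t(std::string_view)> tail_length)
        : m_tail_length(std::move(tail_length))
    {
    }

    // Returns the bytes that can be decoded now; the incomplete tail is
    // kept and prepended to the next chunk.
    std::string feed(std::string_view chunk)
    {
        std::string input = std::move(m_pending);
        m_pending.clear();
        input.append(chunk);

        std::size_t tail = m_tail_length(input);
        if (tail > input.size())
            tail = input.size();
        m_pending.assign(input, input.size() - tail, tail);
        input.resize(input.size() - tail);
        return input;
    }

    // Bytes still held back. A non-empty result at end of stream means
    // the input was truncated mid-sequence.
    std::string const& pending() const { return m_pending; }

private:
    std::function<std::size_t(std::string_view)> m_tail_length;
    std::string m_pending;
};
```

With a callback that treats an odd trailing byte as incomplete (a
simplification of UTF-16), feeding "abc" yields "ab" and holds back "c";
feeding "def" next yields "cdef".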

To support that, add an incomplete_tail_length() virtual on Decoder
returning the number of trailing bytes that form an incomplete sequence
per the Encoding Standard's decoder handler byte ranges, with overrides
for UTF-8, UTF-16BE, UTF-16LE, GB18030, Big5, EUC-JP, ISO-2022-JP,
Shift_JIS, and EUC-KR. The default implementation returns 0, which keeps
single-byte legacy decoders correct.
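As an illustration of what a UTF-8 override might compute, here is a
simplified sketch. utf8_incomplete_tail_length() is a hypothetical free
function, not the actual override, and it ignores the tightened second-byte
ranges the Encoding Standard applies after 0xE0, 0xED, 0xF0, and 0xF4.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Number of continuation bytes a UTF-8 lead byte expects, or 0 if the
// byte is not a multi-byte lead.
static std::size_t expected_continuations(std::uint8_t byte)
{
    if (byte >= 0xC2 && byte <= 0xDF)
        return 1;
    if (byte >= 0xE0 && byte <= 0xEF)
        return 2;
    if (byte >= 0xF0 && byte <= 0xF4)
        return 3;
    return 0;
}

std::size_t utf8_incomplete_tail_length(std::string_view bytes)
{
    // A lead byte more than 3 bytes from the end cannot be incomplete.
    std::size_t max_lookback = bytes.size() < 3 ? bytes.size() : 3;
    for (std::size_t back = 1; back <= max_lookback; ++back) {
        auto byte = static_cast<std::uint8_t>(bytes[bytes.size() - back]);
        if ((byte & 0xC0) == 0x80)
            continue; // continuation byte, keep looking for its lead
        // This byte is followed by (back - 1) continuation bytes.
        std::size_t needed = expected_continuations(byte);
        return needed > back - 1 ? back : 0;
    }
    // Only continuation bytes in range: leave them to the decoder's
    // normal error handling.
    return 0;
}
```

A single-byte decoder never has an incomplete tail, which is why a default
of 0 is sufficient there.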

This is the foundation for the upcoming incremental HTML parser, which
needs to decode network response bodies as they arrive.

This is preparation for removing the endianness override, since it was
only used by a single client: LibTextCodec.
While here, add helpers and make use of simdutf for fast conversion.

Implement the corresponding encoders and select the appropriate one when
encoding URL search params. If no encoder can be found for the given
encoding, fall back to UTF-8.
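The fallback can be sketched roughly as follows. The Encoder enum and both
function names are hypothetical and only illustrate the lookup-then-default
shape; the real encoder list is larger.

```cpp
#include <optional>
#include <string_view>

// Hypothetical subset of encoders, for illustration only.
enum class Encoder { Utf8, ShiftJis, EucKr };

// Returns the encoder registered under `name`, if there is one.
std::optional<Encoder> encoder_for_exact_name(std::string_view name)
{
    if (name == "utf-8")
        return Encoder::Utf8;
    if (name == "shift_jis")
        return Encoder::ShiftJis;
    if (name == "euc-kr")
        return Encoder::EucKr;
    return std::nullopt;
}

// Encoder to use for URL search params: the requested one if it exists,
// otherwise UTF-8.
Encoder encoder_for_search_params(std::string_view encoding_name)
{
    return encoder_for_exact_name(encoding_name).value_or(Encoder::Utf8);
}
```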

There were two problems:
1. They didn't handle surrogates
2. They used signed chars, leading to e.g. 0x00e4 being treated as 0xffe4
Also add a basic test that catches both issues.
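The second problem is plain sign extension. A small sketch with
hypothetical helper names shows the difference between widening a signed
byte directly and going through an unsigned 8-bit type first:

```cpp
#include <cstdint>

// Widening a signed char directly sign-extends: the byte 0xE4 is -28 as a
// signed char, which becomes 0xFFE4 as a 16-bit value.
std::uint16_t widen_signed(signed char byte)
{
    return static_cast<std::uint16_t>(byte);
}

// Converting to an unsigned 8-bit type first keeps the high byte zero,
// giving 0x00E4.
std::uint16_t widen_unsigned(signed char byte)
{
    return static_cast<std::uint16_t>(static_cast<std::uint8_t>(byte));
}
```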

There's some code duplication with Utf16CodePointIterator::operator*(),
but let's get things working first.
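For reference, the surrogate handling in question looks roughly like this.
decode_utf16_at() is a hypothetical standalone sketch, not the duplicated
code itself, and it assumes unpaired surrogates map to U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

constexpr char32_t replacement_code_point = 0xFFFD;

// Decodes the code point starting at units[index] and advances index past
// the code units it consumed. Requires index < units.size().
char32_t decode_utf16_at(std::u16string_view units, std::size_t& index)
{
    char16_t first = units[index++];
    bool is_high = first >= 0xD800 && first <= 0xDBFF;
    bool is_low = first >= 0xDC00 && first <= 0xDFFF;

    if (is_low)
        return replacement_code_point; // low surrogate with no preceding high
    if (!is_high)
        return first; // ordinary BMP code unit

    if (index < units.size()) {
        char16_t second = units[index];
        if (second >= 0xDC00 && second <= 0xDFFF) {
            ++index;
            return 0x10000 + ((char32_t(first) - 0xD800) << 10)
                + (char32_t(second) - 0xDC00);
        }
    }
    return replacement_code_point; // high surrogate with no following low
}
```

For the pair 0xD83D 0xDE00 this yields U+1F600 and consumes two code units;
a high surrogate followed by a non-surrogate consumes only one.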

Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
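The same distinction exists in the standard library, which makes for a
self-contained illustration. This sketch uses std::string_view as a
stand-in for StringView; the helper name is hypothetical.

```cpp
#include <cstddef>
#include <string_view>

using namespace std::literals::string_view_literals;

// The char const* constructor has to scan for the terminating NUL to
// find the length.
std::size_t length_via_pointer(char const* text)
{
    return std::string_view(text).size();
}

// The literal operator receives the length from the compiler, so no scan
// is needed and the view is usable in constant expressions.
constexpr std::string_view greeting = "hello"sv;
static_assert(greeting.size() == 5);
```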