Add a Segmenter implementation that implements the UAX#14 line breaking
rules applicable to ASCII text. This avoids the need to build an ICU
BreakIterator for the majority of text on the web.
Primary motivation for this change is the `VERIFY(icu_success(status))`
line in `Segmenter::create()` that was failing on multiple systems and
where we had to ask people to apply a patch to even know what the error
was.
Since this seems to be a recurring problem, let's just add a little
helper function and print the error codes returned by library calls.
For ASCII text, every character is its own grapheme - there are no
combining characters or emoji sequences. This means grapheme boundary
detection is trivial: next_boundary(i) is simply i+1.
This commit adds AsciiGraphemeSegmenter, a simple Segmenter subclass
that performs O(1) boundary lookups without any ICU overhead.
TextNode::grapheme_segmenter() now checks if the text is ASCII and uses
this fast path, avoiding expensive ICU BreakIterator cloning and
boundary detection for the common case of ASCII-only text.
This is a strictly UTF-16 string with some optimizations for ASCII.
* If created from a short UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an inlined byte buffer.
* If created with a long UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an outlined char buffer.
* If created with a short or long UTF-8 or UTF-16 string that is not
ASCII, then the string is stored in an outlined char16 buffer.
We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.
This also adds a UDL to construct a Utf16View from a string literal:
auto string = u"hello"sv;
This let's us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
When the cursor was positioned at the end of text,
attempting to move it left(using the left arrow key)
would fail because align_boundary() was rejecting
the end-of-text position as a valid boundary.
In the UTF-8 implementation, this prevents out-of-bounds access of the
underlying text data, as the ICU macro would essentially do something
akin to `text[text.length()]`.
The UTF-16 implementation already checks for out-of-bounds, but would
previously return 0. We now return an empty Optional in both impls. This
doesn't affect LibJS (the user of the UTF-16 impl), as it already does
bounds checking before invoking LibUnicode APIs.