Commit Graph

5 Commits

Author SHA1 Message Date
Simon Wülker
80555eb7e7 script: Don't let LossyDecoder handle the BOM (#41732)
We already handle BOMs when detecting the encoding. The decoder itself
should not touch the BOMs in any way.
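
As an illustration of the detection step mentioned above, here is a minimal sketch of BOM sniffing along the lines of the Encoding Standard. The helper name `sniff_bom` and its return shape are hypothetical and are not taken from Servo or html5ever. What the caller does with the reported BOM length is left open here.

```rust
/// Hypothetical helper (not Servo's API): returns the encoding label implied
/// by a leading byte order mark together with the BOM's length in bytes,
/// or `None` if the input does not start with a BOM.
fn sniff_bom(bytes: &[u8]) -> Option<(&'static str, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(("UTF-8", 3))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(("UTF-16BE", 2))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(("UTF-16LE", 2))
    } else {
        None
    }
}
```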

Depends on https://github.com/servo/html5ever/pull/704

Testing: New tests start to pass

---------

Signed-off-by: Simon Wülker <simon.wuelker@arcor.de>
2026-01-23 07:56:08 +00:00
Simon Wülker
9ca7628dbd script: Skip some steps when determining encoding for XML document (#41637)
XML documents do not use the "determine the encoding" algorithm. As far
as I can tell, it is unspecified how they should determine the encoding
instead. We now check the BOM, the `Content-Type` header and prescan for an
XML encoding declaration (but don't inherit encodings from iframes or
attempt to determine the encoding from heuristics).
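
The checks listed above can be sketched as a simple fallback chain. This assumes the order in the message is also the precedence order, which the message does not state explicitly. `Source` and `pick_xml_encoding` are hypothetical names, and the three inputs stand in for results computed elsewhere.

```rust
/// Where an encoding label came from, in the order the checks are tried.
#[derive(Debug, PartialEq)]
enum Source {
    Bom,
    ContentType,
    XmlDeclaration,
}

/// Hypothetical sketch: take the first available label. Returns `None` when
/// no source yields a label; the fallback in that case is up to the caller.
fn pick_xml_encoding<'a>(
    bom: Option<&'a str>,
    content_type_charset: Option<&'a str>,
    xml_declaration: Option<&'a str>,
) -> Option<(&'a str, Source)> {
    bom.map(|e| (e, Source::Bom))
        .or_else(|| content_type_charset.map(|e| (e, Source::ContentType)))
        .or_else(|| xml_declaration.map(|e| (e, Source::XmlDeclaration)))
}
```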

Testing: New tests start to pass
Part of https://github.com/servo/servo/issues/6414

---------

Signed-off-by: Simon Wülker <simon.wuelker@arcor.de>
2026-01-02 17:53:11 +00:00
Simon Wülker
b0734b7f2d script: Make same-origin iframes inherit encoding from their container document (#41450)
Testing: New tests start to pass
Part of https://github.com/servo/servo/issues/6414

---------

Signed-off-by: Simon Wülker <simon.wuelker@arcor.de>
2025-12-21 11:03:30 +00:00
Simon Wülker
a58d9727f9 script: Use chardetng to guess encoding when all else fails (#41435)
[`chardetng`](https://github.com/hsivonen/chardetng) is the library used
by Gecko to guess encodings.

This makes https://intsys.co.jp/game/panepon/p01/index.html load with
the correct encoding. Notably, that site uses Shift_JIS but has no
encoding declaration of any kind.

Part of https://github.com/servo/servo/issues/6414

---------

Signed-off-by: Simon Wülker <simon.wuelker@arcor.de>
2025-12-21 08:53:42 +00:00
Simon Wülker
8c344f5641 script: Prescan byte stream to determine encoding before parsing document (#41376)
Servo currently completely ignores `<meta charset>` tags. When we find
one with an encoding that is incompatible with the current one, we
should reload the page and start over with the new encoding. A common
optimization that has even made its way into the specification is to
wait for a few bytes to arrive and inspect them for `meta` tags, so the
browser is able to use the correct encoding from the very beginning.

In practice, I've run into problems with our WPT harness when reloading
the page after `meta` tags. Therefore, this change implements the
optimization first, so we never have to reload when running WPT. I've
implemented prescanning in a way where we wait for 1024 bytes to arrive
or for one second to pass, whichever happens first.
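
The waiting rule described above can be sketched as follows. `PrescanBuffer` and its methods are hypothetical and do not reflect Servo's actual types. The current time is passed in explicitly so the rule is easy to test, and end of stream is not modeled.

```rust
use std::time::{Duration, Instant};

// Thresholds taken from the description: 1024 bytes or one second.
const PRESCAN_BYTES: usize = 1024;
const PRESCAN_TIMEOUT: Duration = Duration::from_secs(1);

/// Hypothetical buffer that holds incoming bytes until prescanning may start.
struct PrescanBuffer {
    started: Instant,
    bytes: Vec<u8>,
}

impl PrescanBuffer {
    fn new(started: Instant) -> Self {
        Self { started, bytes: Vec::new() }
    }

    /// Append a chunk received from the network.
    fn push(&mut self, chunk: &[u8]) {
        self.bytes.extend_from_slice(chunk);
    }

    /// True once enough bytes have arrived or the timeout has elapsed,
    /// whichever happens first.
    fn ready(&self, now: Instant) -> bool {
        self.bytes.len() >= PRESCAN_BYTES
            || now.duration_since(self.started) >= PRESCAN_TIMEOUT
    }
}
```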

This causes a large number of web platform tests to flip around. I've
looked at most of the new failures and I believe they're reasonable.

Testing: New tests start to pass.
Part of https://github.com/servo/servo/issues/6414

---------

Signed-off-by: Simon Wülker <simon.wuelker@arcor.de>
2025-12-19 09:54:19 +00:00