When the regular HTML parser is blocked on an external script, the
speculative parser scans ahead and pre-fetches discoverable
sub-resources. Previously those fetches were tracked only in the
parser's own URL list and never registered in the document's preload
map, so when the regular parser later reached each element, fetch()'s
consume_a_preloaded_resource() lookup found nothing and issued a
duplicate request: every parser-blocked sub-resource was fetched twice.
issue_speculative_fetch now creates a PreloadEntry, registers it
under create_a_preload_key(request) in the document's preload map,
and supplies a processResponseConsumeBody callback that populates
the entry. The map insertion happens after fetch() starts because
fetch() runs consume_a_preloaded_resource() synchronously, so
registering the entry beforehand would short-circuit the
speculative fetch itself.
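Roughly, the ordering constraint looks like this; the sketch uses plain
std:: types and hypothetical helper names, not the actual LibWeb
declarations:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Minimal model of the ordering constraint; names are illustrative only.
struct PreloadEntry {
    std::string response_body;                   // populated by processResponseConsumeBody
    std::function<void()> on_response_available; // set by a consumer that arrives first
};

using PreloadMap = std::map<std::string, std::shared_ptr<PreloadEntry>>;

void issue_speculative_fetch(PreloadMap& preload_map, std::string const& preload_key,
    std::function<void(std::function<void(std::string)>)> const& start_fetch)
{
    auto entry = std::make_shared<PreloadEntry>();

    // Start the fetch first: fetch() runs "consume a preloaded resource"
    // synchronously, so registering the entry before this call would make the
    // speculative fetch consume itself instead of hitting the network.
    start_fetch([entry](std::string body) {
        // processResponseConsumeBody: populate the entry for the later consumer.
        entry->response_body = std::move(body);
        if (entry->on_response_available)
            entry->on_response_available();
    });

    // Only now does the entry become discoverable under its preload key.
    preload_map[preload_key] = entry;
}
```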
The body-handling steps (1, 2, 5 of the preload algorithm's
processResponseConsumeBody) are factored into a shared
deliver_preload_response helper used by both the speculative parser
and HTMLLinkElement::preload.
Introduce IncrementalDocumentParser, which streams the response body
through a TextCodec::StreamingDecoder into the HTMLTokenizer one chunk
at a time. The tokenizer pauses when it runs out of input and resumes
once the next chunk is appended; when the body closes we close the
tokenizer's input stream so it can finish the parse.
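A minimal sketch of that flow, with stand-in types (Decoder and Tokenizer
here are not the real TextCodec::StreamingDecoder or HTMLTokenizer APIs):

```cpp
#include <string>
#include <string_view>

// Structural sketch only; these stand-ins are not the real LibWeb classes.
struct Decoder {
    // A real streaming decoder buffers byte sequences split across chunks.
    std::string decode_chunk(std::string_view bytes) { return std::string(bytes); }
};

struct Tokenizer {
    std::string input;
    bool input_closed { false };
    void append_to_input_stream(std::string_view text) { input += text; }
    void close_input_stream() { input_closed = true; }
    void run() { /* tokenize until out of input, then pause (or finish if closed) */ }
};

struct IncrementalParser {
    Decoder decoder;
    Tokenizer tokenizer;

    // Called for each network chunk as the response body arrives.
    void on_body_chunk(std::string_view bytes)
    {
        tokenizer.append_to_input_stream(decoder.decode_chunk(bytes));
        tokenizer.run(); // resumes from wherever the previous chunk paused
    }

    // Called when the body is complete.
    void on_body_complete()
    {
        tokenizer.close_input_stream(); // lets the tokenizer reach EOF and finish the parse
        tokenizer.run();
    }
};
```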
DocumentLoading routes HTML responses through the new parser instead of
buffering the full body before handing it to HTMLParser.
Pull the post-parse-action setup, run loop, and post-parse invocation
out of HTMLParser::run(URL, ...) into a new run_until_completion()
method. The URL overload still calls it; behavior is unchanged. The
incremental parser will use this entry point directly without going
through the URL-setting overload.
Add a ScriptCreatedParser flag plumbed through HTMLParser's constructor
and create_for_scripting(). Only document.open()'s parser sets it to
Yes. Document::close() step 3 now checks is_script_created() so it
correctly skips parsers that weren't created via document.open(),
matching the spec.
Previously the check was just `if (!m_parser)`, which incorrectly let
document.close() insert an EOF into a network-driven parser. The bug
was mostly latent because the network parser used to finish quickly,
but it matters once the network parser stays alive for the duration of
a streamed parse.
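A simplified model of the flag and the close() check, with placeholder
types standing in for the real Document and HTMLParser:

```cpp
#include <memory>

// Illustrative only; the real plumbing goes through HTMLParser's constructor
// and create_for_scripting().
enum class ScriptCreatedParser { No, Yes };

struct Parser {
    ScriptCreatedParser script_created { ScriptCreatedParser::No };
    bool is_script_created() const { return script_created == ScriptCreatedParser::Yes; }
    void insert_eof_and_finish() { /* insert explicit EOF and finish the parse */ }
};

struct Document {
    std::unique_ptr<Parser> parser;

    void open()
    {
        parser = std::make_unique<Parser>();
        parser->script_created = ScriptCreatedParser::Yes; // only document.open()'s parser
    }

    void close()
    {
        // Spec step 3: return if there is no script-created parser. The old
        // `if (!m_parser)` check let close() inject an EOF into a
        // network-driven parser as well.
        if (!parser || !parser->is_script_created())
            return;
        parser->insert_eof_and_finish();
    }
};
```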
Add can_run_out_of_characters() and use it in the
NamedCharacterReference state and consume_next_if_match() so that an
open input stream gets the same code-point-at-a-time treatment as an
active document.write insertion point. Without this, a network chunk
that ends partway through a named character reference or a
multi-character match would make the tokenizer commit to a "no match"
decision before the remaining bytes arrive.
No behavior change for existing callers: the new helper still returns
false once the input stream is closed, and the StringView constructor
closes the stream immediately.
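The lookahead hazard can be modeled roughly like this; the types and the
tri-state return value are illustrative, not the actual tokenizer
interface:

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Simplified model; not the real HTMLTokenizer code.
struct Tokenizer {
    std::u32string input;
    std::size_t position { 0 };
    bool input_stream_closed { false };
    bool at_insertion_point { false }; // an active document.write() insertion point

    // More code points may still appear after the current buffer, either from
    // a pending document.write() or from a network chunk that has not arrived.
    bool can_run_out_of_characters() const
    {
        return at_insertion_point || !input_stream_closed;
    }

    // nullopt means "cannot decide yet"; false means "definitely no match".
    std::optional<bool> consume_next_if_match(std::u32string_view word)
    {
        for (std::size_t i = 0; i < word.size(); ++i) {
            if (position + i >= input.size()) {
                // Without this check the tokenizer would commit to "no match"
                // even though the rest of the word may arrive in the next chunk.
                if (can_run_out_of_characters())
                    return std::nullopt; // pause and retry once more input arrives
                return false;
            }
            if (input[position + i] != word[i])
                return false;
        }
        position += word.size();
        return true;
    }
};
```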
Add an explicit "input stream closed" flag plus the streaming-input API
(append_to_input_stream, close_input_stream, is_input_stream_closed) to
let a future incremental driver feed bytes as they arrive. Rewrite
should_pause_before_next_input_character so the tokenizer pauses when
the buffer is exhausted but more bytes may still arrive, including the
case where a chunk ends in CR (CRLF normalization needs one code point
of lookahead).
Existing call sites are unaffected: the StringView constructor
immediately marks the input stream closed, and insert_eof() now also
closes the stream so document.close() drives the same exit path.
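A sketch of what the rewritten pause predicate has to account for, using
simplified stand-ins for the real tokenizer state:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Simplified sketch; field and method names mirror the description above but
// this is not the real HTMLTokenizer.
struct Tokenizer {
    std::u32string input;
    std::size_t position { 0 };
    bool input_stream_closed { false };

    void append_to_input_stream(std::u32string_view text) { input += text; }
    void close_input_stream() { input_stream_closed = true; }
    bool is_input_stream_closed() const { return input_stream_closed; }

    bool should_pause_before_next_input_character() const
    {
        if (input_stream_closed)
            return false; // everything is buffered; safe to consume through EOF

        auto remaining = input.size() - position;
        if (remaining == 0)
            return true; // buffer exhausted, but more bytes may still arrive

        // CRLF normalization needs one code point of lookahead: if the chunk
        // ends in CR we cannot yet tell whether an LF follows in the next chunk.
        if (remaining == 1 && input[position] == U'\r')
            return true;

        return false;
    }
};
```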
HTML newline normalization collapses CRLF into a single LF, so
next_code_point() needs one code point of lookahead at a CR to decide
whether the CR stands alone or is the first half of a CRLF pair. When
the tokenizer is paused at the insertion point and the next code point
to consume is a CR sitting one position before it, that lookahead has
not been written yet.
Previously the tokenizer consumed the CR and emitted it as LF, so a
subsequent document.write() that began with LF surfaced as a second
LF instead of being absorbed into the original CRLF pair.
Stop one code point earlier in this case and wait for the next write
to arrive. This makes four html5lib write_single WPT tests pass.
The HTML parser's script end tag algorithms save the current insertion
point in an "old insertion point" local before executing a script, then
restore that local after script execution. Ladybird modeled that local
as a single tokenizer field, so nested script execution via
document.write() could overwrite the outer script's saved value.
Keep a stack of old insertion points instead, and adjust saved offsets
when document.write() inserts new input before them. This keeps the
normal script and SVG script paths aligned with the spec text while
leaving the parser-blocking script resume path to set the insertion
point to undefined again.
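A simplified model of the stack bookkeeping, with plain std:: types
standing in for the real tokenizer state:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Illustrative model only; not the actual tokenizer fields.
struct InsertionPointStack {
    std::vector<std::optional<std::size_t>> saved; // one entry per nested script execution

    void push(std::optional<std::size_t> current_insertion_point)
    {
        saved.push_back(current_insertion_point);
    }

    std::optional<std::size_t> pop()
    {
        auto value = saved.back();
        saved.pop_back();
        return value;
    }

    // document.write() inserted `count` code points at `at`; any saved insertion
    // point at or after that position has to shift so it still refers to the
    // same logical location in the input stream.
    void adjust_for_insertion(std::size_t at, std::size_t count)
    {
        for (auto& entry : saved) {
            if (entry.has_value() && *entry >= at)
                *entry += count;
        }
    }
};
```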
Replace the spin_until in SVGScriptElement::process_the_script_element
with an async fetch that mirrors HTMLScriptElement's mark_as_ready
pattern. External SVG scripts now fetch and execute asynchronously,
matching Chromium's behavior.
For HTML-embedded SVG scripts, the parser pauses via the existing
schedule_resume_check infrastructure, extended to support SVG scripts
through a new pending_parsing_blocking_svg_script slot on Document.
For top-level XML/SVG documents, scripts execute when their fetch
completes; the load event is delayed via DocumentLoadEventDelayer which
the existing XMLDocumentBuilder::document_end already waits on.
When the HTML parser blocks on a synchronous external script, run a
separate tokenizer over the unparsed input and issue speculative fetches
for the resources it finds (script src, link rel=stylesheet|preload, img
src), with <base href> tracking and template/foreign-content skipping.
This change also fills in the previously-stubbed "consume a preloaded resource"
algorithm and the document's "map of preloaded resources", so that
<link rel="preload"> followed by a matching consumer deduplicates to
a single fetch.
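The scan over start tags looks roughly like the sketch below; Tag and the
fetch helper are placeholders, and real URL resolution against <base href>
is more involved than shown here:

```cpp
#include <map>
#include <string>

// Rough shape of the speculative scan; not the real LibWeb types.
struct Tag {
    std::string name;
    std::map<std::string, std::string> attributes;
};

struct SpeculativeScanner {
    std::string base_href;   // tracked from <base href>
    int skip_depth { 0 };    // >0 inside <template> or foreign content

    void handle_start_tag(Tag const& tag)
    {
        if (tag.name == "template" || tag.name == "svg" || tag.name == "math") {
            ++skip_depth; // do not prefetch from skipped subtrees
            return;
        }
        if (skip_depth > 0)
            return;

        auto attribute = [&](char const* name) -> std::string {
            auto it = tag.attributes.find(name);
            return it == tag.attributes.end() ? std::string() : it->second;
        };

        if (tag.name == "base" && !attribute("href").empty())
            base_href = attribute("href");
        else if (tag.name == "script" && !attribute("src").empty())
            speculative_fetch(attribute("src"));
        else if (tag.name == "img" && !attribute("src").empty())
            speculative_fetch(attribute("src"));
        else if (tag.name == "link" && !attribute("href").empty()) {
            auto rel = attribute("rel");
            if (rel == "stylesheet" || rel == "preload")
                speculative_fetch(attribute("href"));
        }
    }

    void handle_end_tag(Tag const& tag)
    {
        if ((tag.name == "template" || tag.name == "svg" || tag.name == "math") && skip_depth > 0)
            --skip_depth;
    }

    void speculative_fetch(std::string const& relative_url)
    {
        // Placeholder: resolve relative_url against base_href (or the document
        // URL) and hand it to issue_speculative_fetch / the preload map.
    }
};
```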
Spinning a nested event loop to wait for a parser-blocking script blocks
the calling thread, can deadlock, and creates reentrancy hazards. Switch
to an event-driven pause/resume model, mirroring the prior
HTMLParserEndState refactor (df96b69e7a).
Three WPT document.write tests flip from Fail to Pass and are
rebaselined: all write an external script via document.write() followed
by inline content. With spin_until, control did not return to the caller
of document.write() between writing the script and observing its effects,
so the tests' order assertions saw a different sequence than the spec
mandates.
Inline JS-to-JS frames no longer live in the raw execution context
vector, so LibWeb callers that need to inspect or pop contexts now go
through VM helpers instead of peeking into that storage directly.
This keeps the execution context bookkeeping encapsulated while
preserving existing microtask and realm-entry checks.
HTMLParser::the_end() had three spin_until calls that blocked the event
loop: step 5 (deferred scripts), step 7 (ASAP scripts), and step 8
(load event delay). This replaces them with an HTMLParserEndState state
machine that progresses asynchronously via callbacks.
The state machine has three phases matching the three spin_until calls:
- WaitingForDeferredScripts: loops executing ready deferred scripts
- WaitingForASAPScripts: waits for ASAP script lists to empty
- WaitingForLoadEventDelay: waits for nothing to delay the load event
Notification triggers re-evaluate the state machine when conditions
change: HTMLScriptElement::mark_as_ready, stylesheet unblocking in
StyleElementBase/HTMLLinkElement, did_stop_being_active_document, and
DocumentLoadEventDelayer decrements. NavigableContainer state changes
(session history readiness, content navigable cleared, lazy load flag)
also trigger re-evaluation of the load event delay check.
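A rough sketch of the progression, with the condition checks abstracted
into callbacks (this is illustrative, not the actual HTMLParser code):

```cpp
#include <functional>

// Simplified three-phase progression; the predicates stand in for the real checks.
enum class HTMLParserEndState {
    WaitingForDeferredScripts,
    WaitingForASAPScripts,
    WaitingForLoadEventDelay,
    Done,
};

struct EndStateMachine {
    HTMLParserEndState state { HTMLParserEndState::WaitingForDeferredScripts };

    std::function<bool()> has_ready_deferred_script;
    std::function<void()> execute_next_ready_deferred_script;
    std::function<bool()> all_deferred_scripts_done;
    std::function<bool()> asap_script_lists_empty;
    std::function<bool()> nothing_delays_load_event;
    std::function<void()> continue_with_remaining_end_steps;

    // Invoked from the notification triggers instead of spinning the event loop.
    void check_progress()
    {
        if (state == HTMLParserEndState::WaitingForDeferredScripts) {
            while (has_ready_deferred_script())
                execute_next_ready_deferred_script();
            if (!all_deferred_scripts_done())
                return; // wait for the next mark_as_ready notification
            state = HTMLParserEndState::WaitingForASAPScripts;
        }
        if (state == HTMLParserEndState::WaitingForASAPScripts) {
            if (!asap_script_lists_empty())
                return;
            state = HTMLParserEndState::WaitingForLoadEventDelay;
        }
        if (state == HTMLParserEndState::WaitingForLoadEventDelay) {
            if (!nothing_delays_load_event())
                return;
            state = HTMLParserEndState::Done;
            continue_with_remaining_end_steps();
        }
    }
};
```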
Key design decisions and why:
1. Microtask checkpoint in schedule_progress_check(): The old spin_until
called perform_a_microtask_checkpoint() before checking conditions.
This is critical because HTMLImageElement::update_the_image_data step
8 queues a microtask that creates the DocumentLoadEventDelayer.
Without the checkpoint, check_progress() would see zero delayers and
complete before images start delaying the load event.
2. deferred_invoke in schedule_progress_check():
I tried Core::Timer (0ms), queue_global_task, and synchronous calls.
Timers caused non-deterministic ordering with the HTML event loop's
task processing timer, leading to image layout tests failing (wrong
subtest pass/fail patterns). Synchronous calls fired too early during
image load processing before dimensions were set, causing 0-height
images in layout tests. queue_global_task had task ordering issues
with the session history traversal queue. deferred_invoke runs after
the current callback returns but within the same event loop pump,
giving the right balance.
3. Navigation load event guard (m_navigation_load_event_guard): During
cross-document navigation, finalize_a_cross_document_navigation step
2 calls set_delaying_load_events(false) before the session history
traversal activates the new document. This creates a transient state
where the parent's load event delay check sees the about:blank document
(which has ready_for_post_load_tasks=true) as the active document and
completes prematurely; the guard keeps the end-state machine from
completing during that window.
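The scheduling described in points 1 and 2 combines into a helper shaped
roughly like this; the callback parameters are stand-ins for the real
microtask-checkpoint and deferred_invoke facilities:

```cpp
#include <functional>

// Illustrative sketch only; the callbacks are supplied by the caller and
// stand in for the real APIs.
void schedule_progress_check(
    std::function<void()> const& perform_a_microtask_checkpoint,
    std::function<void(std::function<void()>)> const& deferred_invoke,
    std::function<void()> check_progress)
{
    // The old spin_until drained microtasks before re-checking its condition;
    // keeping the checkpoint here lets image loads install their
    // DocumentLoadEventDelayer before the load-event-delay phase is evaluated.
    perform_a_microtask_checkpoint();

    // deferred_invoke (rather than a 0ms timer or queue_global_task) runs the
    // re-check after the current callback returns but within the same event
    // loop pump, which avoids the ordering problems described above.
    deferred_invoke(std::move(check_progress));
}
```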
Remove includes from Node.h that are only needed for forward
declarations (AccessibilityTreeNode.h, XMLSerializer.h,
JsonObjectSerializer.h). Extract StyleInvalidationReason and
FragmentSerializationMode enums into standalone lightweight
headers so downstream headers (CSSStyleSheet.h, CSSStyleProperties.h,
HTMLParser.h) can include just the enum they need instead of all of
Node.h. Replace Node.h with forward declarations in headers that only
use Node by pointer/reference.
This breaks the circular dependency between Node.h and
AccessibilityTreeNode.h, reducing AccessibilityTreeNode.h's
recompilation footprint from ~1399 to ~25 files.
When a document is navigated away from while HTMLParser::the_end() is
spinning the event loop (steps 7 and 8), the spin_until stays on the
call stack indefinitely, causing all subsequent event processing on the
same event loop to happen within nested spin_until pumping. Add
is_fully_active() checks to bail out early in this case.
This adds visit_edges(Cell::Visitor&) methods to various helper structs
that contain GC pointers, and makes sure they are called from owning
GC-heap-allocated objects as needed.
These were found by our Clang plugin after expanding its capabilities.
The added rules will be enforced by CI going forward.
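The pattern being enforced looks like this in miniature; Visitor and the
struct names are placeholders, not the real GC::Cell::Visitor or LibWeb
types:

```cpp
// Illustration of the pattern only.
struct Visitor {
    void visit(void const* ptr) { (void)ptr; /* mark ptr as reachable */ }
};

struct SomeGCObject { };

// A plain helper struct holding GC pointers gets its own visit_edges()...
struct HelperState {
    SomeGCObject* target { nullptr };

    void visit_edges(Visitor& visitor) { visitor.visit(target); }
};

// ...and the owning heap-allocated object must forward to it, otherwise the
// pointers inside the helper are invisible to the garbage collector.
struct OwningCell {
    HelperState helper;

    void visit_edges(Visitor& visitor) { helper.visit_edges(visitor); }
};
```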
Introduce the HTMLSelectedContentElement and integrate it into
<select>, <option> and HTMLParser.
See whatwg/html#10548.
There are two bugs with WPT tests which cause the third subtest
in selectedcontent.html and selectedcontent-mutations.html to fail.
See whatwg/html#11882, web-platform-tests/wpt#55849.
This implements the parsing part of the customizable <select> spec update.
See whatwg/html PR #10548.
Two failing subtests in `html5lib_innerHTML_tests_innerHTML_1.html`
and `customizable-select/select-parsing.html` are due to the spec
still disallowing `<input>` inside `<select>`, even though Chrome
has already implemented this behavior (see whatwg/html#11288).
An upcoming commit will migrate the contents of Headers.h/cpp to LibHTTP
for use outside of LibWeb. These CORS and MIME helpers depend on other
LibWeb facilities, however, so they cannot be moved.
Step 2.(a).5 says to abort, but we were instead carrying on and would
run steps 3 and 4. Those steps would not change the result at all, but
this avoids a little unnecessary work.
I wrapped a couple of comments at 120 columns while I was at it.
When detecting an element's opening tag, the spec asks us to skip ahead
to the first whitespace or end chevron character before trying to read
attributes. Instead, we were always skipping 2 positions ahead and then
ignoring all whitespace characters and slashes, which was clearly wrong.
Theoretically this could have caused some weird behaviors if part of the
opening tag matched an expected attribute name, but it's very unlikely
to see that in the wild.
This did not cause any immediate issues except generating instances of
`Attr` with useless values which caused some unnecessary work during
encoding detection.
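A simplified version of the corrected skip (the real code operates on the
raw byte stream during encoding detection, not on a string_view):

```cpp
#include <cstddef>
#include <string_view>

// Illustrative only; not the actual implementation.
std::size_t skip_past_tag_name(std::string_view input, std::size_t position)
{
    // Per the prescan algorithm, advance to the first whitespace or '>' byte so
    // that attribute parsing starts at a real attribute boundary, instead of
    // unconditionally skipping two positions ahead.
    while (position < input.size()) {
        char c = input[position];
        if (c == '>' || c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r')
            break;
        ++position;
    }
    return position;
}
```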
This ends up saving quite a bit of memory on many pages, since UTF-32
uses 4 bytes per code point.
As an example, it reduces the footprint on https://gymgrossisten.com/
by 2 MiB.
Update Element::parse_fragment and Node::unsafely_set_html to
propagate exceptions.
This refactor is needed as a prerequisite for implementing the XML
fragment parser, which requires consistent error handling in fragment
parsing.