Handle decoding of input in html5ever #590

Draft
wants to merge 1 commit into
base: main

Contributor

@simonwuelker simonwuelker commented Mar 31, 2025

These changes are an attempt to allow users of html5ever to respect the encodings specified with <meta charset="..."> tags in a spec-compliant way.

The major change is that the input stream (https://html.spec.whatwg.org/#input-stream) now lives in the html5ever crates (if the decoding wrapper around the tokenizer is used). As a result, the new API surface exposes a "pull" interface instead of the existing "push" interface.

The entry point to the new API is a DecodingParser, which wraps either an HTML or an XML parser.
After providing some amount of byte input to a DecodingParser, the user can call DecodingParser::parse, which returns an iterator over ParserActions. A parser action is either a <script> tag that needs to be executed or a new encoding that the document should be re-parsed with. The caller can drive the parser by repeatedly advancing this iterator.
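
To make the shape of the pull interface concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the `feed` method, the variant names `ExecuteScript` and `ChangeEncoding`, and the toy string scanning are all assumptions made for illustration, and UTF-8 lossy decoding stands in for real encoding handling.

```rust
/// Hypothetical stand-in for the parser actions described above.
#[derive(Debug, PartialEq)]
enum ParserAction {
    /// A <script> element the caller has to execute.
    ExecuteScript(String),
    /// The document declared an encoding it should be re-parsed with.
    ChangeEncoding(String),
}

/// Hypothetical wrapper that owns the undecoded bytes (the "input stream").
struct DecodingParser {
    pending: Vec<u8>,
}

impl DecodingParser {
    fn new() -> Self {
        DecodingParser { pending: Vec::new() }
    }

    /// The caller hands over raw bytes (assumed method name).
    fn feed(&mut self, bytes: &[u8]) {
        self.pending.extend_from_slice(bytes);
    }

    /// Returns an iterator of actions; advancing it drives the parser.
    /// The scanning below is a toy, not a real tokenizer.
    fn parse(&mut self) -> std::vec::IntoIter<ParserAction> {
        const CHARSET: &str = "charset=\"";
        const OPEN: &str = "<script>";
        const CLOSE: &str = "</script>";

        // Assumption: lossy UTF-8 decoding replaces a real decoder here.
        let text = String::from_utf8_lossy(&self.pending).into_owned();
        self.pending.clear();

        let mut actions = Vec::new();
        let mut rest = text.as_str();
        loop {
            let meta = rest.find(CHARSET);
            let script = rest.find(OPEN);
            match (meta, script) {
                // A charset declaration comes first in the remaining input.
                (Some(m), s) if s.map_or(true, |s| m < s) => {
                    let after = &rest[m + CHARSET.len()..];
                    match after.find('"') {
                        Some(end) => {
                            actions.push(ParserAction::ChangeEncoding(after[..end].to_string()));
                            rest = &after[end..];
                        }
                        None => break,
                    }
                }
                // A script element comes first.
                (_, Some(s)) => {
                    let after = &rest[s + OPEN.len()..];
                    match after.find(CLOSE) {
                        Some(end) => {
                            actions.push(ParserAction::ExecuteScript(after[..end].to_string()));
                            rest = &after[end + CLOSE.len()..];
                        }
                        None => break,
                    }
                }
                _ => break,
            }
        }
        actions.into_iter()
    }
}

/// Caller-side loop: pull actions until the parser asks for a re-parse.
fn drive(input: &[u8]) -> Vec<String> {
    let mut parser = DecodingParser::new();
    parser.feed(input);
    let mut log = Vec::new();
    for action in parser.parse() {
        match action {
            ParserAction::ExecuteScript(src) => log.push(format!("run script: {src}")),
            ParserAction::ChangeEncoding(enc) => {
                log.push(format!("re-parse as: {enc}"));
                break; // the caller would restart with the new encoding
            }
        }
    }
    log
}
```

In this sketch the caller stops pulling as soon as an encoding change is reported, since everything decoded so far would have to be thrown away and re-parsed.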

The old API is fully preserved, without breaking changes (that I'm aware of).

This is a draft because the design is not final and this needs a companion servo PR to verify the correctness of these changes. Initial feedback is welcome.

Depends on #591.

@simonwuelker simonwuelker force-pushed the encodings branch 5 times, most recently from fff1186 to f4d8f88 Compare April 1, 2025 11:25
@simonwuelker simonwuelker force-pushed the encodings branch 2 times, most recently from 730049b to 7f1f591 Compare April 2, 2025 12:48
github-merge-queue bot pushed a commit to servo/servo that referenced this pull request Apr 8, 2025
Companion PR for servo/html5ever#591

Testing: Covered by WPT
Part of #6414,
#24898, preparation for
servo/html5ever#590

---------

Signed-off-by: Simon Wülker <[email protected]>
TG199 pushed a commit to TG199/servo that referenced this pull request Apr 10, 2025
…36284)

Companion PR for servo/html5ever#591

Testing: Covered by WPT
Part of servo#6414,
servo#24898, preparation for
servo/html5ever#590

---------

Signed-off-by: Simon Wülker <[email protected]>
@simonwuelker simonwuelker force-pushed the encodings branch 2 times, most recently from 6bd5a43 to 676cd9b Compare April 11, 2025 11:04
Signed-off-by: Simon Wülker <[email protected]>
@simonwuelker
Contributor Author

The major change is that the input stream (https://html.spec.whatwg.org/#input-stream) now lives in the html5ever crates (if the decoding wrapper around the tokenizer is used). As a result, the new API surface exposes a "pull" interface instead of the existing "push" interface.

I am running into architectural issues with this approach: the prefetch tokenizer needs to run on the same input as the parser, but it shouldn't have to decode the incoming bytes independently. Since the prefetch tokenizer lives in servo, that is going to be difficult when we decode input in html5ever.
I think it will be easier to keep the decoding in servo.
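
A rough sketch of what keeping the decoding on the embedder side could look like: the bytes are decoded once and the same decoded text is handed to both consumers. The `Parser` and `PrefetchTokenizer` types below are hypothetical stand-ins, not the servo or html5ever types, and lossy UTF-8 decoding is an assumption standing in for a real decoder.

```rust
/// Hypothetical stand-in for the main parser; it only records its input.
struct Parser {
    seen: String,
}

impl Parser {
    fn feed(&mut self, chunk: &str) {
        self.seen.push_str(chunk);
    }
}

/// Hypothetical stand-in for the prefetch tokenizer; it collects src="..." URLs.
struct PrefetchTokenizer {
    urls: Vec<String>,
}

impl PrefetchTokenizer {
    fn feed(&mut self, chunk: &str) {
        const SRC: &str = "src=\"";
        let mut rest = chunk;
        while let Some(start) = rest.find(SRC) {
            let after = &rest[start + SRC.len()..];
            let Some(end) = after.find('"') else { break };
            self.urls.push(after[..end].to_string());
            rest = &after[end + 1..];
        }
    }
}

/// Decode the bytes exactly once, then share the decoded text with both
/// consumers so neither has to decode on its own.
fn decode_once_and_share(bytes: &[u8], parser: &mut Parser, prefetch: &mut PrefetchTokenizer) {
    // Assumption: lossy UTF-8 decoding replaces a real encoding-aware decoder.
    let text = String::from_utf8_lossy(bytes);
    parser.feed(&text);
    prefetch.feed(&text);
}
```

With the decoder owned by the caller, both consumers see identical text, which is the property the prefetch tokenizer needs.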
