Add a fast path for the data state using SSE2 instructions #601
base: main
Conversation
A note that if you're looking at html5ever performance, you might want to look at https://github.com/untitaker/html5gum which only has a tokenizer (no tree builder), but claims to be ~5x faster than html5ever at tokenizing.
Would it make sense to check the first character using scalar code and then jump into the SIMD code?
Maybe! Chromium only enters the SIMD loop when there is a non-whitespace character: https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/html/parser/html_document_parser_fastpath.cc;l=781-796
I did a very rough benchmark which simply tokenizes https://html.spec.whatwg.org.
Seems good to me and it's hard to argue with the performance benefits. I've done a quick check of this, but it would be nice to have a few more eyes on it.
@simonwuelker As this is still marked as a draft, I'm not sure if this is blocked on anything.
It's not blocked; I wanted to figure out a way to avoid the performance regressions. I tried checking the first character using scalar code and only entering the SIMD path when it is not in the stop set, as suggested by @nicoburns. That seems to fix all the regressions seen previously.
New Benchmark results
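For illustration, the guard could look roughly like the sketch below. This is a hypothetical minimal version, not the PR's actual code; it assumes the data-state stop set is `<`, `&`, `\r`, and the NUL byte, and the names are made up.

```rust
/// Hypothetical dispatch sketch: if the very first byte already ends the
/// data state (e.g. the tag-only `strong.html` input, where `<` comes
/// immediately), skip the SIMD setup cost entirely.
fn data_state_run_length(bytes: &[u8], scan_with_simd: impl Fn(&[u8]) -> usize) -> usize {
    match bytes.first() {
        None | Some(b'<' | b'&' | b'\r' | b'\0') => 0,
        _ => scan_with_simd(bytes),
    }
}
```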
The fact that performance in
Signed-off-by: Simon Wülker <[email protected]>
Btw, I'm happy to explain what's happening instruction-by-instruction if reviewing this is otherwise too intricate.
The data state is where the HTML tokenizer spends most of its time. It is also very simple: all it does is scan the input stream for the next character in a set. This can easily be optimized with SIMD instructions. The algorithm I used is described in https://lemire.me/blog/2024/06/08/scan-html-faster-with-simd-instructions-chrome-edition/.
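To give a concrete picture of the technique, here is a rough Rust sketch of such an SSE2 scan, loosely following the approach in the linked blog post. It is an assumption-laden illustration rather than this PR's implementation: the function name is hypothetical, and the stop set (`<`, `&`, `\r`, NUL) is assumed.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{
    __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128, _mm_set1_epi8,
};

/// Hypothetical helper: returns the index of the first byte that ends a
/// data-state run (`<`, `&`, `\r` or NUL), or `bytes.len()` if there is none.
/// SSE2 is part of the x86_64 baseline, so no runtime feature detection is needed.
#[cfg(target_arch = "x86_64")]
fn find_data_state_stop(bytes: &[u8]) -> usize {
    unsafe {
        let lt = _mm_set1_epi8(b'<' as i8);
        let amp = _mm_set1_epi8(b'&' as i8);
        let cr = _mm_set1_epi8(b'\r' as i8);
        let nul = _mm_set1_epi8(0);

        let mut i = 0;
        while i + 16 <= bytes.len() {
            // Load 16 bytes (unaligned) and compare them against each stop character.
            let chunk = _mm_loadu_si128(bytes.as_ptr().add(i) as *const __m128i);
            let hits = _mm_or_si128(
                _mm_or_si128(_mm_cmpeq_epi8(chunk, lt), _mm_cmpeq_epi8(chunk, amp)),
                _mm_or_si128(_mm_cmpeq_epi8(chunk, cr), _mm_cmpeq_epi8(chunk, nul)),
            );
            // One bit per byte: bit n is set iff byte n matched a stop character.
            let mask = _mm_movemask_epi8(hits) as u32;
            if mask != 0 {
                return i + mask.trailing_zeros() as usize;
            }
            i += 16;
        }
        // Scalar fallback for the remaining tail (< 16 bytes).
        while i < bytes.len() {
            if matches!(bytes[i], b'<' | b'&' | b'\r' | b'\0') {
                return i;
            }
            i += 1;
        }
        bytes.len()
    }
}
```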
This change significantly speeds up the tokenizer. Both `lipsum.html` and `lipsum.zh.html` see improvements of 70-80%, which is not surprising since they never leave the data state. Very small inputs regress slightly. There is also a performance regression of ~5% for malicious input that consists only of tags (the `strong.html` benchmark). In that case the SIMD instructions are overkill, because the target character (`<`) is always the first one in the input stream.

Note that the implementation could be made significantly faster by not keeping track of newlines. The only use in Servo for the line number is for script elements, where the line number eventually ends up in https://github.com/servo/mozjs/blob/d1525dfaee22cc1ea9ee16c552cdeedaa9f20741/mozjs-sys/src/jsglue.cpp#L608.
Benchmark results