Tokenize named character references using a DAFSA #645
Currently, named character references are implemented using a `phf` map that is queried repeatedly, once for each character. This works, but has suboptimal performance. Traversing a DAFSA that is generated at compile time makes tokenizing named character references 30% faster. This technique is described in https://www.ryanliptak.com/blog/better-named-character-reference-tokenization/. For illustration, a reduced version of the DAFSA can be viewed here.
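To make the traversal idea concrete, here is a minimal sketch of a longest-match lookup over a hand-written DAFSA. It is not the table or code generated in this PR: the six-entry word list, the `Node` layout, and the names `NODES`, `RESULTS`, and `match_reference` are all hypothetical and chosen for illustration, loosely following the node-numbering scheme described in the linked post. Each node stores how many words can be completed from it, so summing the counts of skipped siblings yields a unique index into a flat result array without any hashing.

```rust
/// One DAFSA node (hypothetical layout). Each node doubles as the labelled
/// edge that leads to it.
#[derive(Clone, Copy)]
struct Node {
    ch: u8,
    /// Number of words that can be completed from this node, itself included.
    number: u8,
    end_of_word: bool,
    last_sibling: bool,
    /// Index of the first child in NODES; 0 means "no children".
    child: u8,
}

const fn node(ch: u8, number: u8, end_of_word: bool, last_sibling: bool, child: u8) -> Node {
    Node { ch, number, end_of_word, last_sibling, child }
}

// Illustrative word list (sorted): amp, amp;, gt, gt;, lt, lt;
// 'g' and 'l' share the 't' node, and 't' and 'p' share the ';' node.
const NODES: [Node; 8] = [
    node(0, 0, false, true, 1),     // 0: root placeholder
    node(b'a', 2, false, false, 4), // 1
    node(b'g', 2, false, false, 5), // 2
    node(b'l', 2, false, true, 5),  // 3
    node(b'm', 2, false, true, 6),  // 4
    node(b't', 2, true, true, 7),   // 5
    node(b'p', 2, true, true, 7),   // 6
    node(b';', 1, true, true, 0),   // 7
];

// Results in the same sorted order as the word list above.
const RESULTS: [char; 6] = ['&', '&', '>', '>', '<', '<'];

/// Returns the length of the longest matching name at the start of `input`
/// (the text after the `&`) together with its replacement character.
fn match_reference(input: &str) -> Option<(usize, char)> {
    let mut first_child = NODES[0].child as usize;
    let mut unique_index = 0usize; // 1-based once a word has been matched
    let mut best = None;

    for (pos, &byte) in input.as_bytes().iter().enumerate() {
        if first_child == 0 {
            break; // the current node has no children
        }
        // Scan the sibling list for a node labelled with `byte`.
        let mut i = first_child;
        let found = loop {
            let n = NODES[i];
            if n.ch == byte {
                break Some(n);
            }
            if n.last_sibling {
                break None;
            }
            // Every word under a skipped sibling sorts before our match.
            unique_index += n.number as usize;
            i += 1;
        };
        let n = match found {
            Some(n) => n,
            None => break,
        };
        if n.end_of_word {
            unique_index += 1;
            best = Some((pos + 1, RESULTS[unique_index - 1]));
        }
        first_child = n.child as usize;
    }
    best
}
```

With this table, `match_reference("ltfoo")` consumes only `lt` and yields `Some((2, '<'))`, while `match_reference("amp;")` yields `Some((4, '&'))`. The lookup touches each input byte once and never restarts, in contrast to querying a map again for every additional character.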
Apologies for the big change. If it's too hard to review we could also merge the DAFSA incrementally. Most of the diff is the list of named entities being moved around and a benchmark file being added.
I have not looked into how this affects the binary size. Some more savings are possible by packing the array of result characters.
Do not merge yet - needs servo companion PR.
This is a non-breaking change for `html5ever` and a breaking change for `web_atoms`, as the list of named character references no longer lives there.