Reduce size of Unicode tables #145219
Conversation
@Kmeakin For that last change to use
Force-pushed from c18f085 to 39ae3b7
https://godbolt.org/z/ef5ExG5Eo
I will open an issue against LLVM asking them to fix the latter. In the meantime, can we at least merge the commits to remove ASCII characters from the tables?
Force-pushed from 39ae3b7 to d172dba
Commit 15acb0e introduced a panic when running `./x run tools/unicode-table-generator`. Fix it by undoing one of the refactors.
To make changes in table size obvious from git diffs
Include the sizes of the `to_lowercase` and `to_uppercase` tables in the total size calculations.
The `merge_ranges` function was very complicated and hard to understand. Fortunately, we can use `slice::chunk_by` to achieve the same thing.
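In sketch form, the `chunk_by`-based merge looks something like this (a minimal sketch assuming sorted, inclusive `(start, end)` pairs; the real generator uses its own range representation):

```rust
/// Merge sorted, inclusive codepoint ranges that touch or overlap.
fn merge_ranges(ranges: &[(u32, u32)]) -> Vec<(u32, u32)> {
    ranges
        // Keep consecutive ranges in the same chunk while they touch or overlap.
        .chunk_by(|&(_, prev_end), &(next_start, _)| next_start <= prev_end + 1)
        // Each chunk collapses to one range spanning first start..last end.
        .map(|chunk| (chunk.first().unwrap().0, chunk.last().unwrap().1))
        .collect()
}

fn main() {
    // The first two ranges touch, so they merge; the third stays separate.
    let ranges = [(0x41, 0x5A), (0x5B, 0x60), (0x7B, 0x7E)];
    assert_eq!(merge_ranges(&ranges), vec![(0x41, 0x60), (0x7B, 0x7E)]);
}
```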
Rewrite `generate_tests` to be more idiomatic.
The ASCII subset of Unicode is fixed and will never change, so we don't need to generate tables for it with every new Unicode version. This saves a few bytes of static data and speeds up `char::is_control` and `char::is_grapheme_extended` on ASCII inputs. Since the table lookup functions exported from the `unicode` module will give nonsensical results on ASCII input (and in fact will panic in debug mode), I had to add some private wrapper methods to `char` which check for ASCII-ness first.
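The wrappers have roughly this shape (a hedged sketch with illustrative names and a placeholder lookup body, not the actual `core` internals):

```rust
// Stand-in for a generated lookup function, which after this change is only
// valid for non-ASCII input (name and body are illustrative).
fn grapheme_extend_lookup_non_ascii(c: char) -> bool {
    debug_assert!(!c.is_ascii(), "generated tables no longer cover ASCII");
    // ... binary search over the generated, ASCII-free ranges ...
    ('\u{0300}'..='\u{036F}').contains(&c) // placeholder: combining diacritical marks
}

/// Wrapper in the style described above: answer the ASCII case without a
/// table lookup, and only consult the generated table for non-ASCII input.
fn is_grapheme_extended(c: char) -> bool {
    // No ASCII character has the Grapheme_Extend property, so the ASCII case
    // is a constant and never touches the table.
    !c.is_ascii() && grapheme_extend_lookup_non_ascii(c)
}

fn main() {
    assert!(is_grapheme_extended('\u{0301}')); // COMBINING ACUTE ACCENT
    assert!(!is_grapheme_extended('a'));
}
```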
`Cased` is a derived property: it is the union of the `Lowercase` property, the `Uppercase` property, and the `Titlecase_Letter` general category. We already have lookup tables for `Lowercase` and `Uppercase`, and `Titlecase_Letter` is very small. So instead of duplicating a lookup table for `Cased`, just test each of those properties in turn. This will probably be slower than the old approach, but it is not a public API: it is only used in `str::to_lowercase` when deciding whether a Greek "sigma" should be mapped to `ς` or to `σ`. This is a very rare case, so it should not be performance-sensitive.
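In sketch form, using the public `char` methods as stand-ins for the internal `Lowercase`/`Uppercase` tables (`is_titlecase_letter` here is a hypothetical helper over the tiny `Lt` category):

```rust
// Derived-property check: no dedicated `Cased` table, just the three parts.
fn is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || is_titlecase_letter(c)
}

// Placeholder: in real code this would be a tiny generated table or `match`
// over the `Titlecase_Letter` (Lt) codepoints. The list below is incomplete
// and purely illustrative.
fn is_titlecase_letter(c: char) -> bool {
    matches!(c, '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}')
}

fn main() {
    assert!(is_cased('σ'));
    assert!(is_cased('Σ'));
    assert!(!is_cased('3'));
}
```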
This was the PR that added the lookup table. The trade-off is execution speed versus the 256-byte table. Unfortunately, there aren't any benches in-tree, but the author of that PR provided a repo they instrumented with criterion. Some of those examples could be included, as well as benchmarks on the current corpora. I suspect the performance of the `match` is still a bit slower than the current table-based implementation.
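For reference, such a bench could look roughly like this (a sketch only; the corpus file and bench name are hypothetical, not part of the repo mentioned above):

```rust
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical corpus file; a real bench would use the corpora from that repo.
const CORPUS: &str = include_str!("corpus.txt");

fn bench_is_control(c: &mut Criterion) {
    c.bench_function("char::is_control over corpus", |b| {
        b.iter(|| CORPUS.chars().filter(|&ch| black_box(ch).is_control()).count())
    });
}

criterion_group!(benches, bench_is_control);
criterion_main!(benches);
```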
If the number of codepoint ranges in a set is sufficiently small, it may be better to simply use a `match` expression rather than a lookup table. The instructions to implement the `match` may be slightly bigger than the table they replace (hard to predict; it depends on the architecture and whatever optimizations LLVM applies), but in return we eliminate the lookup table and avoid the slower binary search.
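As an illustration of the two forms (the ranges below are made up, not taken from a generated table):

```rust
// Table-based form: static data plus a binary search at runtime.
static SMALL_SET: &[(char, char)] = &[
    ('\u{00AA}', '\u{00AA}'),
    ('\u{2028}', '\u{2029}'),
    ('\u{1D7CB}', '\u{1D7CB}'),
];

fn lookup_table(c: char) -> bool {
    SMALL_SET
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                std::cmp::Ordering::Greater
            } else if c > hi {
                std::cmp::Ordering::Less
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .is_ok()
}

// `match`-based form: no static data; the compiler emits straight-line range
// checks instead of a search loop.
fn lookup_match(c: char) -> bool {
    matches!(c, '\u{00AA}' | '\u{2028}'..='\u{2029}' | '\u{1D7CB}')
}

fn main() {
    for c in ['\u{00AA}', '\u{2028}', 'x', '\u{1D7CB}'] {
        assert_eq!(lookup_table(c), lookup_match(c));
    }
}
```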
Force-pushed from d172dba to b7fa8ef
Since it's changing the approach somewhat, nominating for team discussion -- especially in hopes that someone remembers the past work done on the tables and static size, and can comment on whether these things were tried before.
According to https://www.unicode.org/policies/stability_policy.html#Property_Value, the set of codepoints in `Cc` will never change. So we can hard-code the patterns to match against instead of using a table.
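Concretely, `Cc` is exactly the C0 controls, DEL, and the C1 controls, so the check reduces to a two-arm `match` (sketch; the function name is illustrative):

```rust
/// `Cc` is frozen by the stability policy: U+0000..=U+001F (C0), U+007F (DEL),
/// and U+0080..=U+009F (C1).
fn is_cc(c: char) -> bool {
    matches!(c, '\u{0000}'..='\u{001F}' | '\u{007F}'..='\u{009F}')
}

fn main() {
    assert!(is_cc('\u{0007}')); // BEL
    assert!(is_cc('\u{0085}')); // NEL (a C1 control)
    assert!(!is_cc(' '));
}
```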
Follow-up to #145027.
Shave a few bytes from the tables by:

- Replacing the `Cased` table with checks against `Lowercase`, `Uppercase`, and `Titlecase_Letter`: 31420 bytes to 31050 bytes
- Using `match` expressions for sufficiently small sets: 31050 bytes to 30754 bytes