Reduce size of Unicode tables #145219

Open
wants to merge 10 commits into
base: master
from

Kmeakin
Contributor

@Kmeakin Kmeakin commented Aug 10, 2025

Follow up to #145027.
Shave a few bytes from the tables by:

  • Removing ASCII characters from the sets: 31446 bytes to 31420 bytes
  • Replacing Cased with Titlecase_letter: 31420 bytes to 31050 bytes
  • Using match expressions for sufficiently small sets: 31050 bytes to 30754 bytes

@rustbot
Collaborator

rustbot commented Aug 10, 2025

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Aug 10, 2025
@rustbot
Collaborator

rustbot commented Aug 10, 2025

library/core/src/unicode/unicode_data.rs is generated by
src/tools/unicode-table-generator via ./x run src/tools/unicode-table-generator. If you want to modify unicode_data.rs,
please modify the tool then regenerate the library source file with the tool
instead of editing the library source file manually.

@joshtriplett
Member

@Kmeakin For that last change to use match, do you have any way to check how the code size change balances with the table size change? Could you build a standalone program that includes the tables and code, calls all the functions once to ensure they're used, and check the total size of that program, code and data?
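A standalone program of that kind could look roughly like the sketch below. The choice of methods and the helper name `exercise` are illustrative assumptions, not taken from the PR. The binary would then be built in release mode before and after the change and the total of code plus data compared with whatever size tool is at hand.

```rust
use std::hint::black_box;

/// Calls several table-backed `char` methods once so that their lookup
/// code and static data are kept in the final binary.
/// (Hypothetical helper; the set of methods is illustrative.)
fn exercise(c: char) -> usize {
    // `black_box` hides the value from the optimizer so the calls
    // cannot be folded away at compile time.
    let c = black_box(c);
    let flags = [
        c.is_alphabetic(),
        c.is_lowercase(),
        c.is_uppercase(),
        c.is_numeric(),
        c.is_whitespace(),
        c.is_control(),
    ];
    let hits = flags.iter().filter(|&&f| f).count();
    // The case-mapping tables are reached through the mapping iterators.
    hits + c.to_lowercase().count() + c.to_uppercase().count()
}

fn main() {
    // Take the character from the command line so the input is not a constant.
    let c = std::env::args()
        .nth(1)
        .and_then(|s| s.chars().next())
        .unwrap_or('ß');
    println!("{}", black_box(exercise(c)));
}
```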

@Kmeakin Kmeakin force-pushed the km/optimize-unicode-tables branch from c18f085 to 39ae3b7 Compare August 10, 2025 21:42
@Kmeakin
Contributor Author

Kmeakin commented Aug 10, 2025

> @Kmeakin For that last change to use match, do you have any way to check how the code size change balances with the table size change? Could you build a standalone program that includes the tables and code, calls all the functions once to ensure they're used, and check the total size of that program, code and data?

https://godbolt.org/z/ef5ExG5Eo

  • is_whitespace grows slightly larger, but is offset by getting rid of the 256 bytes of static data
  • is_control gets slightly worse: LLVM doesn't seem to realise that matches!(c, 0x00..=0x1f | 0x7f | 0x80..=0x9f) can be optimized to matches!(c, 0x00..=0x1f | 0x7f..=0x9f)

I will open an issue against LLVM asking them to fix the latter. In the meantime, can we at least merge the commits to remove ASCII characters from the tables?
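For reference, the two `is_control` patterns can be checked for equivalence exhaustively. This is a small sketch with hypothetical function names; it says nothing about the code LLVM generates for either form.

```rust
/// Pattern as written with three alternatives.
fn is_control_split(c: u32) -> bool {
    matches!(c, 0x00..=0x1f | 0x7f | 0x80..=0x9f)
}

/// Same set with the adjacent `0x7f` and `0x80..=0x9f` merged by hand.
fn is_control_merged(c: u32) -> bool {
    matches!(c, 0x00..=0x1f | 0x7f..=0x9f)
}

fn main() {
    // Compare both forms over the whole Unicode code point range.
    for c in 0..=0x10FFFF_u32 {
        assert_eq!(is_control_split(c), is_control_merged(c));
    }
    println!("patterns agree");
}
```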

@Kmeakin Kmeakin force-pushed the km/optimize-unicode-tables branch from 39ae3b7 to d172dba Compare August 10, 2025 23:17
Commit 15acb0e introduced a panic when
running `./x run tools/unicode-table-generator`. Fix it by undoing one
of the refactors.
To make changes in table size obvious from git diffs
Include the sizes of the `to_lowercase` and `to_uppercase` tables in the
total size calculations.
The `merge_ranges` function was very complicated and hard to understand.
Fortunately, we can use `slice::chunk_by` to achieve the same thing.
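A minimal sketch of the idea, assuming the input is a sorted, deduplicated slice of code points; the actual signature and input type of `merge_ranges` in the generator may differ.

```rust
use std::ops::Range;

/// Collapse runs of consecutive code points into half-open ranges.
/// Assumes `codepoints` is sorted and contains no duplicates.
fn merge_ranges(codepoints: &[u32]) -> Vec<Range<u32>> {
    codepoints
        // Keep two neighbours in the same chunk while they are consecutive.
        .chunk_by(|a, b| *a + 1 == *b)
        // `chunk_by` never yields an empty chunk, so indexing is safe.
        .map(|run| run[0]..run[run.len() - 1] + 1)
        .collect()
}

fn main() {
    let ranges = merge_ranges(&[1, 2, 3, 7, 9, 10]);
    println!("{ranges:?}");
}
```

`slice::chunk_by` has been stable since Rust 1.77, so the tool needs at least that toolchain.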
Rewrite `generate_tests` to be more idiomatic.
The ASCII subset of Unicode is fixed and will never change, so we don't
need to generate tables for it with every new Unicode version. This
saves a few bytes of static data and speeds up `char::is_control` and
`char::is_grapheme_extended` on ASCII inputs.

Since the table lookup functions exported from the `unicode` module will
give nonsensical errors on ASCII input (and in fact will panic in debug
mode), I had to add some private wrapper methods to `char` which check
for ASCII-ness first.
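The wrapper pattern can be sketched as follows. The table `CC_NON_ASCII` and the helper `lookup_non_ascii` are hypothetical stand-ins for the generated data and lookup function, which use a different encoding in the real library.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a generated table: it holds only the
/// non-ASCII ranges of the `Cc` general category.
const CC_NON_ASCII: &[(u32, u32)] = &[(0x80, 0x9f)];

/// Table lookup that is only meaningful for non-ASCII input.
fn lookup_non_ascii(table: &[(u32, u32)], c: char) -> bool {
    // Mirrors the behaviour described above: ASCII input is a caller bug
    // and trips an assertion in debug builds.
    debug_assert!(!c.is_ascii(), "ASCII must be handled by the caller");
    let cp = c as u32;
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Wrapper that answers ASCII input directly and only then consults the table.
fn is_control(c: char) -> bool {
    if c.is_ascii() {
        matches!(c, '\0'..='\x1f' | '\x7f')
    } else {
        lookup_non_ascii(CC_NON_ASCII, c)
    }
}

fn main() {
    println!("{} {}", is_control('\n'), is_control('\u{85}'));
}
```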
`Cased` is a derived property - it is the union of the `Lowercase`
property, the `Uppercase` property, and the `Titlecase_Letter` general
category. We already have lookup tables for `Lowercase` and `Uppercase`,
and `Titlecase_Letter` is very small. So instead of duplicating a lookup
table for `Cased`, just test each of those properties in turn.

This will probably be slower than the old approach, but it is not a
public API: it is only used in `str::to_lowercase` when deciding whether a
Greek "sigma" should be mapped to `ς` or to `σ`. This is a very rare
case, so it should not be performance sensitive.
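A rough sketch of that composition using the public `char` methods is shown below. The function names are hypothetical, and the `Titlecase_Letter` code points are written out by hand here on the assumption that the category still has the 31 members of current Unicode versions; the PR generates them.

```rust
/// `Titlecase_Letter` (general category `Lt`), listed by hand for illustration.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
            | '\u{1F88}'..='\u{1F8F}'
            | '\u{1F98}'..='\u{1F9F}'
            | '\u{1FA8}'..='\u{1FAF}'
            | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// `Cased` as the union of `Lowercase`, `Uppercase` and `Titlecase_Letter`.
/// The cheap, common checks run first; the rare titlecase check runs last.
fn is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || is_titlecase_letter(c)
}

fn main() {
    // U+01C5 is a titlecase letter, so only the third test catches it.
    println!("{}", is_cased('\u{01C5}'));
}
```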
@okaneco
Contributor

okaneco commented Aug 10, 2025

The lookup table for is_whitespace was added in #99487.

The trade-off is execution speed versus the 256 byte table. Unfortunately, there aren't any benches in tree but the author of that PR provided a repo they instrumented with criterion. Some of those examples could be included as well as benching on the current corpora.

I suspect the match is still a bit slower than the current table-based implementation for is_whitespace.

If the number of codepoint ranges in a set is sufficiently small, it may
be better to simply use a `match` expression rather than a lookup table.
The instructions to implement the `match` may be slightly bigger than
the table that it replaced (hard to predict, depends on architecture and
whatever optimizations LLVM applies), but in return we eliminate the lookup
tables and avoid the slower binary search.
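As an illustration, `White_Space` is small enough to write as a single `matches!`. The sketch below lists the code points by hand, assuming the property as defined in current Unicode versions, and the function name is hypothetical; the generator would emit the equivalent pattern from the Unicode data files.

```rust
/// `White_Space` expressed as a match instead of a lookup table.
fn is_whitespace_match(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{00A0}'
            | '\u{1680}'
            | '\u{2000}'..='\u{200A}'
            | '\u{2028}'
            | '\u{2029}'
            | '\u{202F}'
            | '\u{205F}'
            | '\u{3000}'
    )
}

fn main() {
    // Cross-check against the standard library over every valid `char`.
    let agree = (0..=0x10FFFF_u32)
        .filter_map(char::from_u32)
        .all(|c| is_whitespace_match(c) == c.is_whitespace());
    println!("agrees with char::is_whitespace: {agree}");
}
```

Whether this beats the table in speed or total size depends on the target and on how LLVM lowers the match, so it would need benchmarking as discussed above.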
@Kmeakin Kmeakin force-pushed the km/optimize-unicode-tables branch from d172dba to b7fa8ef Compare August 10, 2025 23:37
@scottmcm scottmcm added the I-libs-nominated Nominated for discussion during a libs team meeting. label Aug 10, 2025
@scottmcm
Member

Since it's changing approach somewhat, nominating for team discussion -- especially in hopes that someone remembers the past work done on the tables and static size and such and can comment whether things were tried in the past.

According to
https://www.unicode.org/policies/stability_policy.html#Property_Value,
the set of codepoints in `Cc` will never change. So we can hard-code
the patterns to match against instead of using a table.
@Kmeakin
Contributor Author

Kmeakin commented Aug 11, 2025

> • is_control gets slightly worse: LLVM doesn't seem to realise that matches!(c, 0x00..=0x1f | 0x7f | 0x80..=0x9f) can be optimized to matches!(c, 0x00..=0x1f | 0x7f..=0x9f)

Actually, Cc is guaranteed not to change in the future, so we can just hardcode it

Labels
I-libs-nominated Nominated for discussion during a libs team meeting. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.

6 participants