Reduce size of Unicode tables #145219

Kmeakin · 2025-08-10T17:42:29Z

Follow up to #145027.
Shave a few bytes from the tables by:

Removing ASCII characters from the sets: 31446 bytes to 31420 bytes
Replacing Cased with Titlecase_Letter: 31420 bytes to 31050 bytes
Using match expressions for sufficiently small sets: 31050 bytes to 30754 bytes

rustbot · 2025-08-10T17:42:33Z

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot · 2025-08-10T17:42:35Z

library/core/src/unicode/unicode_data.rs is generated by
src/tools/unicode-table-generator via ./x run src/tools/unicode-table-generator. If you want to modify unicode_data.rs,
please modify the tool then regenerate the library source file with the tool
instead of editing the library source file manually.

joshtriplett · 2025-08-10T19:58:40Z

@Kmeakin For that last change to use match, do you have any way to check how the code size change balances with the table size change? Could you build a standalone program that includes the tables and code, calls all the functions once to ensure they're used, and check the total size of that program, code and data?

Kmeakin · 2025-08-10T22:28:22Z

> @Kmeakin For that last change to use match, do you have any way to check how the code size change balances with the table size change? Could you build a standalone program that includes the tables and code, calls all the functions once to ensure they're used, and check the total size of that program, code and data?

https://godbolt.org/z/ef5ExG5Eo

is_whitespace grows slightly larger, but is offset by getting rid of the 256 bytes of static data
is_control gets slightly worse: LLVM doesn't seem to realise that matches!(c, 0x00..=0x1f | 0x7f | 0x80..=0x9f) can be optimized to matches!(c, 0x00..=0x1f | 0x7f..=0x9f)
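
The claimed equivalence of the two patterns can be checked exhaustively. The following is a minimal sketch, not code from this PR: the function names are made up for illustration, and it assumes `char::is_control` tests exactly the `Cc` general category.

```rust
/// Three-arm form, as a generator might emit it from the ranges of `Cc`.
fn is_control_three_arms(c: char) -> bool {
    matches!(c as u32, 0x00..=0x1f | 0x7f | 0x80..=0x9f)
}

/// Hand-merged form: 0x7f and 0x80..=0x9f are adjacent, so they collapse
/// into a single range.
fn is_control_two_arms(c: char) -> bool {
    matches!(c as u32, 0x00..=0x1f | 0x7f..=0x9f)
}

/// Returns true if both forms agree with each other and with
/// `char::is_control` on every `char`.
fn forms_agree() -> bool {
    ('\0'..=char::MAX).all(|c| {
        is_control_three_arms(c) == is_control_two_arms(c)
            && is_control_two_arms(c) == c.is_control()
    })
}
```

Since both forms accept the same set, any difference in the generated code is purely a missed optimization rather than a behavioural change.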

I will open an issue against LLVM asking them to fix the latter. In the meantime, can we at least merge the commits to remove ASCII characters from the tables?

The ASCII subset of Unicode is fixed and will never change, so we don't need to generate tables for it with every new Unicode version. This saves a few bytes of static data and speeds up `char::is_control` and `char::is_grapheme_extended` on ASCII inputs. Since the table lookup functions exported from the `unicode` module will give nonsensical errors on ASCII input (and in fact will panic in debug mode), I had to add some private wrapper methods to `char` which check for ASCII-ness first.
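
A rough sketch of the wrapper pattern described above, under stated assumptions: `NON_ASCII_RANGES`, `lookup_non_ascii` and `is_grapheme_extended_sketch` are hypothetical names, the table is a one-entry excerpt, and the real generated tables use a different encoding.

```rust
use std::cmp::Ordering;

/// Hypothetical excerpt of a generated table: sorted, non-overlapping,
/// inclusive ranges, none of which contains an ASCII code point.
/// U+0300..=U+036F is the Combining Diacritical Marks block; a real
/// table would be far longer.
static NON_ASCII_RANGES: &[(u32, u32)] = &[(0x0300, 0x036f)];

/// Table lookup that is only meaningful for non-ASCII input.
/// In debug builds it panics if called with an ASCII character.
fn lookup_non_ascii(c: char) -> bool {
    debug_assert!(!c.is_ascii(), "table lookup called with ASCII input");
    let cp = c as u32;
    NON_ASCII_RANGES
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Wrapper that answers for ASCII without touching the table.
/// No ASCII character is in this set, so ASCII short-circuits to `false`.
fn is_grapheme_extended_sketch(c: char) -> bool {
    !c.is_ascii() && lookup_non_ascii(c)
}
```

The ASCII check is a single comparison, so ASCII-heavy input never reaches the table lookup at all.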

`Cased` is a derived property: it is the union of the `Lowercase` property, the `Uppercase` property, and the `Titlecase_Letter` general category. We already have lookup tables for `Lowercase` and `Uppercase`, and `Titlecase_Letter` is very small. So instead of duplicating a lookup table for `Cased`, just test each of those properties in turn. This will probably be slower than the old approach, but it is not a public API: it is only used in `str::to_lowercase` when deciding whether a Greek capital sigma (`Σ`) should be mapped to `ς` or to `σ`. This is a very rare case, so it should not be performance sensitive.
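
As an illustration of testing each property in turn, here is a minimal sketch. It is not the code from this PR: the function names are hypothetical, and the `Titlecase_Letter` code points are written from memory of recent Unicode versions, so they should be verified against `UnicodeData.txt` before being relied on.

```rust
/// Code points with general category `Titlecase_Letter` (Lt).
/// Assumption: this list is believed to match recent Unicode versions
/// but has not been regenerated from the Unicode data files.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
            | '\u{1F88}'..='\u{1F8F}'
            | '\u{1F98}'..='\u{1F9F}'
            | '\u{1FA8}'..='\u{1FAF}'
            | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// `Cased` as the union of `Lowercase`, `Uppercase` and `Titlecase_Letter`.
/// `char::is_lowercase` and `char::is_uppercase` are the existing public
/// lookups for the first two properties.
fn is_cased_sketch(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || is_titlecase_letter(c)
}
```

The only consumer mentioned above is the final-sigma rule: `"ΑΣ".to_lowercase()` yields `"ας"` because the sigma follows a cased letter at the end of a word, while a lone `"Σ"` lowercases to `"σ"`.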

okaneco · 2025-08-10T23:36:15Z

This was the PR that added a lookup table for is_whitespace: #99487

The trade-off is execution speed versus the 256-byte table. Unfortunately, there aren't any benches in tree, but the author of that PR provided a repo they instrumented with criterion. Some of those examples could be included, along with benchmarks on the current corpora.

I suspect the match is still a bit slower than the current table-based implementation of is_whitespace.

If the number of codepoint ranges in a set is sufficiently small, it may be better to simply use a `match` expression rather than a lookup table. The instructions to implement the `match` may be slightly bigger than the table they replace (this is hard to predict, as it depends on the architecture and on whatever optimizations LLVM applies), but in return we eliminate the lookup tables and avoid the slower binary search.
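
To make the trade-off concrete, the sketch below puts a binary search over a range table next to an equivalent `match` for `White_Space`. It is illustrative only: the names are hypothetical, the range-table layout is a generic stand-in rather than the encoding used in `unicode_data.rs`, and the code points are listed from memory (check `PropList.txt`).

```rust
use std::cmp::Ordering;

/// `White_Space` as sorted, inclusive ranges (assumed, not generated).
static WHITE_SPACE_RANGES: &[(u32, u32)] = &[
    (0x0009, 0x000d), (0x0020, 0x0020), (0x0085, 0x0085), (0x00a0, 0x00a0),
    (0x1680, 0x1680), (0x2000, 0x200a), (0x2028, 0x2029), (0x202f, 0x202f),
    (0x205f, 0x205f), (0x3000, 0x3000),
];

/// Table-based lookup: static data plus a binary search at run time.
fn is_whitespace_table(c: char) -> bool {
    let cp = c as u32;
    WHITE_SPACE_RANGES
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// `match`-based lookup: no static data, only comparisons in the code.
fn is_whitespace_match(c: char) -> bool {
    matches!(
        c as u32,
        0x0009..=0x000d | 0x0020 | 0x0085 | 0x00a0 | 0x1680
            | 0x2000..=0x200a | 0x2028..=0x2029 | 0x202f | 0x205f | 0x3000
    )
}
```

Which form ends up smaller or faster depends on the target and on how the compiler lowers the `match`, so measuring both code and data size, as requested earlier in the thread, is the only reliable check.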

scottmcm · 2025-08-10T23:54:24Z

Since it's changing approach somewhat, nominating for team discussion -- especially in hopes that someone remembers the past work done on the tables and static size and such and can comment whether things were tried in the past.

Kmeakin · 2025-08-11T01:06:02Z

> is_control gets slightly worse: LLVM doesn't seem to realise that matches!(c, 0x00..=0x1f | 0x7f | 0x80..=0x9f) can be optimized to matches!(c, 0x00..=0x1f | 0x7f..=0x9f)

Actually, Cc is guaranteed not to change in the future, so we can just hardcode it

rustbot assigned scottmcm Aug 10, 2025

rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) and T-libs (Relevant to the library team, which will review and decide on the PR/issue) labels Aug 10, 2025

Kmeakin force-pushed the km/optimize-unicode-tables branch from c18f085 to 39ae3b7 Compare August 10, 2025 21:42

Kmeakin force-pushed the km/optimize-unicode-tables branch from 39ae3b7 to d172dba Compare August 10, 2025 23:17

Kmeakin added 8 commits August 10, 2025 23:35

fix: Fix panic in unicode-table-generator

7a03b28

Commit 15acb0e introduced a panic when running `./x run tools/unicode-table-generator`. Fix it by undoing one of the refactors.

refactor: Include table sizes in comment at top of unicode_data.rs

55420b1

To make changes in table size obvious from git diffs

refactor: Include size of case conversion tables

25d1876

Include the sizes of the `to_lowercase` and `to_uppercase` tables in the total size calculations.

refactor: rewrite ranges_from_set

4494509

The `merge_ranges` function was very complicated and hard to understand. Fortunately, we can use `slice::chunk_by` to achieve the same thing.

refactor: generate_tests

9ecbc4c

Rewrite `generate_tests` to be more idiomatic.

refactor: Add tests for case conversions

1aec3b8

Kmeakin force-pushed the km/optimize-unicode-tables branch from d172dba to b7fa8ef Compare August 10, 2025 23:37

scottmcm added the I-libs-nominated Nominated for discussion during a libs team meeting. label Aug 10, 2025

Kmeakin mentioned this pull request Aug 11, 2025

Missed fold: x == c || (c + 1 <= x && x <= c2) => (c <= x && x <= c2) llvm/llvm-project#152948

Open

refactor: Hard-code char::is_control

3d5b2b8

According to https://www.unicode.org/policies/stability_policy.html#Property_Value, the set of codepoints in `Cc` will never change. So we can hard-code the patterns to match against instead of using a table.
