Add compressible_bytes data #6604

hsivonen · 2025-05-16T10:40:45Z

As a side effect, always load CollationSpecialPrimaries, since we no longer know at collator instantiation time if some of the data in the struct is going to be used.

Preparation for #6537

hsivonen · 2025-05-16T12:02:27Z

Deferring to @robertbastian for what to do about the crates.io-dependent test-tutorials task.

sffc

Some TODOs in the code, which would presumably be resolved when unicode-org/icu/pull/3495 lands

sffc · 2025-05-16T12:40:26Z

components/collator/src/comparison.rs

+        let field = self.arr[usize::from(b >> 3)];
+        let mask = 1 << (b & 0b111);
+        (field & mask) != 0


This appears to correctly reverse the compression mapping in unicode-org/icu/pull/3495
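As a minimal sketch of the lookup being reviewed, assuming a 256-bit set stored as 32 bytes where flag `b` lives in bit `b & 7` of byte `b >> 3`: the type name `CompressibleBytes` appears later in this thread, but the field name `arr` comes only from the quoted diff and the `from_flags` constructor is hypothetical, added here just to show the packing direction that the lookup reverses.

```rust
/// Hypothetical wrapper around a 256-bit set, one bit per possible byte value.
struct CompressibleBytes {
    arr: [u8; 32],
}

impl CompressibleBytes {
    /// Hypothetical constructor: packs 256 flags so that flag `b` lands in
    /// bit `b & 7` of byte `b >> 3`.
    fn from_flags(flags: &[bool; 256]) -> Self {
        let mut arr = [0u8; 32];
        for (b, &set) in flags.iter().enumerate() {
            if set {
                arr[b >> 3] |= 1u8 << (b & 0b111);
            }
        }
        Self { arr }
    }

    /// Mirrors the three lines quoted from the diff above.
    fn is_compressible(&self, b: u8) -> bool {
        let field = self.arr[usize::from(b >> 3)];
        let mask = 1 << (b & 0b111);
        (field & mask) != 0
    }
}
```

Because `b >> 3` is at most 31 for any `u8`, the array index cannot go out of bounds.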

sffc · 2025-05-16T12:43:20Z

components/collator/src/provider.rs

+    #[cfg_attr(feature = "serde", serde(borrow))]
+    pub compressible_bytes: ZeroVec<'data, u8>,


Suggestion (optional): Since this struct now contains 2 fixed-length ZeroVecs, it would be slightly more efficient to represent them as a single ZeroVec, or possibly a MultiFieldsULE.

The other ZeroVec isn't logically fixed-length for all time, but it has 4 items for now and has had 4 items for a long time.

Also, the other one has u16s, and I felt uneasy about introducing bugs to the old thing if I tweak it in a hurry.

Note that a ZeroVec is sized at 3 × usize, which isn't much more efficient than just using a [u8; 32] here.

If you're trying to reduce stack size via indirection, perhaps we could go further and amortize costs? This is essentially the same thing as Shane's suggestion around collapsing the zerovecs, under the hood, but done in a way that allows you to expand last_primaries and also avoids funky indexing/bit math.

#[make_varule(CollationSpecialPrimariesInternalULE)]
struct CollationSpecialPrimariesInternal<'data> {
    numeric_primary: u8,
    bytes: [u8; 32],
    last_primaries: ZeroVec<'data, u16>,
}

struct CollationSpecialPrimaries<'data> {
    internal: VarZeroCow<'data, CollationSpecialPrimariesInternalULE>,
}

The way make_varule currently works will lead to CollationSpecialPrimariesInternalULE having a .bytes() method that returns a stack copy of bytes. We don't actually want that here, since it's a big type. A 4-register copy isn't a huge deal, but it's worth avoiding.

I have a couple ideas for designs for this, but broadly speaking this would be a good time to introduce some further control over these generated getters. A simple thing would be to add a #[zerovec::ref_getter] attribute to bytes that makes it return a reference instead. We should put a bit of thought into what the full extent of configurability looks like here, so we can design the attribute syntax holistically (zerovec::getter(ref)? Also should we be doing something special for AsULE<ULE = Self> types?) but

or possibly a MultiFieldsULE

We're not putting this in a VarZeroVec, so a VarULE type isn't that useful on its own. It can't be stuck flat into the struct above. Perhaps paired with a VarZeroCow, but the VarZeroCow has a similar stack cost as a VarZeroVec; we only really gain from it if we collapse the entire type as I sketch out above.

which isn't much more efficient than just using a [u8; 32] here.

Are we allowed to use [u8; 32] directly in a data struct?

Yes, as it's Copy. The reason not to use it would be stack size, not sure how relevant this is here.

Observation: if you can get the compressible bytes to fit in [u8; 31] instead of [u8; 32], then it can fit into the padding of the CollationSpecialPrimaries and the stack size is equal to what it would be if you added another ZeroVec.

Unfortunately, I don't know enough of what guarantees FractionalUCA/genuca make to know whether [u8; 31] would work.

Yes, as it's Copy. The reason not to use it would be stack size, not sure how relevant this is here.

I didn't know that. I've refreshed the PR.

Stack size how? The data struct itself is still behind a reference, right?

Stack size might be relevant for the jamo table. The obvious follow-up question is: should the jamo table have been a direct array instead of ZeroVec?

In the borrowed version it's behind a reference. In the owned version, the DataPayload contains the data struct (plus an RC to the buffer).

Unfortunately, I don't know enough of what guarantees FractionalUCA/genuca make to know whether [u8; 31] would work.

Currently, the last item is non-zero and first item is zero.

hsivonen · 2025-05-16T14:28:44Z

Some TODOs in the code, which would presumably be resolved when unicode-org/icu/pull/3495 lands

The TODOs in datagen can be fixed independently of anything else. I didn't add datagen-level length enforcement here, because we don't have it for the jamo table. Also, datagen-level length enforcement isn't a must-have, because we have run-time length enforcement.

hsivonen · 2025-05-19T06:41:12Z

So the reason why this didn't get merged on Friday is that we ourselves depend on semver-correctness, so perhaps we should go back to my initial suggestion and do this by the book instead of trying a semver-violation in an effort to reduce the total number of structs.

I need to focus on other things this week, but the following would be by the book and surely landable from a semver standpoint:

- From this PR, keep the CompressibleBytes accessor wrapper type and the notion of hardcoding ICU4C 77.x values for compressible bytes in datagen when actual TOML is missing.
- Keep Collator loading SpecialPrimaries conditionally as today on main.
- Introduce a new singleton data struct just for compressible bytes.
- Introduce CollationKeyGenerator/CollationKeyGeneratorBorrowed that wraps Collator/CollatorBorrowed plus the data for compressible bytes.
- Put the collation key generation entry points on CollationKeyGeneratorBorrowed. (The internals could be on CollatorBorrowed if CompressibleBytes is given as an argument to the internals. Or the internals could be on CollationKeyGeneratorBorrowed.)

@sffc @robertbastian @Manishearth , what are your thoughts?

robertbastian · 2025-05-19T08:48:23Z

we ourselves depend on semver-correctness

We don't really. The argument that it's unlikely that anyone is using 2.0 custom baked data already still holds. The CI check is pretty much a semver check, so it's fine to disable it for this push.

hsivonen · 2025-05-19T12:59:19Z

we ourselves depend on semver-correctness

We don't really. The argument that it's unlikely that anyone is using 2.0 custom baked data already still holds. The CI check is pretty much a semver check, so it's fine to disable it for this push.

@sffc, if you re-approve, please also land this.

sffc · 2025-05-19T16:09:31Z

Additional ICU4X WG discussion with @sffc @Manishearth @robertbastian

Potential solutions:

1. Append the compressible bytes to the existing zerovec now. Can introduce V2 in a future release.
2. Maintain V1 and V2 data, generate both, consume one in unstable. No adapter. Remove V1 in 2.1.
   - Pro: Don't need to maintain V1 data for very long
   - Con: breaks new icu_collator with old ~~icu_collator_data or~~ custom baked data OR requires carrying additional adapter code.
3. Just break it and yank
   - Con: Yank doesn't work on existing lockfiles. Bad ecosystem citizenship.
4. Wait on adding this feature til 2.1. Use V1 V2 model with an adapter.
   - Does not have the same con as (2) because of our ~ deps.
   - Con: We've wanted this feature for ages.
   - Con: Collator 2.1 needs to fill in data in the adapter if it is given V1 buffer data
   - Con: Could result in version skew: for example, icu_collator ships Unicode 16 data in its adapter, but icu4x-datagen 2.0.0 could be run with Unicode 17.
5. Make a new type that wraps the existing type; CollationSearch. Uses a second data key.
   - Con: More stack size (though it's unclear if it's that important for a mostly-singleton type)
   - Con: More types.
6. Drop V1 and release 2.1 soon (this week)
   - Pro: No semver problems
   - Pro: Don't need to maintain V1 data
   - Con: Technically breaks new-code-old-data
   - Con: Not a good look for the project

@sffc The reason to land soon and release is we don't want to maintain this V1 V2 thing.
@robertbastian I think the ship on quick post-release fixes has sailed: Henri has released collator in idna, and burntsushi has built a whole crate around icu4x 2.0 already
@sffc Agreed

Conclusion:

(For main) Proceed with "Option 1": Append the compressible bytes to the existing zerovec. When loading data, if the zerovec is too small, fill in Unicode 16 data hardcoded in icu_collator.
(Soon, for main, pre-2.1) Make sure icu_provider_source fails if generating data from a newer version of icuexportdata (maybe also cldr) without a --force type flag, in order to prevent accidentally generating Unicode 17 data with the shorter vec
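The load-time fallback in the conclusion can be sketched as follows. This is only an illustration under stated assumptions: the data is modeled as a plain `&[u16]` rather than a ZeroVec, the function and constant names are hypothetical, and `HARDCODED_FALLBACK` is a zeroed placeholder, not the real Unicode 16 values.

```rust
/// Number of special-primary entries that precede the appended bits
/// (four, per the struct documentation quoted later in this thread).
const SPECIAL_PRIMARIES_LEN: usize = 4;
/// 256 compressible bits packed into 16-bit words.
const COMPRESSIBLE_WORDS: usize = 16;
const FULL_LEN: usize = SPECIAL_PRIMARIES_LEN + COMPRESSIBLE_WORDS;

/// Placeholder only: NOT the real Unicode 16 compressible-byte data.
const HARDCODED_FALLBACK: [u16; COMPRESSIBLE_WORDS] = [0; COMPRESSIBLE_WORDS];

/// Hypothetical helper: returns the compressible bits from the data, or the
/// hardcoded fallback when the data predates the appended bits.
fn compressible_words(data: &[u16]) -> Result<[u16; COMPRESSIBLE_WORDS], &'static str> {
    match data.len() {
        // Old-format data: only the special primaries are present.
        SPECIAL_PRIMARIES_LEN => Ok(HARDCODED_FALLBACK),
        // New-format data: the bits follow the special primaries.
        FULL_LEN => {
            let mut out = [0u16; COMPRESSIBLE_WORDS];
            out.copy_from_slice(&data[SPECIAL_PRIMARIES_LEN..]);
            Ok(out)
        }
        _ => Err("unexpected special primaries length"),
    }
}
```

The second conclusion item exists because this fallback is only correct while old-format data is known to correspond to Unicode 16.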

LGTM: @sffc @robertbastian @Manishearth

To decide later:

In 2.1, change the V1 struct's baked invariants to require the data, and blob will need to allocate this data if it's missing
If we really want a release we can do a patch release after the PR is merged, otherwise this will be in 2.1

Un-approving based on the discussion above

hsivonen · 2025-05-20T09:07:49Z

Proceed with "Option 1": Append the compressible bytes to the existing zerovec.

The current ZeroVec has elements of u16. Is the intent still to eventually use compressible bytes as &[u8; 32] (as opposed to &[u16; 16])? What's the correct way to append raw bytes to a ZeroVec of u16 without accidentally causing each pair of bytes to flip?

(Also, I really need to work on other stuff this week. Sorry.)

sffc · 2025-05-20T12:13:57Z

You can just add the elements to the end of the ZeroVec<u16> as bytes, which works so long as there is an even number of bytes (since they need to be u16s).

https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=fdf57b4edbf073ef295f3729754bc855
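Independently of the playground link, the byte-pair concern can be illustrated with the standard library alone. This sketch does not use the zerovec API; it only assumes what the thread states later, namely that the u16 elements are stored in little-endian order. The helper names are hypothetical.

```rust
/// Reinterprets an even-length byte slice as little-endian u16 values.
fn bytes_to_u16_le(bytes: &[u8]) -> Vec<u16> {
    assert!(bytes.len() % 2 == 0, "need an even number of bytes");
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Serializes u16 values back to bytes in little-endian order.
fn u16_to_bytes_le(values: &[u16]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Serializes in big-endian order, shown only to illustrate the flip.
fn u16_to_bytes_be(values: &[u16]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_be_bytes()).collect()
}
```

Round-tripping through the little-endian pair of functions returns the original bytes, while mixing in the big-endian serializer swaps each pair, which is the failure mode the question is about.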

robertbastian · 2025-05-20T13:39:58Z

You can just add the elements to the end of the ZeroVec as bytes, which works so long as there is an even number of bytes (since they need to be u16s).

I don't think that example is accurate. Presumably we start with a ZeroVec<u16>, where we'd have to go through to_bytes() and try_from_bytes to add elements of a different type. This is messy and requires knowing zerovec internals that I don't want this code to require.

Ideally the compressible bits should be packed in u16s instead of u8s, whether we group by 8 or 16 shouldn't make a difference.

sffc · 2025-05-20T13:55:39Z

The resulting bytes stored in the ZeroVec are the same no matter how you construct it, whether you do the from-bytes thing or if you pack the bits into u16s (which is the same since we store the bytes in little-endian order). I don't really care which exact ZeroVec APIs are used to build and destructure the thing.

robertbastian · 2025-05-20T14:14:07Z

The resulting bytes stored in the ZeroVec are the same no matter how you construct it

Don't disagree. I do disagree which level of abstraction we should use, and raw bytes probably isn't it.

which is the same since we store the bytes in little-endian order

Yes, but the next person to touch this code should not need to know this.

Manishearth · 2025-05-20T14:18:28Z

I will note that the little endian thing is not ZeroVec internals; it is a documented invariant that we expose via various APIs. However, we do not expose good construction APIs for this.

If Henri wishes to use u8s I think the thing to do is:

1. construct two separate ZeroVecs; one for the u16 data, and one for the compressed_bytes data
2. reinterpret the compressed bytes data as a u16 zv via parse_bytes
3. append to the u16 zerovec
4. use as_bytes() on the u16 zerovec, offset by 8, to get the original byte array back

This doesn't require knowledge of endianness, this just requires knowledge of ZeroVec's serialization stability.
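The steps above can be modeled without the zerovec crate. The sketch below uses a hypothetical `U16Buffer` as a stand-in for the serialized form of a u16 ZeroVec (two little-endian bytes per element); it is a model of the layout, not the actual zerovec API, and all method names are invented for illustration.

```rust
/// Stand-in for the serialized form of a u16 ZeroVec: each element is
/// stored as two little-endian bytes. This is a model, not the zerovec API.
struct U16Buffer {
    bytes: Vec<u8>,
}

impl U16Buffer {
    fn from_u16s(values: &[u16]) -> Self {
        let bytes = values.iter().flat_map(|v| v.to_le_bytes()).collect();
        Self { bytes }
    }

    /// Appends raw bytes; they must form whole u16 elements.
    fn append_raw(&mut self, raw: &[u8]) -> Result<(), &'static str> {
        if raw.len() % 2 != 0 {
            return Err("raw data must have an even number of bytes");
        }
        self.bytes.extend_from_slice(raw);
        Ok(())
    }

    /// Reads element `index` as a u16.
    fn get(&self, index: usize) -> Option<u16> {
        let pair = self.bytes.get(index * 2..index * 2 + 2)?;
        Some(u16::from_le_bytes([pair[0], pair[1]]))
    }

    /// Returns the 32 bytes after the four u16 primaries (4 * 2 = 8 bytes).
    fn compressible_bytes(&self) -> Option<&[u8]> {
        self.bytes.get(8..40)
    }
}
```

The offset of 8 follows from four u16 elements occupying two bytes each; the appended bytes come back unchanged because nothing reorders them on the way in or out.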

robertbastian · 2025-05-20T14:24:24Z

Or we just use the zerovec as a vec and don't mess with the representation at all. The data here is bits, packing them into u16 instead of u8 changes nothing.

this just requires knowledge of ZeroVec's serialization stability.

There'd be a handful of people who can confidently review this, and who can confidently make changes to this in the future. I really don't see the point.

Manishearth · 2025-05-20T14:34:04Z

To be clear, I'm not disagreeing with your proposed route, but if Henri wishes to use u8s, there is a path that uses documented APIs.

Or we just use the zerovec as a vec and don't mess with the representation at all. The data here is bits, packing them into u16 instead of u8 changes nothing.

It's not always bits: He gets the data from ICU4C as u8s, so anything switching to u16s requires some part of the pipeline to think about endianness. ZeroVec is our go-to library for not having to think about endianness.

I don't think either proposed solution avoids having a mildly-tricky commented section of the code where we describe a u8 to u16 conversion.

sffc · 2025-05-20T14:36:58Z

Would it help if the ICU data was returned as an array of bools instead of packed u8s? I had suggested that in unicode-org/icu#3495. (It hasn't been released yet)

robertbastian · 2025-05-20T14:46:04Z

Yes. Otherwise the code I'd like to see would parse the packed ICU data into a sensible data type in datagen (i.e. [bool; 256]), and then pack it for ICU4X (I did something similar with the ICU Umm-Al-Qura, which has a custom packing format in ICU). If we can we should avoid parsing a custom ICU format.
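A rough sketch of that datagen shape: unpack the packed bytes into `[bool; 256]`, then pack the flags into u16 words. The bit order of the packed input (flag `b` in bit `b & 7` of byte `b >> 3`) is an assumption taken from the lookup quoted earlier in this thread, and all function names here are hypothetical.

```rust
/// Unpacks 32 packed bytes into one flag per byte value.
/// Assumes flag `i` is bit `i & 7` of byte `i >> 3`.
fn unpack_flags(packed: &[u8; 32]) -> [bool; 256] {
    let mut flags = [false; 256];
    for (i, flag) in flags.iter_mut().enumerate() {
        *flag = (packed[i >> 3] & (1u8 << (i & 7))) != 0;
    }
    flags
}

/// Packs the flags into 16-bit words: flag `i` is bit `i & 15` of word `i >> 4`.
fn pack_flags_u16(flags: &[bool; 256]) -> [u16; 16] {
    let mut words = [0u16; 16];
    for (i, &set) in flags.iter().enumerate() {
        if set {
            words[i >> 4] |= 1u16 << (i & 15);
        }
    }
    words
}

/// Looks up the flag for byte value `b` in the u16 packing.
fn is_set(words: &[u16; 16], b: u8) -> bool {
    (words[usize::from(b >> 4)] & (1u16 << (b & 15))) != 0
}
```

With this bit numbering, serializing the words in little-endian order yields the same 32 bytes as the u8 packing, which matches the earlier remark that the stored bytes are the same either way.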

sffc · 2025-05-20T15:33:42Z

unicode-org/icu#3498

components/collator/src/provider.rs

hsivonen · 2025-05-28T06:42:02Z

components/collator/src/comparison.rs

-        };
+            // Baked data without compressible bits, but not matching hardcoded data
+            return Err(
+                DataError::custom("cannot fall back to hardcoded compressible data")


I opened #6634 about the usefulness of errors like this as error results as opposed to panics. If this occurs, the app has been compiled in a bogus state, which isn't really a run-time-actionable thing.

hsivonen · 2025-05-28T06:44:07Z

components/collator/src/provider.rs

@@ -537,18 +556,68 @@ pub struct CollationSpecialPrimaries<'data> {
    /// character classes packed so that each fits in
    /// 16 bits. Length must match the number of enum
    /// variants in `MaxVariable`, currently 4.
+    ///
+    /// This is potentially followed by 256 bits


It would probably make sense to say that "potentially" is only about the case of using icu_collator_data 2.0.0 and in all other cases, it's a data generation bug for the extra data not to be there.

provider/source/src/collator/mod.rs

hsivonen

This PR is now for practical purposes authored by @robertbastian , but since GitHub treats me as the PR creator, GitHub won't let me approve.

LGTM with the inline nit, though.

Add compressible_bytes data

552c83d

hsivonen requested review from sffc, robertbastian, Manishearth, echeran and a team as code owners May 16, 2025 10:40

hsivonen added the C-collator Component: Collation, normalization label May 16, 2025

hsivonen mentioned this pull request May 16, 2025

Port sort key code from ICU4C (#2689) #6537

Open

Regenerate JSON data

39e5559

sffc previously approved these changes May 16, 2025

Use an array instead of ZeroVec for compressible_bytes

1ca5c94

hsivonen dismissed sffc’s stale review via 1ca5c94 May 19, 2025 12:35

Merge branch 'main' into compressiblebytes

2cc714c

robertbastian requested a review from sffc May 19, 2025 13:02

update versions

ee149c0

robertbastian force-pushed the compressiblebytes branch from ad54f4b to ee149c0 Compare May 19, 2025 13:53

sffc previously approved these changes May 19, 2025

robertbastian added 2 commits May 21, 2025 11:02

pack into existing zerovec

f9faf8f

fix

6462663

robertbastian requested review from sffc and removed request for robertbastian May 21, 2025 09:11

robertbastian added 2 commits May 21, 2025 11:40

bump icu tag

3c9db4a

use safe code for hardcoded data

42f77c9

hsivonen commented May 21, 2025

robertbastian added 2 commits May 21, 2025 15:06

validate to avoid zv indexing

b66a901

fix

e158a82

hsivonen commented May 28, 2025

hsivonen commented May 28, 2025

binary literal

1e0b39f

robertbastian approved these changes May 28, 2025

Merge branch 'main' into compressiblebytes

c0f7f46
