
Add compressible_bytes data #6604

Open
wants to merge 13 commits into
base: main
from

Conversation

hsivonen
Member

As a side effect, always load CollationSpecialPrimaries, since we no longer know at collator instantiation time if some of the data in the struct is going to be used.

Preparation for #6537

@hsivonen hsivonen added the C-collator Component: Collation, normalization label May 16, 2025

@hsivonen
Member Author

Deferring to @robertbastian for what to do about the crates.io-dependent test-tutorials task.

sffc
sffc previously approved these changes May 16, 2025
Member

@sffc sffc left a comment

Some TODOs in the code, which would presumably be resolved when unicode-org/icu/pull/3495 lands

Comment on lines 1533 to 1535
let field = self.arr[usize::from(b >> 3)];
let mask = 1 << (b & 0b111);
(field & mask) != 0
Member

This appears to correctly reverse the compression mapping in unicode-org/icu/pull/3495
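
The three lines under review read one flag out of a 256-bit set stored as 32 bytes. Below is a minimal sketch of that lookup, assuming the layout implied by the snippet (bit `b & 7` of byte `b >> 3`). The struct, the `is_compressible` name, and the `set` helper are hypothetical stand-ins for illustration, not the actual ICU4X definitions.

```rust
/// Hypothetical stand-in for the accessor type discussed in this PR:
/// 256 flags, one per possible byte value, packed into 32 bytes.
struct CompressibleBytes {
    arr: [u8; 32],
}

impl CompressibleBytes {
    /// Returns the flag for byte value `b`.
    /// Bit `b & 7` of byte `b >> 3` holds that flag.
    fn is_compressible(&self, b: u8) -> bool {
        let field = self.arr[usize::from(b >> 3)];
        let mask = 1u8 << (b & 0b111);
        (field & mask) != 0
    }

    /// Inverse operation, included only to make the packing explicit.
    fn set(&mut self, b: u8) {
        self.arr[usize::from(b >> 3)] |= 1u8 << (b & 0b111);
    }
}

fn main() {
    let mut bytes = CompressibleBytes { arr: [0; 32] };
    bytes.set(0x2A);
    bytes.set(0xFF);
    assert!(bytes.is_compressible(0x2A));
    assert!(bytes.is_compressible(0xFF));
    assert!(!bytes.is_compressible(0x2B));
}
```

Because `b >> 3` is at most 31 for any `u8`, the index cannot go out of bounds on a 32-byte array.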

Comment on lines 569 to 570
#[cfg_attr(feature = "serde", serde(borrow))]
pub compressible_bytes: ZeroVec<'data, u8>,
Member

Suggestion (optional): Since this struct now contains 2 fixed-length ZeroVecs, it would be slightly more efficient to represent them as a single ZeroVec, or possibly a MultiFieldsULE.

Member Author

The other ZeroVec isn't logically fixed-length for all time, but it has 4 items for now and has had 4 items for a long time.

Also, the other one has u16s, and I felt uneasy about introducing bugs to the old thing if I tweak it in a hurry.

Member

Note that a ZeroVec is sized at 3 × usize, which isn't much more efficient than just using a [u8; 32] here.

If you're trying to reduce stack size via indirection, perhaps we could go further and amortize costs? This is essentially the same thing as Shane's suggestion around collapsing the zerovecs, under the hood, but done in a way that allows you to expand last_primaries and also avoids funky indexing/bit math.

#[make_varule(CollationSpecialPrimariesInternalULE)]
struct CollationSpecialPrimariesInternal<'data> {
    numeric_primary: u8,
    bytes: [u8; 32],
    last_primaries: ZeroVec<'data, u16>,
}

struct CollationSpecialPrimaries<'data> {
    internal: VarZeroCow<'data, CollationSpecialPrimariesInternalULE>,
}

The way make_varule currently works, CollationSpecialPrimariesInternalULE would get a .bytes() method that returns a stack copy of bytes. We don't actually want that here, since it's a big type. A 4-register copy isn't a huge deal, but it's worth avoiding.

I have a couple ideas for designs for this, but broadly speaking this would be a good time to introduce some further control over these generated getters. A simple thing would be to add a #[zerovec::ref_getter] attribute to bytes that makes it return a reference instead. We should put a bit of thought into what the full extent of configurability looks like here, so we can design the attribute syntax holistically (zerovec::getter(ref)? Also should we be doing something special for AsULE<ULE = Self> types?) but

or possibly a MultiFieldsULE

We're not putting this in a VarZeroVec, so a VarULE type isn't that useful on its own. It can't be stuck flat into the struct above. Perhaps paired with a VarZeroCow, but the VarZeroCow has a similar stack cost as a VarZeroVec; we only really gain from it if we collapse the entire type as I sketch out above.

Member Author

which isn't much more efficient than just using a [u8; 32] here.

Are we allowed to use [u8; 32] directly in a data struct?

Member

Yes, as it's Copy. The reason not to use it would be stack size, not sure how relevant this is here.

Member

Observation: if you can get the compressible bytes to fit in [u8; 31] instead of [u8; 32], then it can fit into the padding of the CollationSpecialPrimaries and the stack size is equal to what it would be if you added another ZeroVec.
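
The padding arithmetic behind this observation can be checked with `size_of`. The sketch below is an assumption-laden model, not the real data struct: a ZeroVec is represented by three machine words (the "3 × usize" figure quoted earlier in this thread), the struct is reduced to the fields named in the discussion, and `#[repr(C)]` is used only to make the layout deterministic. All type names are hypothetical.

```rust
use std::mem::size_of;

/// Stand-in for one ZeroVec on the stack: three machine words,
/// following the "3 × usize" figure from the thread. Not the real type.
#[allow(dead_code)]
#[repr(C)]
struct ThreeWords([usize; 3]);

/// Option A: add a second vector next to the existing one.
#[allow(dead_code)]
#[repr(C)]
struct WithSecondVec {
    last_primaries: ThreeWords,
    compressible: ThreeWords,
    numeric_primary: u8,
}

/// Option B: inline 31 bytes directly after the `u8`.
#[allow(dead_code)]
#[repr(C)]
struct WithArray31 {
    last_primaries: ThreeWords,
    numeric_primary: u8,
    compressible: [u8; 31],
}

/// Option C: inline the full 32 bytes.
#[allow(dead_code)]
#[repr(C)]
struct WithArray32 {
    last_primaries: ThreeWords,
    numeric_primary: u8,
    compressible: [u8; 32],
}

fn main() {
    // The sizes depend on the target's pointer width, so they are
    // printed rather than asserted here.
    println!("second vector: {} bytes", size_of::<WithSecondVec>());
    println!("inline [u8; 31]: {} bytes", size_of::<WithArray31>());
    println!("inline [u8; 32]: {} bytes", size_of::<WithArray32>());
}
```

On a 64-bit target, the one extra byte in the 32-byte variant pushes the struct past an 8-byte alignment boundary, which is why the observation singles out 31.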

Member Author

Unfortunately, I don't know enough of what guarantees FractionalUCA/genuca make to know whether [u8; 31] would work.

Member Author

Yes, as it's Copy. The reason not to use it would be stack size, not sure how relevant this is here.

I didn't know that. I've refreshed the PR.

Stack size how? The data struct itself is still behind a reference, right?

Stack size might be relevant for the jamo table. The obvious follow-up question is: should the jamo table have been a direct array instead of ZeroVec?

Member

In the borrowed version it's behind a reference. In the owned version, the DataPayload contains the data struct (plus an RC to the buffer).

Member Author

Unfortunately, I don't know enough of what guarantees FractionalUCA/genuca make to know whether [u8; 31] would work.

Currently, the last item is non-zero and first item is zero.

@hsivonen
Member Author

Some TODOs in the code, which would presumably be resolved when unicode-org/icu/pull/3495 lands

The TODOs in datagen can be fixed independently of anything else. I didn't add datagen-level length enforcement here, because we don't have it for the jamo table. Also, datagen-level length enforcement isn't a must-have, because we have run-time length enforcement.

@hsivonen
Member Author

So the reason why this didn't get merged on Friday is that we ourselves depend on semver-correctness, so perhaps we should go back to my initial suggestion and do this by the book instead of trying a semver-violation in an effort to reduce the total number of structs.

I need to focus on other things this week, but the following would be by the book and surely landable from a semver standpoint:

  1. From this PR, keep the CompressibleBytes accessor wrapper type and the notion of hardcoding ICU4C 77.x values for compressible bytes in datagen when actual TOML is missing.
  2. Keep Collator loading SpecialPrimaries conditionally as today on main.
  3. Introduce a new singleton data struct just for compressible bytes.
  4. Introduce CollationKeyGenerator/CollationKeyGeneratorBorrowed that wraps Collator/CollatorBorrowed plus the data for compressible bytes.
  5. Put the collation key generation entry points on CollationKeyGeneratorBorrowed. (The internals could be on CollatorBorrowed if CompressibleBytes is given as an argument to the internals. Or the internals could be on CollationKeyGeneratorBorrowed.)

@sffc @robertbastian @Manishearth, what are your thoughts?

@robertbastian
Member

we ourselves depend on semver-correctness

We don't really. The argument that it's unlikely that anyone is using 2.0 custom baked data already still holds. The CI check is pretty much a semver check, so it's fine to disable it for this push.

@hsivonen
Member Author

we ourselves depend on semver-correctness

We don't really. The argument that it's unlikely that anyone is using 2.0 custom baked data already still holds. The CI check is pretty much a semver check, so it's fine to disable it for this push.

@sffc, if you re-approve, please also land this.

@robertbastian robertbastian requested a review from sffc May 19, 2025 13:02
sffc
sffc previously approved these changes May 19, 2025
@sffc
Member

sffc commented May 19, 2025

Additional ICU4X WG discussion with @sffc @Manishearth @robertbastian

Potential solutions:

  1. Append the compressible bytes to the existing zerovec now. Can introduce V2 in a future release.
  2. Maintain V1 and V2 data, generate both, consume one in unstable. No adapter. Remove V1 in 2.1.
    • Pro: Don't need to maintain V1 data for very long
    • Con: breaks new icu_collator with old icu_collator_data or custom baked data OR requires carrying additional adapter code.
  3. Just break it and yank
    • Con: Yank doesn't work on existing lockfiles. Bad ecosystem citizenship.
  4. Wait on adding this feature til 2.1. Use V1 V2 model with an adapter.
    • Does not have the same con as (2) because of our ~ deps.
    • Con: We've wanted this feature for ages.
    • Con: Collator 2.1 needs to fill in data in the adapter if it is given V1 buffer data
    • Con: Could result in version skew: for example, icu_collator ships Unicode 16 data in its adapter, but icu4x-datagen 2.0.0 could be run with Unicode 17.
  5. Make a new type that wraps the existing type; CollationSearch. Uses a second data key.
    • Con: More stack size (though it's unclear if it's that important for a mostly-singleton type)
    • Con: More types.
  6. Drop V1 and release 2.1 soon (this week)
    • Pro: No semver problems
    • Pro: Don't need to maintain V1 data
    • Con: Technically breaks new-code-old-data
    • Con: Not a good look for the project
  • @sffc The reason to land soon and release is we don't want to maintain this V1 V2 thing.
  • @robertbastian I think the ship on quick post-release fixes has sailed: Henri has released collator in idna, and burntsushi has built a whole crate around icu4x 2.0 already
  • @sffc Agreed

Conclusion:

  • (For main) Proceed with "Option 1": Append the compressible bytes to the existing zerovec. When loading data, if the zerovec is too small, fill in Unicode 16 data hardcoded in icu_collator.
  • (Soon, for main, pre-2.1) Make sure icu_provider_source fails if generating data from a newer version of icuexportdata (maybe also cldr) without a --force type flag, in order to prevent accidentally generating Unicode 17 data with the shorter vec
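
The load-time behavior described under "Option 1" could look roughly like the following. This is a sketch under several assumptions: the 256 bits are stored as sixteen `u16` units after the four existing elements (one of the packings discussed later in this thread), the fallback table is a zero-filled placeholder rather than the real Unicode 16 values, and every name here is hypothetical.

```rust
/// Number of existing elements, per the "currently 4" doc comment in this PR.
const LAST_PRIMARIES_LEN: usize = 4;
/// 256 bits packed into 16-bit units.
const COMPRESSIBLE_UNITS: usize = 16;

/// Placeholder only. The real fallback would be the Unicode 16 table
/// hardcoded in icu_collator, which is not reproduced here.
const HARDCODED_FALLBACK: [u16; COMPRESSIBLE_UNITS] = [0; COMPRESSIBLE_UNITS];

#[derive(Debug, PartialEq)]
enum SplitError {
    UnexpectedLength(usize),
}

/// Splits the combined vector into (last primaries, compressible bits),
/// substituting the fallback when the trailing bits are absent.
fn split_special_primaries(
    units: &[u16],
) -> Result<(&[u16], [u16; COMPRESSIBLE_UNITS]), SplitError> {
    if units.len() == LAST_PRIMARIES_LEN {
        // Old-style data: only the four original elements are present.
        Ok((units, HARDCODED_FALLBACK))
    } else if units.len() == LAST_PRIMARIES_LEN + COMPRESSIBLE_UNITS {
        // New-style data: the bits follow the original elements.
        let (head, tail) = units.split_at(LAST_PRIMARIES_LEN);
        let mut bits = [0u16; COMPRESSIBLE_UNITS];
        bits.copy_from_slice(tail);
        Ok((head, bits))
    } else {
        Err(SplitError::UnexpectedLength(units.len()))
    }
}

fn main() {
    let old_data = [1u16, 2, 3, 4];
    let (head, bits) = split_special_primaries(&old_data).unwrap();
    assert_eq!(head.len(), LAST_PRIMARIES_LEN);
    assert_eq!(bits, HARDCODED_FALLBACK);
}
```

Rejecting every other length keeps a truncated or over-long vector from being silently treated as old data.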

LGTM: @sffc @robertbastian @Manishearth

To decide later:

  • In 2.1, change the V1 struct's baked invariants to require the data, and blob will need to allocate this data if it's missing
  • If we really want a release we can do a patch release after the PR is merged, otherwise this will be in 2.1

@sffc sffc dismissed their stale review May 19, 2025 16:09

Un-approving based on the discussion above

@hsivonen
Member Author

  • Proceed with "Option 1": Append the compressible bytes to the existing zerovec.

The current ZeroVec has elements of u16. Is the intent still to eventually use compressible bytes as &[u8; 32] (as opposed to &[u16; 16])? What's the correct way to append raw bytes to a ZeroVec of u16 without accidentally causing each pair of bytes to flip?

(Also, I really need to work on other stuff this week. Sorry.)

@sffc
Member

sffc commented May 20, 2025

You can just add the elements to the end of the ZeroVec<u16> as bytes, which works so long as there is an even number of bytes (since they need to be u16s).

https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=fdf57b4edbf073ef295f3729754bc855

@robertbastian
Member

You can just add the elements to the end of the ZeroVec as bytes, which works so long as there is an even number of bytes (since they need to be u16s).

I don't think that example is accurate. Presumably we start with a ZeroVec<u16>, where we'd have to go through to_bytes() and try_from_bytes to add elements of a different type. This is messy and requires knowing zerovec internals that I don't want this code to require.

Ideally the compressible bits should be packed in u16s instead of u8s; whether we group by 8 or 16 shouldn't make a difference.

@sffc
Member

sffc commented May 20, 2025

The resulting bytes stored in the ZeroVec are the same no matter how you construct it, whether you do the from-bytes thing or if you pack the bits into u16s (which is the same since we store the bytes in little-endian order). I don't really care which exact ZeroVec APIs are used to build and destructure the thing.
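
The equivalence claimed here can be checked with the standard library alone. The sketch below assumes bit `i & 7` of byte `i >> 3` for the `u8` packing and bit `i & 15` of unit `i >> 4` for the `u16` packing; `to_le_bytes` stands in for the little-endian storage order mentioned in the comment. No zerovec API is used, and the function names are hypothetical.

```rust
/// Packs 256 flags eight per byte (bit `i & 7` of byte `i >> 3`).
fn pack_u8(flags: &[bool; 256]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, &flag) in flags.iter().enumerate() {
        if flag {
            out[i >> 3] |= 1 << (i & 7);
        }
    }
    out
}

/// Packs the same flags sixteen per unit (bit `i & 15` of unit `i >> 4`).
fn pack_u16(flags: &[bool; 256]) -> [u16; 16] {
    let mut out = [0u16; 16];
    for (i, &flag) in flags.iter().enumerate() {
        if flag {
            out[i >> 4] |= 1 << (i & 15);
        }
    }
    out
}

/// Writes each 16-bit unit in little-endian byte order.
fn serialize_le(units: &[u16; 16]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (chunk, unit) in out.chunks_exact_mut(2).zip(units) {
        chunk.copy_from_slice(&unit.to_le_bytes());
    }
    out
}

fn main() {
    let mut flags = [false; 256];
    for i in [0usize, 7, 8, 15, 16, 200, 255] {
        flags[i] = true;
    }
    // Both routes produce the same 32 bytes.
    assert_eq!(serialize_le(&pack_u16(&flags)), pack_u8(&flags));
}
```

The match holds because the low byte of each little-endian unit carries bits 0 to 7 and the high byte carries bits 8 to 15, which is exactly where the `u8` packing places the same flags.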

@robertbastian
Member

The resulting bytes stored in the ZeroVec are the same no matter how you construct it

Don't disagree. I do disagree about which level of abstraction we should use, and raw bytes probably isn't it.

which is the same since we store the bytes in little-endian order

Yes, but the next person to touch this code should not need to know this.

@Manishearth
Member

I will note that the little endian thing is not ZeroVec internals; it is a documented invariant that we expose via various APIs. However, we do not expose good construction APIs for this.

If Henri wishes to use u8s I think the thing to do is:

  • construct two separate ZeroVecs; one for the u16 data, and one for the compressed_bytes data
  • reinterpret the compressed bytes data as a u16 zv via parse_bytes
  • append to the u16 zerovec
  • use as_bytes() on the u16 zerovec, offset by 8, to get the original byte array back

This doesn't require knowledge of endianness, this just requires knowledge of ZeroVec's serialization stability.
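
The steps above can be modeled without the zerovec crate to see why the offset is 8. In this sketch a plain `Vec<u16>` stands in for the `u16` ZeroVec, and `u16::from_le_bytes` / `to_le_bytes` stand in for its little-endian byte representation; the real `parse_bytes` and `as_bytes()` calls named in the comment are not used, and the helper names are hypothetical.

```rust
/// Reinterprets 32 raw bytes as sixteen little-endian u16 units and
/// appends them to the four existing elements.
fn append_bytes(last_primaries: &[u16; 4], compressible: &[u8; 32]) -> Vec<u16> {
    let mut combined = last_primaries.to_vec();
    combined.extend(
        compressible
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]])),
    );
    combined
}

/// Models the byte view of the combined vector.
fn as_le_bytes(units: &[u16]) -> Vec<u8> {
    units.iter().flat_map(|unit| unit.to_le_bytes()).collect()
}

fn main() {
    // Arbitrary illustrative values, not real collation data.
    let last_primaries = [0x0100u16, 0x0203, 0x0405, 0x0607];
    let mut compressible = [0u8; 32];
    for (i, byte) in compressible.iter_mut().enumerate() {
        *byte = i as u8;
    }

    let combined = append_bytes(&last_primaries, &compressible);
    let bytes = as_le_bytes(&combined);

    // Four u16 elements occupy the first 8 bytes; the original byte
    // array follows unchanged.
    assert_eq!(combined.len(), 20);
    assert_eq!(&bytes[8..], &compressible[..]);
}
```

The byte order never has to be reasoned about at the call site, because the same convention is applied in both directions and cancels out.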

@robertbastian
Member

Or we just use the zerovec as a vec and don't mess with the representation at all. The data here is bits, packing them into u16 instead of u8 changes nothing.

this just requires knowledge of ZeroVec's serialization stability.

There'd be a handful of people who can confidently review this, and who can confidently make changes to this in the future. I really don't see the point.

@Manishearth
Member

Manishearth commented May 20, 2025

To be clear, I'm not disagreeing with your proposed route, but if Henri wishes to use u8s, there is a path that uses documented APIs.

Or we just use the zerovec as a vec and don't mess with the representation at all. The data here is bits, packing them into u16 instead of u8 changes nothing.

It's not always bits: He gets the data from ICU4C as u8s, so anything switching to u16s requires some part of the pipeline to think about endianness. ZeroVec is our go-to library for not having to think about endianness.

I don't think either proposed solution avoids having a mildly-tricky commented section of the code where we describe a u8 to u16 conversion.

@sffc
Member

sffc commented May 20, 2025

Would it help if the ICU data was returned as an array of bools instead of packed u8s? I had suggested that in unicode-org/icu#3495. (It hasn't been released yet)

@robertbastian
Member

Yes. Otherwise the code I'd like to see would parse the packed ICU data into a sensible data type in datagen (i.e. [bool; 256]), and then pack it for ICU4X (I did something similar with the ICU Umm-Al-Qura, which has a custom packing format in ICU). If we can, we should avoid parsing a custom ICU format.
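
A datagen step along these lines might be sketched as follows. It assumes the packed ICU input uses bit `b & 7` of byte `b >> 3` (the layout implied by the lookup reviewed earlier in this PR) and that the ICU4X side stores sixteen `u16` units; all function names are hypothetical and nothing here is taken from the actual datagen code.

```rust
/// Parses the packed input into one flag per byte value.
fn unpack_icu(packed: &[u8; 32]) -> [bool; 256] {
    let mut flags = [false; 256];
    for (i, flag) in flags.iter_mut().enumerate() {
        *flag = (packed[i >> 3] & (1 << (i & 7))) != 0;
    }
    flags
}

/// Packs the flags for storage, sixteen per u16 unit.
fn pack_for_icu4x(flags: &[bool; 256]) -> [u16; 16] {
    let mut units = [0u16; 16];
    for (i, &flag) in flags.iter().enumerate() {
        if flag {
            units[i >> 4] |= 1 << (i & 15);
        }
    }
    units
}

/// Run-time lookup against the u16 form.
fn is_compressible(units: &[u16; 16], b: u8) -> bool {
    (units[usize::from(b >> 4)] & (1 << (b & 0xF))) != 0
}

fn main() {
    // Arbitrary illustrative input, not real collation data.
    let mut packed = [0u8; 32];
    packed[0] = 0b0000_0100; // byte value 2
    packed[31] = 0b1000_0000; // byte value 255

    let flags = unpack_icu(&packed);
    let units = pack_for_icu4x(&flags);

    assert!(is_compressible(&units, 2));
    assert!(is_compressible(&units, 255));
    assert!(!is_compressible(&units, 3));
}
```

Going through `[bool; 256]` keeps knowledge of the input format in one function, so the storage packing can change without touching the parser.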

@sffc
Member

sffc commented May 20, 2025

unicode-org/icu#3498

@robertbastian robertbastian requested review from sffc and removed request for robertbastian May 21, 2025 09:11
};
// Baked data without compressible bits, but not matching hardcoded data
return Err(
DataError::custom("cannot fall back to hardcoded compressible data")
Member Author

I opened #6634 about the usefulness of errors like this as error results as opposed to panics. If this occurs, the app has been compiled in a bogus state, which isn't really a run-time-actionable thing.

@@ -537,18 +556,68 @@ pub struct CollationSpecialPrimaries<'data> {
/// character classes packed so that each fits in
/// 16 bits. Length must match the number of enum
/// variants in `MaxVariable`, currently 4.
///
/// This is potentially followed by 256 bits
Member Author

It would probably make sense to say that "potentially" is only about the case of using icu_collator_data 2.0.0 and in all other cases, it's a data generation bug for the extra data not to be there.

Member Author

@hsivonen hsivonen left a comment

This PR is now, for practical purposes, authored by @robertbastian, but since GitHub treats me as the PR creator, GitHub won't let me approve.

LGTM with the inline nit, though.

Labels
C-collator Component: Collation, normalization
4 participants