
Port sort key code from ICU4C (#2689) #6537


Open · wants to merge 27 commits into main

Conversation


@rs-sac rs-sac commented May 6, 2025

I mostly translated the C++ code straight across, but I made some minor changes to match the mindset of the Rust codebase, to follow Rust practice, to satisfy Clippy, and to reduce copy-and-paste.

A comment on the issue suggested handling the identical level by appending UTF-8. That's not what ICU4C does, but it's a whole lot simpler, so that's what I did.

At the end, some simple tests are included.

Fixes #2689

@rs-sac rs-sac requested review from hsivonen, echeran and a team as code owners May 6, 2025 16:21

CLAassistant commented May 6, 2025

CLA assistant check
All committers have signed the CLA.

@gemini-code-assist bot left a comment

Hello @rs-sac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request ports the sort key code from ICU4C to Rust, adding a write_sort_key API to the Collator. The code was largely translated directly from C++, with minor adjustments to align with Rust conventions and address Clippy recommendations. The author implemented a simpler approach for handling the identical level by appending UTF-8, differing from ICU4C's method. The author also added some simple tests.
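A hedged usage sketch of the new API (the write_sort_key signature appears in a snippet quoted later in this thread; the Vec<u8> sink impl comes from the PR per the highlights below, and how the collator is constructed is elided here):

use core::cmp::Ordering;
use icu_collator::CollatorBorrowed;

// Hypothetical helper: byte-wise comparison of two sort keys should agree
// with comparing the strings directly through the collator.
fn keys_agree(collator: &CollatorBorrowed<'_>, a: &str, b: &str) -> bool {
    let mut key_a: Vec<u8> = Vec::new();
    let mut key_b: Vec<u8> = Vec::new();
    collator.write_sort_key(a, &mut key_a); // new API from this PR
    collator.write_sort_key(b, &mut key_b);
    let by_key: Ordering = key_a.cmp(&key_b);
    by_key == collator.compare(a, b)
}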

Highlights

  • Porting Sort Key Code: The primary focus is porting the sort key generation algorithm from ICU4C's C++ implementation to Rust, enhancing the icu_collator crate.
  • API Addition: A new write_sort_key API is added to the Collator struct, enabling the generation of sort keys.
  • UTF-8 Normalization: The implementation handles identical-level comparisons by appending the input's UTF-8 NFD normalization, a simplification compared to ICU4C.
  • Write Trait: A new Write trait is introduced to allow writing to buffer-like objects, since this crate does not have access to std (a sketch follows below).
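A minimal sketch of what such a no-std sink trait can look like, based on the write/write_byte calls visible in snippets later in this thread and the Vec impl noted in the changelog below (the PR's actual trait may differ):

extern crate alloc;
use alloc::vec::Vec;

// Sketch only: a byte sink usable without std.
pub trait Write {
    fn write(&mut self, bytes: &[u8]);

    // Per-byte convenience; a snippet later in this thread shows this body.
    fn write_byte(&mut self, b: u8) {
        self.write(&[b]);
    }
}

impl Write for Vec<u8> {
    fn write(&mut self, bytes: &[u8]) {
        self.extend_from_slice(bytes);
    }
}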

Changelog

  • CHANGELOG.md
    • Added a changelog entry for the icu_collator crate, noting the addition of the write_sort_key API.
  • components/collator/src/comparison.rs
    • Added extern crate alloc and use alloc::vec::Vec to enable dynamic memory allocation.
    • Introduced constants for special sort key bytes, level separators, and compression terminators.
    • Defined an enum Level for specifying comparison levels.
    • Added the write_sort_key function to write sort key bytes up to the collator's strength.
    • Added the write_sort_key_up_to_quaternary function to write sort key bytes up to the collator's strength, with options to write the case level and stop writing levels based on a provided function.
    • Added the Write trait and implementations for Vec<u8> and SmallVec<[u8; N]>.
    • Added the SortKeyLevel struct to manage sort key levels.
    • Refactored compare_impl to use helper functions for tailoring and numeric primary handling.
    • Added helper functions tailoring_or_root, numeric_primary, collation_elements, and variable_top to improve code organization and readability.
  • components/collator/src/elements.rs
    • Added NO_CE_QUATERNARY constant.
    • Added ignorable function to NonPrimary struct.
  • components/collator/src/options.rs
    • No functional changes, only a minor comment update for clarity.


From C++ to Rust,
A sort key finds new life,
Strings now find their place.


@gemini-code-assist bot left a comment

Code Review

This pull request ports the sort key code from ICU4C to Rust, which is a significant enhancement for the icu_collator crate. The author has made a commendable effort to translate the C++ code while adapting it to Rust conventions and addressing Clippy's suggestions. The addition of tests and a line-by-line audit further contribute to the reliability of the code. Overall, this is a valuable contribution.

Summary of Findings

  • Missing compressibleBytes and minLevel: The author intentionally left out compressibleBytes and minLevel from the C++ code. It's important to confirm whether these omissions are acceptable and won't affect the functionality or performance of the ported code. Further investigation may be required to determine if these components are necessary for the Rust implementation.
  • Potential performance bottlenecks: The code contains several loops and macro usages that could potentially lead to performance bottlenecks. While the author has made an effort to optimize the code, it's important to carefully analyze these sections and identify any areas where further optimization may be necessary. Consider using profiling tools to measure the performance of the code and identify any hotspots.
  • Inconsistent naming: There are inconsistencies in the naming of variables and constants throughout the code. For example, some constants use snake_case while others use SCREAMING_SNAKE_CASE. It's important to establish a consistent naming convention and apply it throughout the code to improve readability and maintainability.

Merge Readiness

The pull request introduces a significant new feature and includes tests, which is a positive sign. However, the compressibleBytes and minLevel omissions need to be addressed, and the potential performance bottlenecks should be investigated. I am unable to approve this pull request, and recommend that the pull request not be merged until the high severity issues are addressed. Other reviewers should also review and approve this code before merging.


rs-sac commented May 6, 2025

I see that most of the test failures boil down to my calling new_nfd outside of a #[cfg(feature = "compiled_data")]. If you could advise me on what I did wrong (call something else, add the feature attribute there, do something else), I'll endeavor to fix it.

@rs-sac rs-sac requested a review from Manishearth as a code owner May 6, 2025 19:36
@Manishearth Manishearth removed their request for review May 7, 2025 00:29
@Manishearth commented:

(I don't tend to review collator stuff)

@hsivonen hsivonen added the C-collator Component: Collation, normalization label May 7, 2025

hsivonen commented May 7, 2025

Thank you for the PR!

I acknowledge that this is my area to review, but I couldn't get to it today.


sffc commented May 9, 2025

Thanks for the contribution! A few things:

  1. I received confirmation that your Corporate CLA is on file. You can select the appropriate option for yourself on cla-assistant.io.
  2. This PR was opened just hours before Identical prefix optimization for the collator #6496 landed, so this PR will need to be merged/rebased.
  3. About the data provider CI errors: if you call any function that is #[cfg(feature = "compiled_data")], then your function also needs to be #[cfg(feature = "compiled_data")], and it should have a corresponding _with_buffer_provider and _unstable. We usually have an internal function that takes the resolved objects and then we write these three functions as thin wrappers on top of the internal function.
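A hedged, self-contained sketch of the pattern described in item 3: one internal function taking resolved objects, plus thin public wrappers (every name here is a hypothetical stand-in, not ICU4X's actual API):

struct ResolvedData; // stands in for the resolved data payload(s)
struct DataError;

struct SortKeyThing {
    data: ResolvedData,
}

impl SortKeyThing {
    // Internal constructor: takes already-resolved objects, no data loading.
    fn try_new_internal(data: ResolvedData) -> Result<Self, DataError> {
        Ok(Self { data })
    }

    // Compiled only when baked data is available.
    #[cfg(feature = "compiled_data")]
    pub fn try_new() -> Result<Self, DataError> {
        Self::try_new_internal(ResolvedData) // a baked singleton stands in here
    }

    // The _with_buffer_provider and _unstable wrappers would likewise load
    // the data through a provider and delegate; a closure stands in for that.
    pub fn try_new_unstable(
        load: impl FnOnce() -> Result<ResolvedData, DataError>,
    ) -> Result<Self, DataError> {
        Self::try_new_internal(load()?)
    }
}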

@Manishearth commented:

3. About the data provider CI errors: if you call any function that is #[cfg(feature = "compiled_data")], then your function also needs to be #[cfg(feature = "compiled_data")], and it should have a corresponding _with_buffer_provider and _unstable. We usually have an internal function that takes the resolved objects and then we write these three functions as thin wrappers on top of the internal function.

I think in this case this is not the advice to follow: this code is deep within the implementation. It needs to use the decomposition data that is available within the Collator/CollatorBorrowed structs, and if it needs more, then these structs should load more during data loading.

I don't have time to really go deeper into how it looks, but other people here should be able to.


sffc commented May 9, 2025

I assume it's code like this which is causing the error:

impl CollatorBorrowed<'_> {
    /// Write the sort key bytes up to the collator's strength.
    ///
    /// For identical strength, the UTF-8 NFD normalization is appended for breaking ties.
    /// The key ends with a `TERMINATOR_BYTE`.
    pub fn write_sort_key<S>(&self, s: &str, sink: &mut S)
    where
        S: Write,
    {
        self.write_sort_key_up_to_quaternary(s.chars(), sink, |_| true);

        if self.options.strength() == Strength::Identical {
            let nfd = icu_normalizer::DecomposingNormalizerBorrowed::new_nfd();
            let s = nfd.normalize(s);
            sink.write_byte(LEVEL_SEPARATOR_BYTE);
            sink.write(s.as_bytes());
        }

        sink.write(&[TERMINATOR_BYTE]);
    }
}

If you need data beyond what is in the Collator, then you might need to load that data in the Collator constructor, or else create a new struct.

However, I think the data you need might already be there. CollatorBorrowed already contains a field decompositions: &'a DecompositionData<'a>, which you can use to create a Decomposition (see line 594) and presumably also a DecomposingNormalizerBorrowed.


rs-sac commented May 9, 2025

Thanks for the pointers. I think I have fixed the problem, though I await output from the official tests.


rs-sac commented May 9, 2025

Rather than add to the temporary diplomat exclusion file, I added my stuff to the more documented allow list.

I'm not an expert on diplomat, but I'm pretty sure it doesn't make sense to expose the internal-use constructor I created. OTOH, ideally the sort key code would be exposed, but AFAICT as a non-expert, that can't happen because it uses traits.

(Since I don't know what reviewers are necessary, I didn't add or change any reviewers, even though this web site says I did. It's completely making that up.)

@hsivonen commented:

However, I think the data you need might already be there. CollatorBorrowed already contains a field decompositions: &'a DecompositionData<'a>, which you can use to create a Decomposition (see line 594) and presumably also a DecomposingNormalizerBorrowed.

The data is indeed already there. The collator already creates Decomposition using the data it already has.

It looks like a pub (but probably doc-hidden) constructor for creating a DecomposingNormalizerBorrowed from the data that the collator already holds is needed (plugging None and the correct u8 constants for NFD in the fields that need input not held by the collator).
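A hedged, self-contained sketch of that idea (every field name below is an illustrative stand-in, not icu_normalizer's real internals):

// Stand-in for the decomposition data struct the collator already holds.
pub struct DecompositionData<'a> {
    pub trie: &'a [u8],
}

// Stand-in for the borrowed normalizer; the real fields differ.
pub struct DecomposingNormalizerBorrowed<'a> {
    decompositions: &'a DecompositionData<'a>,
    supplement: Option<&'a [u8]>, // input the collator does not hold
    passthrough_bound: u8,        // NFD-specific constant
}

impl<'a> DecomposingNormalizerBorrowed<'a> {
    #[doc(hidden)] // public for the collator's use, not part of the stable API
    pub fn new_nfd_from_parts(decompositions: &'a DecompositionData<'a>) -> Self {
        Self {
            decompositions,
            supplement: None,        // plug None where the collator has no input
            passthrough_bound: 0xC0, // illustrative value, not the real constant
        }
    }
}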


fn collation_elements<'a, I>(
Member:

This should optimize to not cause a move of CollationElements, but rerunning the benches is needed to confirm.

Author:

This was a regression. Inlining wasn't enough to fix it; I had to macroize it.
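A minimal, self-contained illustration of the difference at play (purely illustrative; not the PR's actual helper): a function taking a large struct by value moves it, while a macro expands at the call site and operates on it in place.

struct Big([u8; 4096]); // stand-in for a large struct like CollationElements

fn touch_fn(mut big: Big) -> Big {
    big.0[0] = 1; // the whole struct is moved in and must be moved back out
    big
}

macro_rules! touch_macro {
    ($big:expr) => {
        $big.0[0] = 1 // expands in place; no move of the struct
    };
}

fn main() {
    let mut big = Big([0; 4096]);
    big = touch_fn(big);
    touch_macro!(big);
    assert_eq!(big.0[0], 1);
}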

@hsivonen commented:

However, I think the data you need might already be there. CollatorBorrowed already contains a field decompositions: &'a DecompositionData<'a>, which you can use to create a Decomposition (see line 594) and presumably also a DecomposingNormalizerBorrowed.

The data is indeed already there. The collator already creates Decomposition using the data it already has.

It looks like a pub (but probably doc-hidden) constructor for creating a DecomposingNormalizerBorrowed from the data that the collator already holds is needed (plugging None and the correct u8 constants for NFD in the fields that need input not held by the collator).

Ah, nevermind, it's already in the changeset.

@hsivonen left a comment

Generally looks great, but I have some change requests inline.

I think the biggest concern is planning correctly around the compressibility table that ICU4X currently doesn't have.

I intend to discuss that aspect with other ICU4X developers on Thursday.

// primary level but terminate compression on all levels and then exit the loop.
if p > NO_CE_PRIMARY && levels & PRIMARY_LEVEL_FLAG != 0 {
// Test the un-reordered primary for compressibility.
let is_compressible = false; // TODO?
Member:

Hmm. Indeed, we don't have data for this. I'm assuming that for comparison outcomes, it's OK to treat all primaries as uncompressible, but it would be good to get confirmation from @markusicu.

If we decide to add data for compressibility, such an addition would have inconvenient implications for whether the sort key functionality should then move to a struct other than CollatorBorrowed that loads more data than CollatorBorrowed does. If we allowed the creation of collation keys with or without compressibility data, we'd end up with two kinds of mutually incompatible collation keys within a given ICU4X release, unless the design is such that comparing compressed with uncompressed works. @markusicu, is the design such that it's OK to compare a collation key created with some primary compression with a collation key created with all primaries treated as uncompressible?

Member:

I've thought about this more. Here's what I think we should do (not necessarily in this PR, but before releasing this as non-experimental):

  • Make icuexportdata export 256 bits of compressibleBytes data for the root. (AFAICT, only the root has these bits.) Perhaps export this as an array of 256 bytes each of which is 1 or 0 without packing into an array of four 64-bit integers on the ICU4C side and leave the packing to datagen.
  • Introduce CollationKeyGeneratorBorrowed and CollationKeyGenerator that hold a CollatorBorrowed plus & [u64; 4] and Collator plus [u64; 4], respectively.
  • Introduce a singleton data key for compressible bytes, which is a table of 256 bits represented as [u64; 4] (see the lookup sketch after this list).
  • Put the public entry points for collation key generation on CollationKeyGeneratorBorrowed.
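A small self-contained sketch of the proposed lookup (names and values are made up): one bit per primary lead byte, packed into [u64; 4].

// Returns whether the given lead byte is marked compressible in a 256-bit
// table packed as four u64s (bit i of word i/64, little-endian bit order).
fn is_compressible(table: &[u64; 4], lead_byte: u8) -> bool {
    (table[(lead_byte >> 6) as usize] >> (lead_byte & 0x3F)) & 1 != 0
}

fn main() {
    let mut table = [0u64; 4];
    table[0] |= 1 << 0x05; // mark lead byte 0x05 compressible (made-up value)
    assert!(is_compressible(&table, 0x05));
    assert!(!is_compressible(&table, 0x06));
}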

Member:

  • and Collator plus [u64; 4],

With the [u64; 4] inside a data struct.

Member:

See the top-level comment with minutes from the meeting. Let's proceed without creating a separate CollationKeyGeneratorBorrowed struct.

I'll create a PR to update CollationSpecialPrimaries with a new field.

Member:

The PR is #6604

self.write(&[b]);
}

/// Write leading bytes up to a zero byte, but always write at least one byte.
Member:

AFAICT, this is always called with an array of three bytes whose first byte is known to be non-zero, so do we need to check here for writing at least one byte?

Also, wouldn't it be better to have this as a helper function outside the trait so that trait implementors don't get to override this method with a broken implementation?

Member:

Since there's one call site, with at most 3 bytes, and it already checks the first byte for zero, it seems to me that nested ifs checking the other two bytes and writing each with a per-byte call would work.
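A hedged sketch of the suggested shape, assuming the sink trait's write_byte from elsewhere in this thread (the function name and surrounding context are assumed):

// Caller guarantees b0 != 0; stop at the first zero among the trailing bytes.
fn write_up_to_zero<W: Write>(sink: &mut W, b0: u8, b1: u8, b2: u8) {
    sink.write_byte(b0);
    if b1 != 0 {
        sink.write_byte(b1);
        if b2 != 0 {
            sink.write_byte(b2);
        }
    }
}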

Author:

Done.

}

/// Write the sort key bytes up to the collator's strength.
///
Member:

Before release, this needs documentation covering at least:

  • A non-key comparison with the collator typically entails a single primary-level comparison after code unit-level identical prefix skipping, so it takes a lot of comparisons to amortize the performance cost of generating collation keys. Applications should not use collation keys as a performance optimization without measuring that the performance cost of generating the collation keys is actually amortized.
  • The stored collation keys should be presumed to become invalidated by a CLDR update, a Unicode Database update (i.e. new version of Unicode), or an update to the ICU4X code. Applications must, therefore, be prepared to recompute all collation keys and take the performance cost of doing so into account when considering whether the performance cost of collation keys is amortized.
  • It has happened more than once in the past with other collator implementations that a database has stored collation keys and then grown so large that regenerating the keys has become too expensive. That has then resulted in engineering effort to keep using a frozen copy of the collation key generation algorithm and its associated data while making other updates to the software. Users of ICU4X should not expect ICU4X to accommodate this kind of need if they get into this situation despite this warning.

Adding the warnings can be a follow-up, though.

Author:

Done.

@hsivonen hsivonen added the discuss-priority Discuss at the next ICU4X meeting label May 14, 2025
@hsivonen commented:

A couple of things that aren't really review comments but more general discussion points:

  1. Should the sink trait be called Write, because traits roughly like that are customarily called Write, or should the trait be called something else so that it's obvious from the unqualified name that it's not one of the other Writes?
  2. Should the methods on the trait return Result even though when implemented for Vec and SmallVec, the implementation would always return Ok(())? Is it valuable to be able to write fallible sinks when the internal SmallVecs can panic on OOM anyway? In the case of the normalizer, it's less likely to get internal SmallVec OOM, and since the sink trait can propagate OOM, I ended up propagating OOM in my WIP Gecko patch.
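A hedged sketch of the fallible variant contemplated in item 2 (names hypothetical): the trait methods return Result, and the Vec impl is infallible.

extern crate alloc;
use alloc::vec::Vec;

// Sketch of a fallible sink; an OOM-aware sink could return Err instead.
pub trait TryWrite {
    type Error;
    fn try_write(&mut self, bytes: &[u8]) -> Result<(), Self::Error>;
}

impl TryWrite for Vec<u8> {
    type Error = core::convert::Infallible;
    fn try_write(&mut self, bytes: &[u8]) -> Result<(), Self::Error> {
        self.extend_from_slice(bytes);
        Ok(()) // Vec growth either succeeds or panics/aborts on OOM
    }
}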


rs-sac commented May 14, 2025

  1. Should the sink trait be called Write, because traits roughly like that are customarily called Write, or should the trait be called something else so that it's obvious from the unqualified name that it's not one of the other Writes?

As a newbie, I don't have a strong opinion. I took the name suggested in the original issue.

  2. Should the methods on the trait return Result even though when implemented for Vec and SmallVec, the implementation would always return Ok(())? Is it valuable to be able to write fallible sinks when the internal SmallVecs can panic on OOM anyway?

I had the same thoughts. It appeared to me that while the C++ interface supports errors, they were all related to OOM, which is just a crash in Rust (lame IMO, but nobody asked my opinion on that point). Since I didn't see any plausible scenario wherein an error would be returned, I skipped the error business, thinking that if I'd overlooked anything, it would be pointed out in review.

Thanks for the rest of the comments. It will take me a little while to work through them all.

@hsivonen commented:

From the meeting today:

  • @hsivonen Overall, high quality. I found two cases of what appear to be minor typos. There are three things I'd like input on. The most substantive one is that this will need 256 bits of data exported from ICU4C into a new kind of data struct that we don't currently have. If we follow the ICU4X philosophy properly, this method would need to be on a different object, implying things like CollationKeyGenerator and CollationKeyGeneratorBorrowed. What should we do in this case?
  • @sffc 256 bits, not even bytes?
  • @hsivonen Yes. ICU4C uses bytes, but we can use bits.
  • @robertbastian I think we should add a new data marker, update the unstable constructor, and fall back to hardcoded data in the buffer constructor
  • @hsivonen We should ask @markusicu to verify whether the algorithm still works if we say is_compressible is always false. And we should also verify what happens if you generate collation keys with the compressible data and without the compressible data. I'm not sure how the compression works, so I can't say. I think we should have a singleton data key for the compressibility of the bytes, and that's a data struct that contains an array of four u64s. If we fail to load that data struct, do we fail the construction, or do we fill in all zeros? If we really want ICU4X 2.x to ingest 2.0 data that doesn't have this, and construct successfully, we have to say to fill in all zeros. But I think it's better to say that if the data key doesn't resolve, you don't get to produce collation keys; that's probably the better way forward.
  • @sffc Ideally we would just include these 256 bits in the main data struct. We don't have a black-and-white rule on when we split things into different data structs and different types, but generally if something needs data that is very small (less than ~1 KB, and certainly anything less than ~100 B) and closely related to the functionality of the main type, it can just go onto the main type.
  • @hsivonen We could make a SpecialPrimariesV2 with the extra field.
  • @robertbastian We're allowed to add a new marker, and remove the old marker from the unstable constructor. Our plan for custom baked data and buffer data is to keep the old key around, but given that we only released a week ago, it's probably fine not to do that.
  • @hsivonen What zerovec type should I use? Do we have a fixed-length array?
  • @sffc You can use a VarZeroCow<(...)> if you want
  • @robertbastian I'd just use a ZeroVec<u8>
  • @hsivonen I'll use ZeroVec<u8>
  • @sffc Did we decide what to actually do?
  • @hsivonen I'll add a field compressible_bytes in SpecialPrimaries struct, and I update the marker name from V1 to V2.
  • @sffc Yeah, we should be strict with our semver policy, and the exception we're making is only in our "new code old data" policy, which I'm comfortable making.
  • @robertbastian Initialize compressible_bytes with the CLDR 47 data in datagen if icuexportdata doesn't contain the data; then, in the future, add it to icuexportdata. (A sketch of querying this representation follows below.)
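A hedged sketch of querying the ZeroVec<u8> representation settled on above (32 bytes holding 256 bits; the function name and bit order are illustrative):

use zerovec::ZeroVec;

// One bit per primary lead byte, packed 8 per byte into 32 bytes.
fn is_compressible(bits: &ZeroVec<u8>, lead_byte: u8) -> bool {
    bits.get((lead_byte >> 3) as usize)
        .map_or(false, |byte| (byte >> (lead_byte & 7)) & 1 != 0)
}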


rs-sac commented May 15, 2025

Because I had put a CHANGELOG entry into my patches, I had to rebase. I removed the entry because it sounded as if you folks would prefer to word it yourselves. The rest of the patches are the same. Then I added more commits on top, because I read that's the style you prefer here.

I believe I have addressed all of the feedback, including the comments from the meeting about the trait's name and fallibility. I left your comments open in case you wanted to only resolve them after you check the new patches.

Regarding upgrades, you might want to think about how to communicate potential sort key incompatibility to users. Should that be via semver? I added the requested dire warnings about regeneration, but it seems a little unreasonable if there's no guarantee at all about compatibility between versions X.Y.0 and X.Y.1.

@rs-sac rs-sac requested a review from hsivonen May 15, 2025 19:50
@srl295 srl295 self-requested a review May 15, 2025 22:22
hsivonen added a commit to hsivonen/icu4x that referenced this pull request May 16, 2025
As a side effect, always load CollationSpecialPrimaries, since we no longer
know at collator instantiation time if some of the data in the struct is
going to be used.

Preparation for unicode-org#6537
@hsivonen commented:

#6604 adds the compressibility data, hopefully correctly.


rs-sac commented May 16, 2025

I have addressed the additional review comments (other than compressible bytes).

@hsivonen commented:

I'll be unavailable tomorrow and for most of next week, so I'm marking this approved in order to unblock landing once #6604 has landed. There are two things remaining:

  1. Adding the query to compressible_bytes
  2. @srl295 's take on the specifics of marking the provenance as port from ICU4C

@hsivonen left a comment

LGTM pending the two things noted in the previous comment. Thank you!

Labels: C-collator (Component: Collation, normalization), discuss-priority (Discuss at the next ICU4X meeting)
Development

Successfully merging this pull request may close these issues.

Consider exposing sort keys
5 participants