-
Notifications
You must be signed in to change notification settings - Fork 214
Port sort key code from ICU4C (#2689) #6537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @rs-sac, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request ports the sort key code from ICU4C to Rust, adding a write_sort_key
API to the Collator
. The code was largely translated directly from C++, with minor adjustments to align with Rust conventions and address Clippy recommendations. The author implemented a simpler approach for handling the identical level by appending UTF-8, differing from ICU4C's method. The author also added some simple tests.
Highlights
- Porting Sort Key Code: The primary focus is porting the sort key generation algorithm from ICU4C's C++ implementation to Rust, enhancing the
icu_collator
crate. - API Addition: A new
write_sort_key
API is added to theCollator
struct, enabling the generation of sort keys. - UTF-8 Normalization: The implementation handles identical level comparisons by appending UTF-8 NFD normalization, a simplification compared to ICU4C.
- Write Trait: A new
Write
trait is introduced to allow writing to buffer-like objects, since this crate does not have access tostd
.
Changelog
Click here to see the changelog
- CHANGELOG.md
- Added a changelog entry for the
icu_collator
crate, noting the addition of thewrite_sort_key
API.
- Added a changelog entry for the
- components/collator/src/comparison.rs
- Added
extern crate alloc
anduse alloc::vec::Vec
to enable dynamic memory allocation. - Introduced constants for special sort key bytes, level separators, and compression terminators.
- Defined an enum
Level
for specifying comparison levels. - Added the
write_sort_key
function to write sort key bytes up to the collator's strength. - Added the
write_sort_key_up_to_quaternary
function to write sort key bytes up to the collator's strength, with options to write the case level and stop writing levels based on a provided function. - Added the
Write
trait and implementations forVec<u8>
andSmallVec<[u8; N]>
. - Added the
SortKeyLevel
struct to manage sort key levels. - Refactored
compare_impl
to use helper functions for tailoring and numeric primary handling. - Added helper functions
tailoring_or_root
,numeric_primary
,collation_elements
, andvariable_top
to improve code organization and readability.
- Added
- components/collator/src/elements.rs
- Added
NO_CE_QUATERNARY
constant. - Added
ignorable
function toNonPrimary
struct.
- Added
- components/collator/src/options.rs
- No functional changes, only a minor comment update for clarity.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
From C++ to Rust,
A sort key finds new life,
Strings now find their place.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request ports the sort key code from ICU4C to Rust, which is a significant enhancement for the icu_collator
crate. The author has made a commendable effort to translate the C++ code while adapting it to Rust conventions and addressing Clippy's suggestions. The addition of tests and a line-by-line audit further contribute to the reliability of the code. Overall, this is a valuable contribution.
Summary of Findings
- Missing
compressibleBytes
andminLevel
: The author intentionally left outcompressibleBytes
andminLevel
from the C++ code. It's important to confirm whether these omissions are acceptable and won't affect the functionality or performance of the ported code. Further investigation may be required to determine if these components are necessary for the Rust implementation. - Potential performance bottlenecks: The code contains several loops and macro usages that could potentially lead to performance bottlenecks. While the author has made an effort to optimize the code, it's important to carefully analyze these sections and identify any areas where further optimization may be necessary. Consider using profiling tools to measure the performance of the code and identify any hotspots.
- Inconsistent naming: There are inconsistencies in the naming of variables and constants throughout the code. For example, some constants use snake_case while others use SCREAMING_SNAKE_CASE. It's important to establish a consistent naming convention and apply it throughout the code to improve readability and maintainability.
Merge Readiness
The pull request introduces a significant new feature and includes tests, which is a positive sign. However, the compressibleBytes
and minLevel
omissions need to be addressed, and the potential performance bottlenecks should be investigated. I am unable to approve this pull request, and recommend that the pull request not be merged until the high severity issues are addressed. Other reviewers should also review and approve this code before merging.
I see that most of the test failures boil down to my calling |
(I don't tend to review collator stuff) |
Thank you for the PR! I acknowledge that this is my area to review, but I couldn't get to it today. |
Thanks for the contribution! A few things:
|
I think in this case this is not the advice to follow: this code is deep within the implementation; it needs to use the decomposition data that is available within the Collator/CollatorBorrowed structs, and if it needs more then these structs should load more during data loading. I don't have time to really go deeper into how it looks, but other people here should be able to. |
I assume it's code like this which is causing the error: impl CollatorBorrowed<'_> {
/// Write the sort key bytes up to the collator's strength.
///
/// For identical strength, the UTF-8 NFD normalization is appended for breaking ties.
/// The key ends with a `TERMINATOR_BYTE`.
pub fn write_sort_key<S>(&self, s: &str, sink: &mut S)
where
S: Write,
{
self.write_sort_key_up_to_quaternary(s.chars(), sink, |_| true);
if self.options.strength() == Strength::Identical {
let nfd = icu_normalizer::DecomposingNormalizerBorrowed::new_nfd();
let s = nfd.normalize(s);
sink.write_byte(LEVEL_SEPARATOR_BYTE);
sink.write(s.as_bytes());
}
sink.write(&[TERMINATOR_BYTE]);
}
} If you need data beyond what is in the Collator, then you might need to load that data in the Collator constructor, or else create a new struct. However, I think the data you need might already be there. |
Thanks for the pointers. I think I have fixed the problem, though I await output from the official tests. |
Rather than add to the temporary diplomat exclusion file, I added my stuff to the more documented allow list. I'm not an expert on diplomat, but I'm pretty sure it doesn't make sense to expose the internal use constructor I created. OTOH, ideally the sort key code would be exposed, but AFAICT as a non-expert, that can't happen due to using traits. (Since I don't know what reviewers are necessary, I didn't add or change any reviewers, even though this web site says I did. It's completely making that up.) |
The data is indeed already there. The collator already creates It looks like a |
|
||
fn collation_elements<'a, I>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should optimize to not causing a move of CollationElements
, but rerunning benches is needed to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a regression. Inlining wasn't enough to fix it; I had to macroize it.
Ah, nevermind, it's already in the changeset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks great, but I have some change requests inline.
I think the biggest concern is planning correctly around the compressibility table that ICU4X currently doesn't have.
I intend to discuss that aspect with other ICU4X developers on Thursday.
// primary level but terminate compression on all levels and then exit the loop. | ||
if p > NO_CE_PRIMARY && levels & PRIMARY_LEVEL_FLAG != 0 { | ||
// Test the un-reordered primary for compressibility. | ||
let is_compressible = false; // TODO? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Indeed, we don't have data for this. I'm assuming that for comparison outcomes, it's OK to treat all primaries as uncompressible, but it would be good to get confirmation from @markusicu .
If we decide to add data for compressibility, such an addition would have inconvenient implications on whether the sort key functionality should then move to a struct other than CollatorBorrowed
that loads more data than CollatorBorrowed
. If we allowed the creation of collation key with or without compressibility data, we'd end up with two kinds of mutually-incompatible collation keys within a given ICU4X release unless the design is such that comparing compressed with uncompressed works. @markusicu , is the design such that it's OK to compare a collation key created with some primary compression with a collation key created with all primaries treated as uncompressible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've thought about this more. Here's what I think we should (not necessarily in this PR but before releasing without a non-experimental flag):
- Make icuexportdata export 256 bits of
compressibleBytes
data for the root. (AFAICT, only the root has these bits.) Perhaps export this as an array of 256 bytes each of which is 1 or 0 without packing into an array of four 64-bit integers on the ICU4C side and leave the packing to datagen. - Introduce
CollationKeyGeneratorBorrowed
andCollationKeyGenerator
that hold aCollatorBorrowed
plus& [u64; 4]
andCollator
plus[u64; 4]
, respectively. - Introduce a singleton data key for compressibly bytes, which is a table of 256 bits represented as
[u64; 4]
. - Put the public entry points for collation key generation on
CollationKeyGeneratorBorrowed
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- and
Collator
plus[u64; 4]
,
With the [u64; 4] inside a data struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the top-level comment with minutes from the meeting. Let's proceed without creating a separate CollationKeyGeneratorBorrowed
struct.
I'll create a PR to update CollationSpecialPrimaries
with a new field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR is #6604
self.write(&[b]); | ||
} | ||
|
||
/// Write leading bytes up to a zero byte, but always write at least one byte. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT, this is always called with an array of three bytes whose first byte is known to be non-zero, so do we need to check here for writing at least one byte?
Also, wouldn't it be better to have this as a helper function outside the trait so that trait implementors don't get to override this method with a broken implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there's one call site with at most 3 bytes and it already checks the first byte for zero, it seems to me that nested if
s to check the other two bytes and writing each using a per-byte call would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
/// Write the sort key bytes up to the collator's strength. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before release, this needs documentation covering at least:
- A non-key comparison with the collator typically entails a single primary-level comparison after code unit-level identical prefix skipping, so it takes a lot of comparisons to amortize the performance cost of generating collation keys. Applications should not use collation keys as a performance optimization without measuring that the performance cost of generating the collation keys is actually amortized.
- The stored collation keys should be presumed to become invalidated by a CLDR update, a Unicode Database update (i.e. new version of Unicode), or an update to the ICU4X code. Applications must, therefore, be prepared to recompute all collation keys and take the performance cost of doing so into account when considering whether the performance cost of collation keys is amortized.
- It has happened more than once in the past with other collator implementations that a database has stored collation keys and then the database has grown so large that regenerating the collation keys has become too expensive, which has then resulted in engineering effort to use a frozen copy of collation key generation algorithm and the associated data while making other updates to software. Users of ICU4X should not expect ICU4X to accommodate this kind of need if a user of ICU4X gets into this situation despite this warning.
Adding the warnings can be a follow-up, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
A couple of things that aren't really review comments but more general discussion points:
|
As a newbie, I don't have a strong opinion. I took the name suggested in the original issue.
I had the same thoughts. It appeared to me that while the C++ interface supports errors, they were all related to OOM, which is just a crash in Rust (lame IMO, but nobody asked my opinion on that point). Since I didn't see any plausible scenario wherein an error would be returned, I skipped the error business, thinking that if I'd overlooked anything, it would be pointed out in review. Thanks for the rest of the comments. It will take me a little while to work through them all. |
From the meeting today:
|
Due to my putting a CHANGELOG entry into my patches, I had to rebase. I removed the entry because it sounded as if you folks would probably prefer to word it yourselves. The rest of the patches are the same. Then I added more commits on top because I read that's the style you prefer here. I believe I have addressed all of the feedback, including the comments from the meeting about the trait's name and fallibility. I left your comments open in case you wanted to only resolve them after you check the new patches. Regarding upgrades, you might want to think about how to communicate potential sort key incompatibility to users. Should that be via semver? I added the requested dire warnings about regeneration, but it seems a little unreasonable if there's no guarantee at all about compatibility between versions X.Y.0 and X.Y.1. |
As a side effect, always load CollationSpecialPrimaries, since we no longer know at collator instantiation time if some of the data in the struct is going to be used. Preparation for unicode-org#6537
#6604 adds the compressibility data, hopefully correctly. |
I have addressed the additional review comments (other than compressible bytes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending the two things noted in the previous comment. Thank you!
I mostly translate the C++ code straight across, but I made some minor changes to match the mindset of the Rust codebase, to match Rust practice, to satisfy Clippy, and to reduce copy-and-paste.
A comment on the issue suggested handling the identical level by appending UTF-8. That's not what ICU4C does, but it's a whole lot simpler, so that's what I did.
At the end, some simple tests are included.
Fixes #2689