-
Notifications
You must be signed in to change notification settings - Fork 214
Add compressible_bytes data #6604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
As a side effect, always load CollationSpecialPrimaries, since we no longer know at collator instantiation time if some of the data in the struct is going to be used. Preparation for unicode-org#6537
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback. |
Deferring to @robertbastian for what to do about the crates.io-dependent test-tutorials task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some TODOs in the code, which would presumably be resolved when unicode-org/icu/pull/3495 lands
let field = self.arr[usize::from(b >> 3)]; | ||
let mask = 1 << (b & 0b111); | ||
(field & mask) != 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to correctly reverse the compression mapping in unicode-org/icu/pull/3495
components/collator/src/provider.rs
Outdated
#[cfg_attr(feature = "serde", serde(borrow))] | ||
pub compressible_bytes: ZeroVec<'data, u8>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion (optional): Since this struct now contains 2 fixed-length ZeroVecs, it would be slightly more efficient to represent them as a single ZeroVec, or possibly a MultiFieldsULE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other ZeroVec
isn't logically fixed-length for all time, but it has 4 items for now and has had 4 items for a long time.
Also, the other one has u16
s, and I felt uneasy about introducing bugs to the old thing if I tweak it in a hurry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that a ZeroVec is sized at 3 × usize
, which isn't much more efficient than just using a [u8; 32]
here.
If you're trying to reduce stack size via indirection, perhaps we could go further and amortize costs? This is essentially the same thing as Shane's suggestion around collapsing the zerovecs, under the hood, but done in a way that allows you to expand last_primaries
and also avoids funky indexing/bit math.
#[make_varule(CollationSpecialPrimariesInternalULE)]
struct CollationSpecialPrimariesInternal<'data> {
numeric_primary: u8,
bytes: [u8; 32],
last_primaries: ZeroVec<'data, u16>,
}
struct CollationSpecialPrimaries {
internal: VarZeroCow<'data, CollationSpecialPrimariesInternalULE>
}
The current working of make_varule
will lead to CollationSpecialPrimariesInternalULE
having a .bytes()
method that returns a stack copy of bytes
. We don't actually want that here, it's a big type. A 4-register copy isn't a huge deal, but worth avoiding.
I have a couple ideas for designs for this, but broadly speaking this would be a good time to introduce some further control over these generated getters. A simple thing would be to add a #[zerovec::ref_getter]
attribute to bytes
that makes it return a reference instead. We should put a bit of thought into what the full extent of configurability looks like here, so we can design the attribute syntax holistically (zerovec::getter(ref)
? Also should we be doing something special for AsULE<ULE = Self>
types?) but
or possibly a MultiFieldsULE
We're not putting this in a VarZeroVec, so a VarULE type isn't that useful on its own. It can't be stuck flat into the struct above. Perhaps paired with a VarZeroCow, but the VarZeroCow has a similar stack cost as a VarZeroVec; we only really gain from it if we collapse the entire type as I sketch out above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which isn't much more efficient than just using a
[u8; 32]
here.
Are we allowed to use [u8; 32]
directly in a data struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as it's Copy
. The reason not to use it would be stack size, not sure how relevant this is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: if you can get the compressible bytes to fit in [u8; 31]
instead of [u8; 32]
, then it can fit into the padding of the CollationSpecialPrimaries
and the stack size is equal to what it would be if you added another ZeroVec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I don't know enough of what guarantees FractionalUCA/genuca make to know whether [u8; 31]
would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as it's
Copy
. The reason not to use it would be stack size, not sure how relevant this is here.
I didn't know that. I've refreshed the PR.
Stack size how? The data struct itself is still behind a reference, right?
Stack size might be relevant for the jamo table. The obvious follow-up question is: should the jamo table have been a direct array instead of ZeroVec
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the borrowed version it's behind a reference. In the owned version, the DataPayload
contains the data struct (plus an RC to the buffer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I don't know enough of what guarantees FractionalUCA/genuca make to know whether
[u8; 31]
would work.
Currently, the last item is non-zero and first item is zero.
The TODOs in datagen can be fixed independently of anything else. I didn't add datagen-level length enforcement here, because we don't have it for the jamo table. Also, datagen-level length enforcement isn't a must-have, because we have run-time length enforcement. |
So the reason why this didn't get merged on Friday is that we ourselves depend on semver-correctness, so perhaps we should go back to my initial suggestion and do this by the book instead of trying a semver-violation in an effort to reduce the total number of structs. I need to focus on other things this week, but the following would be by the book and surely landable from a semver standpoint:
@sffc @robertbastian @Manishearth , what are your thoughts? |
We don't really. The argument that it's unlikely that anyone is using 2.0 custom baked data already still holds. The CI check is pretty much a semver check, so it's fine to disable it for this push. |
@sffc, if you re-approve, please also land this. |
ad54f4b
to
ee149c0
Compare
Additional ICU4X WG discussion with @sffc @Manishearth @robertbastian Potential solutions:
Conclusion:
LGTM: @sffc @robertbastian @Manishearth To decide later:
|
The current (Also, I really need to work on other stuff this week. Sorry.) |
You can just add the elements to the end of the |
I don't think that example is accurate. Presumably we start with a Ideally the compressible bits should be packed in |
The resulting bytes stored in the ZeroVec are the same no matter how you construct it, whether you do the from-bytes thing or if you pack the bits into u16s (which is the same since we store the bytes in little-endian order). I don't really care which exact ZeroVec APIs are used to build and destructure the thing. |
Don't disagree. I do disagree which level of abstraction we should use, and raw bytes probably isn't it.
Yes, but the next person to touch this code should not need to know this. |
I will note that the little endian thing is not ZeroVec internals; it is a documented invariant that we expose via various APIs. However, we do not expose good construction APIs for this. If Henri wishes to use u8s I think the thing to do is:
This doesn't require knowledge of endianness, this just requires knowledge of ZeroVec's serialization stability. |
Or we just use the zerovec as a vec and don't mess with the representation at all. The data here is bits, packing them into
There'd be a handful of people who can confidently review this, and who can confidently make changes to this in the future. I really don't see the point. |
To be clear, I'm not disagreeing with your proposed route, but if Henri wishes to use u8s, there is a path that uses documented APIs.
It's not always bits: He gets the data from ICU4C as u8s, so anything switching to u16s requires some part of the pipeline to think about endianness. ZeroVec is our go-to library for not having to think about endianness. I don't think either proposed solution avoids having a mildly-tricky commented section of the code where we describe a u8 to u16 conversion. |
Would it help if the ICU data was returned as an array of bools instead of packed u8s? I had suggested that in unicode-org/icu#3495. (It hasn't been released yet) |
Yes. Otherwise the code I'd like to see would parse the packed ICU data into a sensible data type in datagen (i.e. |
}; | ||
// Baked data without compressible bits, but not matching hardcoded data | ||
return Err( | ||
DataError::custom("cannot fall back to hardcoded compressible data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #6634 about the usefulness of errors like this as error results as opposed to panics. If this occurs, the app has been compiled in a bogus state, which isn't really a run-time-actionable thing.
@@ -537,18 +556,68 @@ pub struct CollationSpecialPrimaries<'data> { | |||
/// character classes packed so that each fits in | |||
/// 16 bits. Length must match the number of enum | |||
/// variants in `MaxVariable`, currently 4. | |||
/// | |||
/// This is potentially followed by 256 bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably make sense to say that "potentially" is only about the case of using icu_collator_data
2.0.0 and in all other cases, it's a data generation bug for the extra data not to be there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is now for practical purposes authored by @robertbastian , but since GitHub treats me as the PR creator, GitHub won't let me approve.
LGTM with the inline nit, though.
As a side effect, always load CollationSpecialPrimaries, since we no longer know at collator instantiation time if some of the data in the struct is going to be used.
Preparation for #6537