-
Notifications
You must be signed in to change notification settings - Fork 299
feat: add bulk insertion to deletion vector #1578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add bulk insertion to deletion vector #1578
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dentiny for this pr, generally looks good!
crates/iceberg/src/delete_vector.rs
Outdated
pub fn append_positions(&mut self, positions: &[u64]) -> bool { | ||
let expected_num = positions.len(); | ||
let appended_num = self.inner.append(positions.iter().copied()).unwrap(); | ||
appended_num as usize == expected_num |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definition seems odd to me, which treats insertation with duplication as a failure. I would suggest to return nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment!
Yeah I think it's too specific for my use case (which requires no duplicates and ordering).
In the latest commit, I updated the return value to # of items inserted:
- It matches the interface for roaring bitmap
- For my own usage, I do want to assert no duplicate insertions
Let me know if you're fine with it, thank you for the quick review and constructive comments!
crates/iceberg/src/delete_vector.rs
Outdated
/// Precondition: The values of the iterator must be ordered and strictly greater than the greatest value in the set. | ||
/// If a value in the iterator doesn’t satisfy this requirement, it is not added and the append operation is stopped. | ||
#[allow(dead_code)] | ||
pub fn insert_positions(&mut self, positions: &[u64]) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some misunderstading here. What I mean is:
- If the inner
append
fail, we should also return this error. - If we succeeded, we should return the actual number inserted, rather the len of
positions
. The difference is, if theinner
already contains some value inpositions
, the returned length should substract the common part.
Please add detailed doc for the behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also please add some tests for the case where there are intersections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the inner append fail, we should also return this error.
I considered it:
- I met some problems to wrap it inside of iceberg error, for example, I don't want to add another error category for this particular error type and may lose information (successfully inserted rows)
- Returning the number of successfully inserted rows is a well-defined behavior, and it's always caller's responsibility to check whether row insertion works
Another consideration here is whether we want users to keep insertion upon failure:
- Returning an error, without inserted rows being specified, means deletion vector could be at a broken state, and users are not supposed to keep insertion;
- Returning the number of rows inserted indicates clearly which part gets inserted and which parts not, so it's actually legal to keep insertion after updating the input arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also please add some tests for the case where there are intersections.
I actually already have certain cases:
iceberg-rust/crates/iceberg/src/delete_vector.rs
Lines 176 to 181 in b297231
// Perform another bulk insertion after an incomplete one. | |
assert_eq!(dv.insert_positions(&[2, 4]), 0); | |
let mut collected: Vec<u64> = dv.iter().collect(); | |
collected.sort(); | |
assert_eq!(collected, expected_positions); |
Do you think that's enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the doc and impl to return Result
, which I wrap iceberg::Result
around the returned result;
let me know if you think it makes sense :)
Thank you so much for the explanation and quick review!
1b1127f
to
aefed79
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dentiny for this pr, LGTM! Just one nit about error handling.
Thank you so much for the careful review! I learnt a lot. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dentiny for this pr!
What changes are included in this PR?
In this PR, I added a bulk insertion API to deletion vector and roaring bitmap.
Context:
append
provides better perfAre these changes tested?
Yes, unit tests added.