Skip to content

feat: add bulk insertion to deletion vector #1578

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

dentiny
Copy link
Contributor

@dentiny dentiny commented Aug 3, 2025

What changes are included in this PR?

In this PR, I added a bulk insertion API to deletion vector and roaring bitmap.

Context:

Are these changes tested?

Yes, unit tests added.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dentiny for this pr, generally looks good!

pub fn append_positions(&mut self, positions: &[u64]) -> bool {
let expected_num = positions.len();
let appended_num = self.inner.append(positions.iter().copied()).unwrap();
appended_num as usize == expected_num
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definition seems odd to me, which treats insertation with duplication as a failure. I would suggest to return nothing.

Copy link
Contributor Author

@dentiny dentiny Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment!
Yeah I think it's too specific for my use case (which requires no duplicates and ordering).

In the latest commit, I updated the return value to # of items inserted:

  • It matches the interface for roaring bitmap
  • For my own usage, I do want to assert no duplicate insertions

Let me know if you're fine with it, thank you for the quick review and constructive comments!

@dentiny dentiny requested a review from liurenjie1024 August 4, 2025 17:28
@dentiny dentiny requested a review from liurenjie1024 August 6, 2025 09:25
/// Precondition: The values of the iterator must be ordered and strictly greater than the greatest value in the set.
/// If a value in the iterator doesn’t satisfy this requirement, it is not added and the append operation is stopped.
#[allow(dead_code)]
pub fn insert_positions(&mut self, positions: &[u64]) -> u64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some misunderstading here. What I mean is:

  1. If the inner append fail, we should also return this error.
  2. If we succeeded, we should return the actual number inserted, rather the len of positions. The difference is, if the inner already contains some value in positions, the returned length should substract the common part.

Please add detailed doc for the behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please add some tests for the case where there are intersections.

Copy link
Contributor Author

@dentiny dentiny Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the inner append fail, we should also return this error.

I considered it:

  • I met some problems to wrap it inside of iceberg error, for example, I don't want to add another error category for this particular error type and may lose information (successfully inserted rows)
  • Returning the number of successfully inserted rows is a well-defined behavior, and it's always caller's responsibility to check whether row insertion works

Another consideration here is whether we want users to keep insertion upon failure:

  • Returning an error, without inserted rows being specified, means deletion vector could be at a broken state, and users are not supposed to keep insertion;
  • Returning the number of rows inserted indicates clearly which part gets inserted and which parts not, so it's actually legal to keep insertion after updating the input arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please add some tests for the case where there are intersections.

I actually already have certain cases:

// Perform another bulk insertion after an incomplete one.
assert_eq!(dv.insert_positions(&[2, 4]), 0);
let mut collected: Vec<u64> = dv.iter().collect();
collected.sort();
assert_eq!(collected, expected_positions);

Do you think that's enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the doc and impl to return Result, which I wrap iceberg::Result around the returned result;
let me know if you think it makes sense :)
Thank you so much for the explanation and quick review!

@dentiny dentiny force-pushed the hjiang/bulk-insertion-deletion-vector branch from 1b1127f to aefed79 Compare August 6, 2025 18:37
@dentiny dentiny requested a review from liurenjie1024 August 6, 2025 18:38
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dentiny for this pr, LGTM! Just one nit about error handling.

@dentiny
Copy link
Contributor Author

dentiny commented Aug 7, 2025

Just one nit about error handling.

Thank you so much for the careful review! I learnt a lot.

@dentiny dentiny requested a review from liurenjie1024 August 7, 2025 14:43
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dentiny for this pr!

@liurenjie1024 liurenjie1024 merged commit 50cd2ab into apache:main Aug 8, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants