perf: serialize updated_fragment_offsets as RoaringBitmap bytes in proto#7432
Open
jerryjch wants to merge 1 commit into
Open
perf: serialize updated_fragment_offsets as RoaringBitmap bytes in proto#7432jerryjch wants to merge 1 commit into
jerryjch wants to merge 1 commit into
Conversation
Contributor
|
Important This PR touches the Lance format specification. Substantive changes to the format specification — the If this is a meaningful format change:
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Contributor
Author
|
Hi, @wjones127 @pengw0048 I created a PR for #7080. Can you check this is a good approach for the issue? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes: #7080
Summary
Follow-up to #6650. The
updated_fragment_offsetsfield (proto field 9) stores per-fragmentmatched row offsets as
map<uint64, UInt32List>-- one uint32 per matched row. For denserewrites this produces multi-GB manifests (e.g. 86k matched rows x 4 bytes x many fragments).
This PR adds proto field 10 (
map<uint64, bytes>) using portable RoaringBitmap serialization,which typically compresses the same data to tens of bytes per fragment. Writers emit field 10
only; readers prefer field 10, falling back to field 9 for manifests written before this change.
Background
PR #6650 added
updated_fragment_offsetsto theUpdatetransaction message so thatbuild_manifestcan partially refresh_row_last_updated_at_versionfor matched rows only.The encoding choice -- one uint32 per offset in a
UInt32List-- was flagged post-merge as asize regression for dense updates. The offsets are already stored internally as
RoaringBitmap;this PR aligns the proto encoding with that representation.
Changes
protos/transaction.proto
map<uint64, UInt32List> updated_fragment_offsets) with a commentpointing to field 10.
map<uint64, bytes> updated_fragment_offset_bitmapswith documentation ofthe dual-read strategy.
rust/lance/src/dataset/transaction.rs
Serialization (
From<&Transaction> for pb::Transaction):RoaringBitmap::serialize_intoproduces portable bytes for eachfragment's bitmap.
HashMap(forward compat; old readers ignore unknown fields).Deserialization (
TryFrom<pb::Transaction> for Transaction):RoaringBitmap::deserialize_from.UInt32ListtoRoaringBitmap::from_iter(legacy fallback).
if !new_field.is_empty() { ... } else { ... }pattern used by the existingRewrite.groups/Rewrite.old_fragmentsmigration.Invalid field 10 bytes fail deserialize with
Error::invalid_input.In-memory type unchanged:
UpdatedFragmentOffsets(HashMap<u64, RoaringBitmap>).Test plan
test_proto_round_trip_field_10-- write a transaction with field 10, read back, verifyoffsets match for two fragments.
test_proto_legacy_field_9_read-- construct a proto with only field 9 populated(simulating an old writer), deserialize, verify offsets are correctly recovered.
test_proto_field_10_takes_precedence_over_field_9-- when both fields are present,field 10 values are used and field 9 is ignored.
Proto wire format change; team vote may be needed.
Backward compatibility
number reuse.
commits. Old Lance versions deserializing commits written by this PR will not recover
offsets from the txn blob; that only affects audit/
readTransaction()on historicalcommits, not table data or OCC.
versions that predate this change.
UpdatedFragmentOffsets) is unchanged; onlythe proto wire encoding changes.
Independent of #6748 and lance-spark #528 (JNI wiring). No mutual merge dependencies.