perf: serialize updated_fragment_offsets as RoaringBitmap bytes in proto#7432

Open
jerryjch wants to merge 1 commit into lance-format:main from jerryjch:fix-for-issue-7080

@jerryjch

Contributor

Fixes: #7080

Summary

Follow-up to #6650. The updated_fragment_offsets field (proto field 9) stores per-fragment
matched row offsets as map<uint64, UInt32List> -- one uint32 per matched row. For dense
rewrites this produces multi-GB manifests (e.g. 86k matched rows x 4 bytes x many fragments).

This PR adds proto field 10 (map<uint64, bytes>) using portable RoaringBitmap serialization,
which typically compresses the same data to tens of bytes per fragment. Writers emit field 10
only; readers prefer field 10, falling back to field 9 for manifests written before this change.
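
The size argument above can be checked with a little arithmetic. The following is a minimal sketch, not code from this PR: `flat_list_bytes` uses the 4-bytes-per-offset cost model stated in the summary, and `run_count` is a hypothetical helper that counts runs of consecutive offsets as a rough proxy for how compressible a dense update is. It does not reproduce the actual RoaringBitmap portable layout.

```rust
/// Bytes used by the legacy flat list under the cost model stated above:
/// 4 bytes per matched row offset.
fn flat_list_bytes(matched_rows: u64) -> u64 {
    matched_rows * 4
}

/// Number of maximal runs of consecutive offsets in a sorted, deduplicated
/// slice. A dense rewrite collapses to very few runs, which is the property
/// a run-aware encoding can exploit. (Illustrative proxy only.)
fn run_count(sorted_offsets: &[u32]) -> usize {
    let mut runs = 0;
    let mut prev: Option<u32> = None;
    for &offset in sorted_offsets {
        // The current offset extends the previous run only if it is prev + 1.
        let extends_run = match prev {
            Some(p) => p.checked_add(1) == Some(offset),
            None => false,
        };
        if !extends_run {
            runs += 1;
        }
        prev = Some(offset);
    }
    runs
}
```

Under this model, 86,000 matched rows cost 344,000 bytes per fragment, and 10,000 such fragments (a hypothetical count) come to about 3.44 GB, while the offsets 0..86,000 form a single run.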

Background

PR #6650 added updated_fragment_offsets to the Update transaction message so that
build_manifest can partially refresh _row_last_updated_at_version for matched rows only.
The encoding choice -- one uint32 per offset in a UInt32List -- was flagged post-merge as a
size regression for dense updates. The offsets are already stored internally as RoaringBitmap;
this PR aligns the proto encoding with that representation.

Changes

protos/transaction.proto

  • Deprecate field 9 (map<uint64, UInt32List> updated_fragment_offsets) with a comment
    pointing to field 10.
  • Add field 10: map<uint64, bytes> updated_fragment_offset_bitmaps with documentation of
    the dual-read strategy.

rust/lance/src/dataset/transaction.rs

Serialization (From<&Transaction> for pb::Transaction):

  • Write field 10 only: RoaringBitmap::serialize_into produces portable bytes for each
    fragment's bitmap.
  • Leave field 9 as an empty HashMap. Old readers skip the unknown field 10 and see an
    empty field 9.

Deserialization (TryFrom<pb::Transaction> for Transaction):

  • If field 10 is non-empty: deserialize each entry with RoaringBitmap::deserialize_from.
  • Else if field 9 is non-empty: convert each UInt32List to RoaringBitmap::from_iter
    (legacy fallback).
  • Same if !new_field.is_empty() { ... } else { ... } pattern used by the existing
    Rewrite.groups / Rewrite.old_fragments migration.

Invalid field 10 bytes cause deserialization to fail with Error::invalid_input.

In-memory type unchanged: UpdatedFragmentOffsets(HashMap<u64, RoaringBitmap>).
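
The write-new, read-both strategy described above can be sketched as follows. This is a std-only illustration under stated assumptions, not the code in transaction.rs: `PbUpdate`, `write_offsets`, `read_offsets`, `encode`, and `decode` are hypothetical names, `BTreeSet<u32>` stands in for `RoaringBitmap`, a toy little-endian encoding stands in for the portable Roaring bytes, and a `String` error stands in for `Error::invalid_input`.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for RoaringBitmap so the sketch needs no external crate.
type Offsets = BTreeSet<u32>;

/// Hypothetical mirror of the two proto map fields.
#[derive(Default)]
struct PbUpdate {
    updated_fragment_offsets: HashMap<u64, Vec<u32>>,       // field 9 (deprecated)
    updated_fragment_offset_bitmaps: HashMap<u64, Vec<u8>>, // field 10
}

/// Toy encoding (4 little-endian bytes per offset); NOT the Roaring format.
fn encode(offsets: &Offsets) -> Vec<u8> {
    offsets.iter().flat_map(|o| o.to_le_bytes()).collect()
}

/// Rejects malformed bytes, mirroring the invalid-input failure path.
fn decode(bytes: &[u8]) -> Result<Offsets, String> {
    if bytes.len() % 4 != 0 {
        return Err(format!("invalid bitmap bytes: length {}", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

/// Writer: populate field 10 only and leave field 9 empty.
fn write_offsets(offsets: &HashMap<u64, Offsets>) -> PbUpdate {
    PbUpdate {
        updated_fragment_offsets: HashMap::new(),
        updated_fragment_offset_bitmaps: offsets
            .iter()
            .map(|(id, set)| (*id, encode(set)))
            .collect(),
    }
}

/// Reader: prefer field 10; fall back to field 9 only when field 10 is empty.
fn read_offsets(pb: &PbUpdate) -> Result<HashMap<u64, Offsets>, String> {
    if !pb.updated_fragment_offset_bitmaps.is_empty() {
        pb.updated_fragment_offset_bitmaps
            .iter()
            .map(|(id, bytes)| decode(bytes).map(|set| (*id, set)))
            .collect()
    } else {
        Ok(pb
            .updated_fragment_offsets
            .iter()
            .map(|(id, list)| (*id, list.iter().copied().collect::<Offsets>()))
            .collect())
    }
}
```

Because the branch is chosen by whether field 10 is empty, a message that carries both fields is read from field 10 alone, and a message that carries neither yields an empty map.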

Test plan

  • test_proto_round_trip_field_10 -- write a transaction with field 10, read back, verify
    offsets match for two fragments.
  • test_proto_legacy_field_9_read -- construct a proto with only field 9 populated
    (simulating an old writer), deserialize, verify offsets are correctly recovered.
  • test_proto_field_10_takes_precedence_over_field_9 -- when both fields are present,
    field 10 values are used and field 9 is ignored.

Proto wire format change; team vote may be needed.

Backward compatibility

  • Proto field numbers: field 9 is kept (deprecated, not removed). Field 10 is new. No field
    number reuse.
  • Old readers: ignore the unknown field 10 and read only field 9, which is now empty on
    new commits. Old Lance versions that deserialize commits written after this change
    therefore cannot recover offsets from the transaction blob. This affects only
    audit/readTransaction() on historical commits, not table data or OCC.
  • New readers: prefer field 10; fall back to field 9 for manifests written by Lance
    versions that predate this change.
  • No JNI or Java changes. The in-memory type (UpdatedFragmentOffsets) is unchanged; only
    the proto wire encoding changes.
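
The reader and writer combinations above can be summarized in a small model. This is an illustrative sketch with hypothetical names (`WireUpdate`, `Reader`, `recoverable_fragments`), not Lance code. It tracks only which fragment ids each kind of reader can recover offsets for, and ignores the contents of the payloads.

```rust
use std::collections::HashMap;

/// Hypothetical view of the two map fields as they appear on the wire.
struct WireUpdate {
    field_9_lists: HashMap<u64, Vec<u32>>,
    field_10_bitmaps: HashMap<u64, Vec<u8>>,
}

/// Old readers know only field 9; new readers know both fields.
#[derive(Clone, Copy)]
enum Reader {
    Old,
    New,
}

/// Fragment ids whose offsets the given reader can recover, sorted ascending.
fn recoverable_fragments(reader: Reader, msg: &WireUpdate) -> Vec<u64> {
    let mut ids: Vec<u64> = match reader {
        // Field 10 is unknown to old readers, so they skip it.
        Reader::Old => msg.field_9_lists.keys().copied().collect(),
        // New readers prefer field 10 whenever it is populated.
        Reader::New if !msg.field_10_bitmaps.is_empty() => {
            msg.field_10_bitmaps.keys().copied().collect()
        }
        // Legacy fallback for commits written before the change.
        Reader::New => msg.field_9_lists.keys().copied().collect(),
    };
    ids.sort_unstable();
    ids
}
```

In this model, an old reader given a commit that populates only field 10 recovers nothing, which matches the audit-only limitation noted above; every other combination recovers the full set of fragment ids.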

Independent of #6748 and lance-spark #528 (JNI wiring). No mutual merge dependencies.

@github-actions

Contributor

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions Bot added the A-format (On-disk format: protos and format spec docs) and performance labels on Jun 24, 2026
@codecov

codecov Bot commented Jun 24, 2026


Codecov Report

❌ Patch coverage is 93.84615% with 8 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance/src/dataset/transaction.rs | 93.84% | 6 Missing and 2 partials ⚠️

@jerryjch

Contributor Author

Hi @wjones127 @pengw0048, I created a PR for #7080. Can you check whether this is a good approach for the issue?


Development

Successfully merging this pull request may close these issues.

Operation::Update storing matched row offsets as flat repeated uint32 makes dense full-table column rewrites prohibitively large

1 participant