feat: resolve data overlay files on the take (and scan) read path #7409
Draft
wjones127 wants to merge 8 commits into
Add a specification for data overlay files: small files attached to a fragment that supply new values for a subset of (row offset, field) cells without rewriting the base data files, for cheap cell-level updates.
- protos/table.proto: rework DataOverlayFile with a dense/sparse coverage oneof (shared_offset_bitmap vs new FieldCoverage), rename read_version to committed_version (effective, commit-stamped), and document rank-based addressing with no offset column. Document reader feature flag 64.
- docs: add data_overlay_file.md (full spec, worked example, guidance stub) and link it from the table format overview.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the `DataOverlay` operation (and `DataOverlayGroup`) to attach overlay files to fragments without rewriting their base data. Mirrors the `DataReplacement` batch shape, appends to each fragment's `overlays` list, and documents permissive conflict semantics: concurrent overlays, appends, deletes, and column rewrites are compatible; row-rewrites, compaction, and overlay->base folds conflict. committed_version is left 0 by the writer and stamped at commit time. Proto only — Rust/Python bindings deferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
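The conflict semantics described in this commit can be read as a small decision table. The following is a minimal sketch of that table only, not the transaction code: `ConcurrentOp` and `overlay_conflicts_with` are hypothetical names, and scoping conflicts to the same fragment is an assumption taken from the later commit message that mentions "compaction of the same fragment".

```rust
/// Hypothetical classification of an operation committed concurrently
/// with a `DataOverlay`. Variant names are illustrative, not the real enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConcurrentOp {
    Overlay,
    Append,
    Delete,
    ColumnRewrite,
    RowRewrite,
    Compaction,
    OverlayFold,
}

/// Returns true when a `DataOverlay` must be treated as conflicting with
/// `other`. Assumption: conflicts only matter when both operations touch
/// the same fragment.
pub fn overlay_conflicts_with(other: ConcurrentOp, same_fragment: bool) -> bool {
    use ConcurrentOp::*;
    match other {
        // Compatible: none of these change which physical offset a row has.
        Overlay | Append | Delete | ColumnRewrite => false,
        // Conflicting: these rewrite rows, so overlay offsets would go stale.
        RowRewrite | Compaction | OverlayFold => same_fragment,
    }
}
```

The split follows from offset stability: an overlay addresses cells by physical row offset, so only operations that move or rewrite rows can invalidate it.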
The table/transaction proto changes generate new fields and an Operation variant. This wires the minimum needed to compile without implementing overlay support:
- Emit empty `overlays` when converting fragments to proto.
- Reject the `DataOverlay` transaction operation with NotSupported on read.
Datasets that use overlays set reader feature flag 64, which already falls in the unknown-flag range rejected by `can_read_dataset`, so the library refuses them at the feature-flag layer.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
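To illustrate why an older reader refuses an overlay dataset without any overlay-specific code, here is a simplified sketch of feature-flag gating. It assumes flag 64 is a bit value in a `u64` bitmask and uses a hypothetical `can_read` helper with an explicit mask of known flags; the real `can_read_dataset` may express the check differently (for example as a range comparison).

```rust
/// Reader feature flag for data overlay files, assumed here to be the
/// bit value 64 in a `u64` bitmask.
pub const FLAG_DATA_OVERLAY_FILES: u64 = 64;

/// Hypothetical gate: a reader may open a dataset only if every flag the
/// dataset sets is one the reader knows about.
pub fn can_read(reader_flags: u64, known_mask: u64) -> bool {
    // Any bit outside `known_mask` is an unknown feature, so refuse.
    reader_flags & !known_mask == 0
}
```

With a mask that only covers flags below 64, a dataset that sets `FLAG_DATA_OVERLAY_FILES` is rejected; once the flag is added to the mask, the same dataset is accepted.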
Adds the in-memory + commit machinery for data overlay files (per the spec in lance-format#7381), the foundation the scanner/take/index/compaction work builds on.
- `DataOverlayFile` / `OverlayCoverage` (dense `shared_offset_bitmap` and sparse per-field) with protobuf round-trip, attached to `Fragment.overlays`.
- Reader feature flag 64 (`FLAG_DATA_OVERLAY_FILES`): set whenever any fragment carries overlays, so a reader that does not understand them refuses the dataset instead of returning stale base values.
- `Operation::DataOverlay` transaction op: appends overlays to a fragment's list (preserving concurrently-written overlays) and stamps each overlay's `committed_version` to the new dataset version at commit time (re-stamped on retry). Conflict rules mirror DataReplacement: permissive against appends, deletes, column rewrites, index builds, and other overlays; conflicts only with row-rewriting compaction of the same fragment.
Scan-side merge, take, and end-to-end write+read tests follow in the same PR branch. Part of the Data Overlay Files feature (OSS-1322).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
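The append-and-stamp behaviour of the commit step can be sketched as follows. The types `OverlayRef` and `FragmentMeta` and the function `apply_overlay_commit` are hypothetical stand-ins rather than the real manifest structures; the sketch only shows that existing overlays are preserved and that the writer-supplied `committed_version` is overwritten at commit time.

```rust
/// Hypothetical stand-in for an overlay entry in a fragment's metadata.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OverlayRef {
    pub path: String,
    /// Left as 0 by the writer; set when the commit succeeds.
    pub committed_version: u64,
}

/// Hypothetical stand-in for fragment metadata.
#[derive(Debug, Default)]
pub struct FragmentMeta {
    pub id: u64,
    pub overlays: Vec<OverlayRef>,
}

/// Appends `new_overlays` to the fragment's list and stamps each one with
/// the dataset version being committed. Overlays already in the list
/// (including ones written concurrently) are left untouched.
pub fn apply_overlay_commit(
    fragment: &mut FragmentMeta,
    new_overlays: &[OverlayRef],
    new_version: u64,
) {
    for overlay in new_overlays {
        let mut stamped = overlay.clone();
        stamped.committed_version = new_version;
        fragment.overlays.push(stamped);
    }
}
```

On a retry, the same function would be applied to the latest fragment state with the next version number, which is one way to read "re-stamped on retry".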
Now that overlays can be committed, a scan or take over a fragment that has overlays would silently return stale base values, since the read-path merge is not implemented yet. Refuse such reads at `FileFragment::open` with a clear error instead of serving incorrect data. Lifted once the scan/take merge lands (rest of OSS-1322 / OSS-1324). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds `dataset::overlay`, the tested heart of reading overlays: given a base column for a physical row range and the overlays covering a field (newest first), `resolve_overlay_column` produces the merged column. An offset is resolved to the newest covering overlay's value at the offset's rank in the coverage bitmap; an uncovered offset falls through to the base; a covered offset whose value is NULL overrides the cell to NULL. `overlay_indices_newest_first` orders a fragment's overlays by `committed_version` then list position. Deletion precedence needs no handling here: the merge runs before the deletion filter, so an overlay value for a deleted offset is computed and dropped with the row. Wiring this into the scan stream and `take` follows on this branch. Unit tests cover rank addressing, multi-overlay precedence, NULL override vs. fall-through, physical-offset base, string columns, and ordering. Part of OSS-1322. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
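The resolution rule above (newest covering overlay wins, rank-based addressing, NULL override versus fall-through) can be sketched without Arrow. This is an illustration under stated assumptions, not the actual `resolve_overlay_column`: `FieldOverlay` and `resolve_column` are hypothetical, the coverage bitmap is modelled as a sorted `Vec<u64>` so that the binary-search index equals the offset's rank, and NULL is modelled as `None`.

```rust
/// Hypothetical in-memory overlay for a single field.
pub struct FieldOverlay<T> {
    /// Sorted, de-duplicated physical row offsets covered by this overlay
    /// (stands in for the coverage bitmap).
    pub covered: Vec<u64>,
    /// One value per covered offset, in offset order; `None` means NULL.
    pub values: Vec<Option<T>>,
}

impl<T: Clone> FieldOverlay<T> {
    /// Outer `None`: offset not covered. `Some(None)`: covered with NULL.
    fn lookup(&self, offset: u64) -> Option<Option<T>> {
        // The index of `offset` in the sorted list is its rank in the
        // coverage set, which is also its position in `values`.
        self.covered
            .binary_search(&offset)
            .ok()
            .map(|rank| self.values[rank].clone())
    }
}

/// Merges base values with overlays. `offsets[i]` is the physical offset
/// of `base[i]`; `overlays_newest_first` must already be ordered.
pub fn resolve_column<T: Clone>(
    offsets: &[u64],
    base: &[Option<T>],
    overlays_newest_first: &[FieldOverlay<T>],
) -> Vec<Option<T>> {
    offsets
        .iter()
        .zip(base)
        .map(|(&offset, base_value)| {
            overlays_newest_first
                .iter()
                // First overlay that covers the offset wins, even if NULL.
                .find_map(|overlay| overlay.lookup(offset))
                // No overlay covers it: fall through to the base value.
                .unwrap_or_else(|| base_value.clone())
        })
        .collect()
}
```

Keeping "not covered" and "covered with NULL" as distinct cases (`None` versus `Some(None)`) is what lets an overlay set a cell to NULL instead of silently falling back to the base value.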
The v2 file writer advanced every column from a single global row counter, so a single file could only hold columns of equal length. Sparse data overlay files need columns whose item counts differ within one file (each field covers a different set of rows). Add `FileWriter::write_columns`, which writes a set of `(field, array)` pairs and advances each field's row counter independently, leaving other fields untouched. A field never written ends up as a zero-length column. `write_batch` is unchanged: it still advances all fields together, so ordinary rectangular files round-trip exactly as before. Per-column lengths were already derivable from page metadata; expose them via `FileReader::column_num_rows`. The reader already schedules each column from its own pages, so reading a column at its own length and random access within it work without further changes. Part of the Data Overlay Files feature (OSS-1323). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
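The difference between a single global row counter and per-field counters can be shown with a toy model that tracks only row counts. `ToyColumnWriter` is hypothetical and writes no data; it borrows the method names `write_batch`, `write_columns`, and `column_num_rows` from the commit message purely for readability, and its signatures are not those of the real `FileWriter` or `FileReader`.

```rust
use std::collections::BTreeMap;

/// Toy model of a writer that keeps one row counter per field id.
#[derive(Debug, Default)]
pub struct ToyColumnWriter {
    rows_written: BTreeMap<u32, u64>,
}

impl ToyColumnWriter {
    /// Registers every field with a zero-length column.
    pub fn new(field_ids: &[u32]) -> Self {
        Self {
            rows_written: field_ids.iter().map(|&id| (id, 0)).collect(),
        }
    }

    /// Rectangular write: every field advances by the same row count.
    pub fn write_batch(&mut self, num_rows: u64) {
        for count in self.rows_written.values_mut() {
            *count += num_rows;
        }
    }

    /// Sparse write: only the listed `(field_id, num_items)` pairs advance.
    pub fn write_columns(&mut self, columns: &[(u32, u64)]) -> Result<(), String> {
        for &(field_id, num_items) in columns {
            let count = self
                .rows_written
                .get_mut(&field_id)
                .ok_or_else(|| format!("unknown field id {}", field_id))?;
            *count += num_items;
        }
        Ok(())
    }

    /// Per-column length; `None` if the field was never registered.
    pub fn column_num_rows(&self, field_id: u32) -> Option<u64> {
        self.rows_written.get(&field_id).copied()
    }
}
```

A field that is registered but never passed to `write_columns` keeps a count of zero, which corresponds to the zero-length column mentioned above.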
Wire the overlay cell-resolution core into reads so `take` (and scan) return merged values. `FileFragment::open` loads each projected field's overlay value columns newest-first, and `new_read_impl`/`read_ranges` merge them into base batches by physical offset before deletion filtering — so resolution is identical for take and scan, NULL overrides apply, deletions take precedence (an overlay value for a deleted row is dropped with the row), and fields resolve independently. The merge addresses each row by its physical offset (via `ReadBatchParams::to_offsets_total`), so the resolution core now takes explicit per-row offsets instead of a contiguous start — a single code path for the contiguous scan range and arbitrary take indices. Sparse per-field overlays read each field's value column independently, so unequal-length columns (OSS-1323) are handled without materializing a rectangular batch. Removes the temporary "overlays not supported" guard from OSS-1322. Tests: take of covered/uncovered offsets, multiple overlays (newest-wins), per-field coverage with unequal-length columns, NULL override, overlay on a deleted row (inert), and multi-fragment scan — all over v2.0 and v2.1 files. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
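The ordering of the two steps (merge by physical offset first, deletion filter second) is what makes an overlay value for a deleted row inert. A minimal sketch, assuming the overlays for one field have already been collapsed into a single offset-to-value map; `read_rows` and its parameters are hypothetical and do not mirror `new_read_impl`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns `(physical_offset, value)` for each requested row that survives
/// the deletion filter. `offsets[i]` is the physical offset of `base[i]`;
/// `None` models NULL.
pub fn read_rows(
    offsets: &[u64],
    base: &[Option<i64>],
    overlay: &BTreeMap<u64, Option<i64>>,
    deleted: &BTreeSet<u64>,
) -> Vec<(u64, Option<i64>)> {
    // Step 1: merge overlay values into the base by physical offset.
    // This works the same for a contiguous scan range and for arbitrary
    // take indices, because each row carries its own offset.
    let merged: Vec<(u64, Option<i64>)> = offsets
        .iter()
        .zip(base)
        .map(|(&offset, base_value)| {
            (offset, overlay.get(&offset).copied().unwrap_or(*base_value))
        })
        .collect();

    // Step 2: apply the deletion filter. An overlay value computed for a
    // deleted offset is simply dropped together with its row.
    merged
        .into_iter()
        .filter(|(offset, _)| !deleted.contains(offset))
        .collect()
}
```

Because the filter runs last, the merge needs no knowledge of deletions at all.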
Important This PR touches the Lance format specification. Substantive changes to the format specification — the If this is a meaningful format change:
Resolves the `take` random-access path against data overlay files, replacing the temporary "overlays not supported" error from #7407 (OSS-1322). Implements OSS-1324.
Because `take` and scan share `FragmentReader::new_read_impl`, the merge is wired there once: each row is addressed by its physical offset (from `ReadBatchParams::to_offsets_total`) and resolved against the overlays that cover its field. This necessarily also enables the scan-path merge that #7407 stubbed out.
How it works
`FileFragment::open` loads, for each projected field, the overlay value columns that cover it, ordered newest-first (by `committed_version`, list-position tiebreak).
Overlays on nested (non-top-level) fields are not yet matched and are left for follow-up.
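The newest-first ordering can be sketched as a sort over list indices. This is an illustration only: `newest_first_indices` is a hypothetical helper, and the direction of the tiebreak (a later list position counts as newer, on the reasoning that overlays are appended) is an assumption the description does not spell out.

```rust
/// Returns indices into a fragment's overlay list, newest first.
/// `committed_versions[i]` is the committed version of the overlay at
/// list position `i`.
pub fn newest_first_indices(committed_versions: &[u64]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..committed_versions.len()).collect();
    indices.sort_by(|&a, &b| {
        committed_versions[b]
            // Higher committed_version sorts first.
            .cmp(&committed_versions[a])
            // Assumed tiebreak: the later list position sorts first.
            .then(b.cmp(&a))
    });
    indices
}
```

For example, versions `[3, 5, 5, 1]` give the order `[2, 1, 0, 3]`: the two overlays at version 5 come first, with the later list position ahead.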
Tests
take covered/uncovered offsets; multiple overlays (newest wins); per-field coverage with unequal-length columns; NULL override; overlay on a deleted row (inert); multi-fragment scan — each over v2.0 and v2.1. Plus unit tests for the offset-based core and the batch merge.
Stacking
Stacked on #7406 (OSS-1323) and #7407 (OSS-1322). Until those merge, this PR's diff against `main` includes their commits; review only the final commit ("resolve data overlay files on the take and scan read paths").
🤖 Generated with Claude Code