
feat: resolve data overlay files on the take (and scan) read path #7409

Draft
wjones127 wants to merge 8 commits into lance-format:main from wjones127:will/oss-1324-take-can-read-overlays


@wjones127


Resolves the take random-access path against data overlay files, replacing the temporary "overlays not supported" error from #7407 (OSS-1322). Implements OSS-1324.

Because take and scan share FragmentReader::new_read_impl, the merge is wired there once: each row is addressed by its physical offset (from ReadBatchParams::to_offsets_total) and resolved against the overlays that cover its field. This necessarily also enables the scan-path merge that #7407 stubbed out.
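To illustrate why one merge point can serve both paths, here is a minimal sketch of turning either request shape into explicit per-row physical offsets. `ReadParams` and its `to_offsets` method are hypothetical stand-ins written for this example; they are not Lance's `ReadBatchParams` API, whose variants and signature are not shown in this PR description.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the read parameters; not Lance's actual type.
#[derive(Debug, Clone)]
enum ReadParams {
    /// Contiguous physical row range, as a scan would request.
    Range(Range<u32>),
    /// Arbitrary physical row offsets, as a take would request.
    Indices(Vec<u32>),
}

impl ReadParams {
    /// Expand either shape into one explicit physical offset per row read,
    /// preserving read order (including repeats and out-of-order indices).
    fn to_offsets(&self) -> Vec<u32> {
        match self {
            ReadParams::Range(r) => r.clone().collect(),
            ReadParams::Indices(idx) => idx.clone(),
        }
    }
}
```

Once both shapes reduce to a plain list of offsets, the merge step downstream does not need to know whether the rows came from a scan or a take.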

How it works

  • FileFragment::open loads, for each projected field, the overlay value columns that cover it, ordered newest-first (by committed_version, list-position tiebreak).
  • The merge runs on physical rows in read order, before deletion filtering, so:
    • deletions take precedence (an overlay value computed for a deleted row is dropped with the row),
    • NULL overrides apply (a covered offset with a NULL value resolves to NULL, distinct from fall-through),
    • fields resolve independently.
  • The resolution core now takes explicit per-row physical offsets instead of a contiguous start, giving one code path for the contiguous scan range and arbitrary take indices.
  • Sparse per-field overlays read each field's value column independently, so unequal-length value columns (#7406 / OSS-1323) need no rectangular batch. Addressing is rank-based only: a rank on the coverage bitmap plus a value fetch, with no offset key column and no binary search.
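The resolution rules in the list above can be sketched as follows. This is a simplified model under stated assumptions, not the code in this PR: `Overlay`, `get`, and `resolve_column` are hypothetical names, a `BTreeSet` stands in for the coverage bitmap, and values are plain `Option<i64>` rather than Arrow arrays.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory overlay for a single field.
struct Overlay {
    /// Covered physical row offsets (stand-in for the coverage bitmap).
    coverage: BTreeSet<u32>,
    /// One value per covered offset, in ascending offset order.
    /// `None` is a NULL override, not a fall-through.
    values: Vec<Option<i64>>,
}

impl Overlay {
    /// Rank-based lookup: the value index is the number of covered offsets
    /// strictly below `offset`. Returns `None` when the offset is uncovered.
    fn get(&self, offset: u32) -> Option<Option<i64>> {
        if !self.coverage.contains(&offset) {
            return None;
        }
        let rank = self.coverage.range(..offset).count();
        Some(self.values[rank])
    }
}

/// Resolve one field for the rows at `offsets`, where `base[i]` is the base
/// value read for `offsets[i]`. `overlays` must be ordered newest first.
fn resolve_column(
    base: &[Option<i64>],
    offsets: &[u32],
    overlays: &[Overlay],
) -> Vec<Option<i64>> {
    offsets
        .iter()
        .zip(base)
        .map(|(&off, &b)| {
            // The first (newest) covering overlay wins; otherwise keep the base.
            overlays.iter().find_map(|o| o.get(off)).unwrap_or(b)
        })
        .collect()
}
```

The nested `Option<Option<i64>>` return is what keeps "covered with NULL" distinct from "not covered". Because the function takes explicit offsets, the same call handles a contiguous scan range and arbitrary take indices.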

Overlays on nested (non-top-level) fields are not yet matched and are left for follow-up.

Tests

Integration tests cover take of covered and uncovered offsets; multiple overlays (newest wins); per-field coverage with unequal-length columns; NULL override; an overlay on a deleted row (inert); and a multi-fragment scan. Each runs over v2.0 and v2.1 files. Unit tests cover the offset-based core and the batch merge.

Stacking

Stacked on #7406 (OSS-1323) and #7407 (OSS-1322). Until those merge, this PR's diff against main includes their commits; review only the final commit ("resolve data overlay files on the take and scan read paths").

🤖 Generated with Claude Code

wjones127 and others added 8 commits June 22, 2026 18:07
Add a specification for data overlay files: small files attached to a
fragment that supply new values for a subset of (row offset, field) cells
without rewriting the base data files, for cheap cell-level updates.

- protos/table.proto: rework DataOverlayFile with a dense/sparse coverage
  oneof (shared_offset_bitmap vs new FieldCoverage), rename read_version to
  committed_version (effective, commit-stamped), and document rank-based
  addressing with no offset column. Document reader feature flag 64.
- docs: add data_overlay_file.md (full spec, worked example, guidance stub)
  and link it from the table format overview.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the `DataOverlay` operation (and `DataOverlayGroup`) to attach overlay
files to fragments without rewriting their base data. Mirrors the
`DataReplacement` batch shape, appends to each fragment's `overlays` list, and
documents permissive conflict semantics: concurrent overlays, appends, deletes,
and column rewrites are compatible; row-rewrites, compaction, and overlay->base
folds conflict.

committed_version is left 0 by the writer and stamped at commit time.

Proto only — Rust/Python bindings deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The table/transaction proto changes generate new fields and an Operation
variant. This wires the minimum needed to compile without implementing overlay
support:

- Emit empty `overlays` when converting fragments to proto.
- Reject the `DataOverlay` transaction operation with NotSupported on read.

Datasets that use overlays set reader feature flag 64, which already falls in
the unknown-flag range rejected by `can_read_dataset`, so the library refuses
them at the feature-flag layer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the in-memory + commit machinery for data overlay files (per the spec in
lance-format#7381), the foundation the scanner/take/index/compaction work builds on.

- `DataOverlayFile` / `OverlayCoverage` (dense `shared_offset_bitmap` and sparse
  per-field) with protobuf round-trip, attached to `Fragment.overlays`.
- Reader feature flag 64 (`FLAG_DATA_OVERLAY_FILES`): set whenever any fragment
  carries overlays, so a reader that does not understand them refuses the
  dataset instead of returning stale base values.
- `Operation::DataOverlay` transaction op: appends overlays to a fragment's
  list (preserving concurrently-written overlays) and stamps each overlay's
  `committed_version` to the new dataset version at commit time (re-stamped on
  retry). Conflict rules mirror DataReplacement — permissive against appends,
  deletes, column rewrites, index builds, and other overlays; conflicts only
  with row-rewriting compaction of the same fragment.
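The append-and-stamp behaviour described for `Operation::DataOverlay` can be modelled roughly like this. `OverlayMeta` and `apply_overlays` are hypothetical names for this sketch; the actual commit code and struct fields are not shown in this message.

```rust
/// Hypothetical, simplified overlay metadata.
#[derive(Debug, Clone, PartialEq)]
struct OverlayMeta {
    path: String,
    /// Left as 0 by the writer; set to the new dataset version at commit.
    committed_version: u64,
}

/// Append `new` overlays to a fragment's current list and stamp each with
/// `new_version`. Entries already in `existing`, including overlays committed
/// concurrently by other writers, are left untouched.
fn apply_overlays(existing: &mut Vec<OverlayMeta>, new: Vec<OverlayMeta>, new_version: u64) {
    for mut overlay in new {
        overlay.committed_version = new_version;
        existing.push(overlay);
    }
}
```

On a commit retry, the same function would be applied to the freshly loaded fragment list with the next version number, which is how the stamp ends up matching the version that actually lands.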

Scan-side merge, take, and end-to-end write+read tests follow in the same PR
branch.

Part of the Data Overlay Files feature (OSS-1322).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Now that overlays can be committed, a scan or take over a fragment that has
overlays would silently return stale base values, since the read-path merge is
not implemented yet. Refuse such reads at `FileFragment::open` with a clear
error instead of serving incorrect data. Lifted once the scan/take merge lands
(rest of OSS-1322 / OSS-1324).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds `dataset::overlay`, the tested heart of reading overlays: given a base
column for a physical row range and the overlays covering a field (newest
first), `resolve_overlay_column` produces the merged column. An offset is
resolved to the newest covering overlay's value at the offset's rank in the
coverage bitmap; an uncovered offset falls through to the base; a covered
offset whose value is NULL overrides the cell to NULL. `overlay_indices_newest_first`
orders a fragment's overlays by `committed_version` then list position.
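A minimal sketch of the ordering step, assuming that among overlays with equal `committed_version` the one later in the fragment's list counts as newer (the message states only that list position breaks ties, so the direction here is an assumption). `indices_newest_first` is a hypothetical simplification that takes bare version numbers rather than overlay structs.

```rust
/// Return overlay indices ordered newest first.
/// Assumption: for equal versions, the later list position is treated as newer.
fn indices_newest_first(committed_versions: &[u64]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..committed_versions.len()).collect();
    idx.sort_by(|&a, &b| {
        committed_versions[b]
            .cmp(&committed_versions[a]) // higher version first
            .then(b.cmp(&a)) // then later list position first
    });
    idx
}
```

Returning indices rather than reordered overlays keeps the fragment's stored list unchanged while still giving the resolver a newest-first view.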

Deletion precedence needs no handling here: the merge runs before the deletion
filter, so an overlay value for a deleted offset is computed and dropped with
the row. Wiring this into the scan stream and `take` follows on this branch.
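The ordering argument can be seen in a small sketch: if the merge runs first and the deletion filter second, an overlay value for a deleted offset is computed and then discarded together with its row. `read_rows` is a hypothetical helper that uses a single overlay modelled as a `HashMap` and a `HashSet` of deleted offsets; neither reflects Lance's real data structures.

```rust
use std::collections::{HashMap, HashSet};

/// Merge overlay values into base values, then drop deleted rows.
/// `base[i]` is the base value for `offsets[i]`; `overlay` maps a physical
/// offset to its overriding value (hypothetical, single overlay only).
fn read_rows(
    base: &[i64],
    offsets: &[u32],
    overlay: &HashMap<u32, i64>,
    deleted: &HashSet<u32>,
) -> Vec<(u32, i64)> {
    offsets
        .iter()
        .zip(base)
        // Step 1: merge on physical rows, ignoring deletions entirely.
        .map(|(&off, &b)| (off, overlay.get(&off).copied().unwrap_or(b)))
        // Step 2: the deletion filter removes the row and any merged value.
        .filter(|(off, _)| !deleted.contains(off))
        .collect()
}
```

The merge step never consults the deletion set, which is why the core needs no special case for deleted rows.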

Unit tests cover rank addressing, multi-overlay precedence, NULL override vs.
fall-through, physical-offset base, string columns, and ordering.

Part of OSS-1322.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The v2 file writer advanced every column from a single global row counter,
so a single file could only hold columns of equal length. Sparse data
overlay files need columns whose item counts differ within one file (each
field covers a different set of rows).

Add `FileWriter::write_columns`, which writes a set of `(field, array)`
pairs and advances each field's row counter independently, leaving other
fields untouched. A field never written ends up as a zero-length column.
`write_batch` is unchanged: it still advances all fields together, so
ordinary rectangular files round-trip exactly as before.

Per-column lengths were already derivable from page metadata; expose them
via `FileReader::column_num_rows`. The reader already schedules each column
from its own pages, so reading a column at its own length and random access
within it work without further changes.
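The difference between the two write paths can be modelled with row counters alone. `ColumnCounters` below is a hypothetical toy that tracks only per-field row counts; it borrows the method names `write_batch`, `write_columns`, and `column_num_rows` from this message for readability, but its signatures are invented for the sketch and are not the `FileWriter` or `FileReader` API.

```rust
use std::collections::BTreeMap;

/// Hypothetical writer model that tracks row counts only, not data.
struct ColumnCounters {
    /// Field id -> number of rows written so far.
    rows: BTreeMap<u32, u64>,
}

impl ColumnCounters {
    /// Start every field in the schema at zero rows.
    fn new(field_ids: &[u32]) -> Self {
        Self { rows: field_ids.iter().map(|&f| (f, 0)).collect() }
    }

    /// Rectangular write: every field advances by the same count.
    fn write_batch(&mut self, num_rows: u64) {
        for count in self.rows.values_mut() {
            *count += num_rows;
        }
    }

    /// Per-field write: only the named fields advance, each by its own
    /// array length; all other fields are left untouched.
    fn write_columns(&mut self, columns: &[(u32, u64)]) {
        for &(field_id, len) in columns {
            *self.rows.entry(field_id).or_insert(0) += len;
        }
    }

    /// Length of one column, if the field exists.
    fn column_num_rows(&self, field_id: u32) -> Option<u64> {
        self.rows.get(&field_id).copied()
    }
}
```

In this model a field that is never passed to `write_columns` stays at zero rows, matching the zero-length column described above, while `write_batch` keeps all fields in lockstep.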

Part of the Data Overlay Files feature (OSS-1323).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the overlay cell-resolution core into reads so `take` (and scan)
return merged values. `FileFragment::open` loads each projected field's
overlay value columns newest-first, and `new_read_impl`/`read_ranges`
merge them into base batches by physical offset before deletion
filtering — so resolution is identical for take and scan, NULL overrides
apply, deletions take precedence (an overlay value for a deleted row is
dropped with the row), and fields resolve independently.

The merge addresses each row by its physical offset (via
`ReadBatchParams::to_offsets_total`), so the resolution core now takes
explicit per-row offsets instead of a contiguous start — a single code
path for the contiguous scan range and arbitrary take indices. Sparse
per-field overlays read each field's value column independently, so
unequal-length columns (OSS-1323) are handled without materializing a
rectangular batch.

Removes the temporary "overlays not supported" guard from OSS-1322.

Tests: take of covered/uncovered offsets, multiple overlays
(newest-wins), per-field coverage with unequal-length columns, NULL
override, overlay on a deleted row (inert), and multi-fragment scan —
all over v2.0 and v2.1 files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions


Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions Bot added the A-encoding (Encoding, IO, file reader/writer), A-format (On-disk format: protos and format spec docs), and enhancement (New feature or request) labels Jun 23, 2026
@codecov

codecov Bot commented Jun 23, 2026

