feat: expose tracked_files and all_files on LanceDataset #6011
wjones127 wants to merge 7 commits into lance-format:main
Conversation
Adds two new public methods on `Dataset` in a new `dataset/files` module:

- `tracked_files()`: returns one row per (version, file) for every file referenced across all manifests, with columns `version`, `base_uri`, `path`, and `type` (data file / manifest / deletion file / transaction file / index file).
- `all_files()`: returns one row per physical file under the dataset root, with columns `base_uri`, `path`, `size_bytes`, and `last_modified`.

Both return `SendableRecordBatchStream` and use dictionary encoding on repeated string columns. `tracked_files` processes manifests concurrently via `buffer_unordered` and handles external `base_paths`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
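The bounded-concurrency manifest scan can be sketched in Python with asyncio. This is an analog of the Rust `buffer_unordered` pattern, not the real lance API: `MANIFESTS`, `load_manifest`, and the row shape are illustrative stand-ins.

```python
import asyncio

MANIFESTS = [1, 2, 3, 4, 5]  # stand-in for manifest version numbers

async def load_manifest(version: int) -> list[tuple[int, str]]:
    # Stand-in for reading one manifest; yields (version, path) rows.
    await asyncio.sleep(0)  # simulate I/O
    return [(version, f"data/file-{version}.lance")]

async def tracked_files(concurrency: int = 2) -> list[tuple[int, str]]:
    # A semaphore bounds in-flight manifest reads, and as_completed yields
    # results in completion order -- together they mimic buffer_unordered.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(version: int) -> list[tuple[int, str]]:
        async with sem:
            return await load_manifest(version)

    rows: list[tuple[int, str]] = []
    for fut in asyncio.as_completed([bounded(v) for v in MANIFESTS]):
        rows.extend(await fut)
    return rows

rows = asyncio.run(tracked_files())
```

Because completion order is nondeterministic, consumers should not rely on version ordering, which is also true of the Rust stream.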
- Replace schema fns with `static LazyLock<SchemaRef>` so schema objects are created once and shared.
- Add `FileRow` struct and `TrackedFileBatch::with_capacity` + `extend`, pre-sizing Arrow buffers based on the per-manifest row count.
- `tracked_files`: Phase 1 spawns a task that drives `buffer_unordered` manifest processing, streams non-index batches via `mpsc`, and sends the UUID→versions map via `oneshot` when done; Phase 2 is driven by a `try_unfold` state machine on the caller side that receives the map from the oneshot, then eagerly lists each index-UUID directory.
- `all_files`: spawn a task that drives `read_dir_all` and sends pre-sized batches via `mpsc`; the caller side wraps the receiver with `try_unfold` for error-propagating lazy streaming.
- Both streams now use `stream::try_unfold` so errors propagate immediately without collecting all data into memory first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
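The two-channel handoff in Phase 1 (batches over `mpsc`, the UUID→versions map over `oneshot`) can be sketched with asyncio primitives. The producer, the sentinel, and the map contents below are hypothetical; an `asyncio.Queue` stands in for the `mpsc` channel and a `Future` for the `oneshot`.

```python
import asyncio

async def producer(tx: asyncio.Queue, done: asyncio.Future) -> None:
    # Phase 1: stream batches through the channel (mpsc analog)...
    for batch in (["a"], ["b"]):
        await tx.put(batch)
    await tx.put(None)  # sentinel: end of stream
    # ...then hand the index-UUID -> versions map over the oneshot analog.
    done.set_result({"uuid-1": [1, 2]})

async def consume() -> tuple[list, dict]:
    tx: asyncio.Queue = asyncio.Queue(maxsize=4)  # bounded, like mpsc
    done: asyncio.Future = asyncio.get_running_loop().create_future()
    task = asyncio.create_task(producer(tx, done))

    batches = []
    # try_unfold analog: pull until the producer signals end-of-stream.
    while (batch := await tx.get()) is not None:
        batches.append(batch)
    index_map = await done  # Phase 2 input arrives via the oneshot
    await task
    return batches, index_map

batches, index_map = asyncio.run(consume())
```

The design choice mirrors the Rust code: data flows incrementally through the bounded channel, while the small summary map is delivered exactly once at the end.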
Adds two new Dataset methods:

- `tracked_files()`: streams one row per (version, file) for every file referenced across all manifest versions, including data, deletion, transaction, manifest, and index files.
- `all_files()`: streams one row per file physically present at the dataset's base URI with size and last-modified metadata.

Index UUID directories are listed in parallel via `buffer_unordered` and cached across manifest versions to avoid redundant listings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
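The cross-version caching of index-UUID listings can be sketched as a simple memo table. Here `list_dir`, the UUIDs, and the returned paths are illustrative stand-ins for the real object-store listing, and a call counter makes the deduplication observable.

```python
list_calls = 0

def list_dir(uuid: str) -> list[str]:
    # Stand-in for an object-store listing of one index directory.
    global list_calls
    list_calls += 1
    return [f"_indices/{uuid}/index.idx"]

cache: dict[str, list[str]] = {}

def list_index_dir(uuid: str) -> list[str]:
    # Each UUID directory is listed at most once, then served from the cache,
    # even when many manifest versions reference the same index.
    if uuid not in cache:
        cache[uuid] = list_dir(uuid)
    return cache[uuid]

# Two manifest versions referencing overlapping UUIDs trigger two listings,
# not three: "abc" is reused from the cache on the second version.
for version_uuids in (["abc"], ["abc", "def"]):
    for uuid in version_uuids:
        list_index_dir(uuid)
```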
Previously, errors from `?` inside the spawned task (e.g. a network error reading a manifest) would be stored in the dropped JoinHandle and silently discarded, causing the stream to end with no data and no error signal. Now the inner logic runs as a typed Result block and any failure is forwarded to the channel before the task exits. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
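The fix can be sketched in asyncio terms: run the fallible body as a single block and forward any error into the channel, rather than letting it die with the task handle (the analog of the dropped `JoinHandle`). All names here are hypothetical.

```python
import asyncio

async def read_manifests() -> None:
    # Stand-in for the fallible inner logic (e.g. a network read).
    raise IOError("network error reading manifest")

async def worker(tx: asyncio.Queue) -> None:
    # Run the inner logic as one fallible block; forward any failure into
    # the channel instead of letting it vanish when the task is dropped.
    try:
        await read_manifests()
        await tx.put(("ok", None))
    except Exception as exc:
        await tx.put(("err", exc))

async def main() -> tuple:
    tx: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(worker(tx))
    # The receiver now sees either data or the error -- never a silent EOF.
    return await tx.get()

result = asyncio.run(main())
```

With the old shape, the exception would live only in the unawaited task and the receiver would observe an empty, apparently successful stream.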
Adds PyO3 bindings for both Rust methods and Python wrappers with docstrings. Returns a `pa.RecordBatchReader` in both cases.
PR Review

P1 Issues

1. Accidental test assertion removal

The diff removes an existing assertion:

```python
- assert "id" in result.column_names
```

This appears unintentional; the assertion belongs to the previous test and should be preserved.

2. Documentation mismatch

The doc comment disagrees with the actual schema, which uses:

```rust
DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8))
```

The doc comment should be updated to match the schema.

Otherwise, the implementation looks solid with good test coverage, efficient batching, and proper error propagation through the channel pattern.
There will be some follow-ups to make this useful:
Adds new `tracked_files()` and `all_files()` methods that return data about files in a table. Both return as Arrow data.

`tracked_files()` outputs a row for every file referenced by each version. Files that are referenced by multiple versions (such as a data file) have a row for each version. This has columns for `base_uri`, `version`, `path`, and `file_type`.

`all_files()` outputs a row for every file in the dataset root directory, whether or not it is part of the table. This has columns for `base_uri`, `path`, `file_size`, and `last_modified`.

These two data streams can be used in combination to do deeper analysis on the file structure of a table. They can answer questions like: How much of the storage space is taken up by untracked files? When were untracked files created? Which files are taking up the most space? How big is version X?
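A rough sketch of the kind of analysis these two streams enable, using plain dicts as stand-ins for the Arrow batches; all paths and sizes are made up.

```python
# Hypothetical rows as they might come out of tracked_files() / all_files().
tracked = [
    {"version": 1, "path": "data/a.lance"},
    {"version": 2, "path": "data/a.lance"},  # same file, two versions
    {"version": 2, "path": "data/b.lance"},
]
all_files = [
    {"path": "data/a.lance", "file_size": 100},
    {"path": "data/b.lance", "file_size": 250},
    {"path": "data/orphan.lance", "file_size": 999},  # not in any manifest
]

# "How much storage is taken up by untracked files?"
tracked_paths = {row["path"] for row in tracked}
untracked = [f for f in all_files if f["path"] not in tracked_paths]
untracked_bytes = sum(f["file_size"] for f in untracked)

# "How big is version 2?" -- join tracked rows for one version against sizes.
sizes = {f["path"]: f["file_size"] for f in all_files}
version_2_bytes = sum(
    sizes[row["path"]] for row in tracked if row["version"] == 2
)
```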