
feat: expose tracked_files and all_files on LanceDataset #6011

Open

wjones127 wants to merge 7 commits into lance-format:main from wjones127:feat/dataset-file-inspection-apis

Conversation

@wjones127 (Contributor) commented Feb 25, 2026

Adds new tracked_files() and all_files() methods that return data about the files in a table. Both return their results as Arrow data.

tracked_files() outputs a row for every file referenced by each version; a file referenced by multiple versions (such as a data file) gets one row per version. Its columns are base_uri, version, path, and file_type.

all_files() outputs a row for every file under the dataset root directory, whether or not it is part of the table. Its columns are base_uri, path, file_size, and last_modified.

Used in combination, these two streams support deeper analysis of a table's file structure. They can answer questions like: How much storage space is taken up by untracked files? When were untracked files created? Which files take up the most space? How big is version X?

wjones127 and others added 7 commits February 23, 2026 15:18
Adds two new public methods on `Dataset` in a new `dataset/files` module:

- `tracked_files()`: returns one row per (version, file) for every file
  referenced across all manifests, with columns `version`, `base_uri`,
  `path`, and `type` (data file / manifest / deletion file / transaction
  file / index file).
- `all_files()`: returns one row per physical file under the dataset root,
  with columns `base_uri`, `path`, `size_bytes`, and `last_modified`.

Both return `SendableRecordBatchStream` and use dictionary encoding on
repeated string columns. `tracked_files` processes manifests concurrently
via `buffer_unordered` and handles external `base_paths`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace schema fns with `static LazyLock<SchemaRef>` so schema objects
  are created once and shared.
- Add `FileRow` struct and `TrackedFileBatch::with_capacity` + `extend`,
  pre-sizing Arrow buffers based on the per-manifest row count.
- `tracked_files`: Phase 1 spawns a task that drives `buffer_unordered`
  manifest processing, streams non-index batches via `mpsc`, and sends
  the UUID→versions map via `oneshot` when done; Phase 2 is driven by
  a `try_unfold` state machine on the caller side that receives the map
  from the oneshot then eagerly lists each index-UUID directory.
- `all_files`: spawn a task that drives `read_dir_all` and sends
  pre-sized batches via `mpsc`; caller side wraps the receiver with
  `try_unfold` for error-propagating lazy streaming.
- Both streams now use `stream::try_unfold` so errors propagate
  immediately without collecting all data into memory first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds two new Dataset methods:
- `tracked_files()`: streams one row per (version, file) for every file
  referenced across all manifest versions, including data, deletion,
  transaction, manifest, and index files.
- `all_files()`: streams one row per file physically present at the
  dataset's base URI with size and last-modified metadata.

Index UUID directories are listed in parallel via `buffer_unordered` and
cached across manifest versions to avoid redundant listings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, errors from `?` inside the spawned task (e.g. a network
error reading a manifest) would be stored in the dropped JoinHandle and
silently discarded, causing the stream to end with no data and no error
signal. Now the inner logic runs as a typed Result block and any failure
is forwarded to the channel before the task exits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
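The fix described above follows a general pattern: a worker streams results through a channel, and any failure is sent through that same channel rather than being dropped with the worker. The Rust code uses tokio mpsc and `stream::try_unfold`; here is a stdlib Python analogue of the same shape (all names are illustrative, not from the PR):

```python
import queue
import threading

def produce(chan, manifests, fail_at=None):
    """Producer task: stream batches through the channel. Any error is
    forwarded through the channel instead of dying with the worker."""
    try:
        for i, m in enumerate(manifests):
            if fail_at is not None and i == fail_at:
                raise IOError("network error reading manifest")
            chan.put(("batch", m))
        chan.put(("done", None))
    except Exception as exc:
        chan.put(("error", exc))  # forwarded, not silently discarded

def consume(chan):
    """Consumer side, analogous to the try_unfold loop: yield batches
    until the producer signals done, re-raising any forwarded error."""
    while True:
        kind, payload = chan.get()
        if kind == "batch":
            yield payload
        elif kind == "error":
            raise payload
        else:
            return
```

Without the `except` branch, a mid-stream failure would look to the consumer like a short but successful stream, which is exactly the silent-truncation bug the commit fixes.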
Adds PyO3 bindings for both Rust methods and Python wrappers with
docstrings. Returns a `pa.RecordBatchReader` in both cases.
@github-actions github-actions bot added enhancement New feature or request python labels Feb 25, 2026
@github-actions bot commented:

PR Review

P1 Issues

1. Accidental test assertion removal

The diff removes an existing assertion from test_default_scan_options_nearest:

-    assert "id" in result.column_names

This appears unintentional - the assertion belongs to the previous test and should be preserved.

2. Documentation mismatch in tracked_files() Rust doc comment

The doc comment at rust/lance/src/dataset/files.rs:373 states:

| `type` | `Dictionary(Int32, Utf8)` (non-null) | ...

But the actual schema uses Int8 for the dictionary key:

DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8))

The doc should say Dictionary(Int8, Utf8).


Otherwise, the implementation looks solid with good test coverage, efficient batching, and proper error propagation through the channel pattern.

@wjones127 changed the title from "feat(python): expose tracked_files and all_files on LanceDataset" to "feat: expose tracked_files and all_files on LanceDataset" on Feb 25, 2026
codecov bot commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 92.92731% with 36 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/files.rs | 91.50% | 23 Missing and 13 partials ⚠️ |


@wjones127 wjones127 marked this pull request as ready for review February 25, 2026 21:53
@wjones127 (Contributor, Author) commented:

There will be some follow ups to make this useful:

  • Add progress reporting (re-use the index progress stuff)
  • Optimize speed of listing all files
  • Optimize speed of listing tracked files
