feat: expose tracked_files and all_files on LanceDataset #6011
wjones127 wants to merge 7 commits into lance-format:main
Conversation
Adds two new public methods on `Dataset` in a new `dataset/files` module:

- `tracked_files()`: returns one row per (version, file) for every file referenced across all manifests, with columns `version`, `base_uri`, `path`, and `type` (data file / manifest / deletion file / transaction file / index file).
- `all_files()`: returns one row per physical file under the dataset root, with columns `base_uri`, `path`, `size_bytes`, and `last_modified`.

Both return `SendableRecordBatchStream` and use dictionary encoding on repeated string columns. `tracked_files` processes manifests concurrently via `buffer_unordered` and handles external `base_paths`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
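The bounded-concurrency manifest scan can be sketched in Python with asyncio. This is an analog of the Rust `buffer_unordered` pattern, not the real lance API: `MANIFESTS`, `load_manifest`, and the row shape are illustrative stand-ins.

```python
import asyncio

MANIFESTS = [1, 2, 3, 4, 5]  # stand-in for manifest version numbers

async def load_manifest(version: int) -> list[tuple[int, str]]:
    # Stand-in for reading one manifest; yields (version, path) rows.
    await asyncio.sleep(0)  # simulate I/O
    return [(version, f"data/file-{version}.lance")]

async def tracked_files(concurrency: int = 2) -> list[tuple[int, str]]:
    # A semaphore bounds in-flight manifest reads, and as_completed yields
    # results in completion order -- together they mimic buffer_unordered.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(version: int) -> list[tuple[int, str]]:
        async with sem:
            return await load_manifest(version)

    rows: list[tuple[int, str]] = []
    for fut in asyncio.as_completed([bounded(v) for v in MANIFESTS]):
        rows.extend(await fut)
    return rows

rows = asyncio.run(tracked_files())
```

Because completion order is nondeterministic, consumers should not rely on version ordering, which is also true of the Rust stream.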
- Replace schema fns with `static LazyLock<SchemaRef>` so schema objects are created once and shared.
- Add `FileRow` struct and `TrackedFileBatch::with_capacity` + `extend`, pre-sizing Arrow buffers based on the per-manifest row count.
- `tracked_files`: Phase 1 spawns a task that drives `buffer_unordered` manifest processing, streams non-index batches via `mpsc`, and sends the UUID→versions map via `oneshot` when done; Phase 2 is driven by a `try_unfold` state machine on the caller side that receives the map from the oneshot, then eagerly lists each index-UUID directory.
- `all_files`: spawn a task that drives `read_dir_all` and sends pre-sized batches via `mpsc`; the caller side wraps the receiver with `try_unfold` for error-propagating lazy streaming.
- Both streams now use `stream::try_unfold` so errors propagate immediately without collecting all data into memory first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
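The two-channel handoff in Phase 1 (batches over `mpsc`, the UUID→versions map over `oneshot`) can be sketched with asyncio primitives. The producer, the sentinel, and the map contents below are hypothetical; an `asyncio.Queue` stands in for the `mpsc` channel and a `Future` for the `oneshot`.

```python
import asyncio

async def producer(tx: asyncio.Queue, done: asyncio.Future) -> None:
    # Phase 1: stream batches through the channel (mpsc analog)...
    for batch in (["a"], ["b"]):
        await tx.put(batch)
    await tx.put(None)  # sentinel: end of stream
    # ...then hand the index-UUID -> versions map over the oneshot analog.
    done.set_result({"uuid-1": [1, 2]})

async def consume() -> tuple[list, dict]:
    tx: asyncio.Queue = asyncio.Queue(maxsize=4)  # bounded, like mpsc
    done: asyncio.Future = asyncio.get_running_loop().create_future()
    task = asyncio.create_task(producer(tx, done))

    batches = []
    # try_unfold analog: pull until the producer signals end-of-stream.
    while (batch := await tx.get()) is not None:
        batches.append(batch)
    index_map = await done  # Phase 2 input arrives via the oneshot
    await task
    return batches, index_map

batches, index_map = asyncio.run(consume())
```

The design choice mirrors the Rust code: data flows incrementally through the bounded channel, while the small summary map is delivered exactly once at the end.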
Adds two new Dataset methods:

- `tracked_files()`: streams one row per (version, file) for every file referenced across all manifest versions, including data, deletion, transaction, manifest, and index files.
- `all_files()`: streams one row per file physically present at the dataset's base URI with size and last-modified metadata.

Index UUID directories are listed in parallel via `buffer_unordered` and cached across manifest versions to avoid redundant listings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
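The cross-version caching of index-UUID listings can be sketched as a simple memo table. Here `list_dir`, the UUIDs, and the returned paths are illustrative stand-ins for the real object-store listing, and a call counter makes the deduplication observable.

```python
list_calls = 0

def list_dir(uuid: str) -> list[str]:
    # Stand-in for an object-store listing of one index directory.
    global list_calls
    list_calls += 1
    return [f"_indices/{uuid}/index.idx"]

cache: dict[str, list[str]] = {}

def list_index_dir(uuid: str) -> list[str]:
    # Each UUID directory is listed at most once, then served from the cache,
    # even when many manifest versions reference the same index.
    if uuid not in cache:
        cache[uuid] = list_dir(uuid)
    return cache[uuid]

# Two manifest versions referencing overlapping UUIDs trigger two listings,
# not three: "abc" is reused from the cache on the second version.
for version_uuids in (["abc"], ["abc", "def"]):
    for uuid in version_uuids:
        list_index_dir(uuid)
```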
Previously, errors from `?` inside the spawned task (e.g. a network error reading a manifest) would be stored in the dropped JoinHandle and silently discarded, causing the stream to end with no data and no error signal. Now the inner logic runs as a typed Result block and any failure is forwarded to the channel before the task exits. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
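The fix can be sketched in asyncio terms: run the fallible body as a single block and forward any error into the channel, rather than letting it die with the task handle (the analog of the dropped `JoinHandle`). All names here are hypothetical.

```python
import asyncio

async def read_manifests() -> None:
    # Stand-in for the fallible inner logic (e.g. a network read).
    raise IOError("network error reading manifest")

async def worker(tx: asyncio.Queue) -> None:
    # Run the inner logic as one fallible block; forward any failure into
    # the channel instead of letting it vanish when the task is dropped.
    try:
        await read_manifests()
        await tx.put(("ok", None))
    except Exception as exc:
        await tx.put(("err", exc))

async def main() -> tuple:
    tx: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(worker(tx))
    # The receiver now sees either data or the error -- never a silent EOF.
    return await tx.get()

result = asyncio.run(main())
```

With the old shape, the exception would live only in the unawaited task and the receiver would observe an empty, apparently successful stream.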
Adds PyO3 bindings for both Rust methods and Python wrappers with docstrings. Returns a `pa.RecordBatchReader` in both cases.
PR Review

P1 Issues

1. Accidental test assertion removal

The diff removes an existing assertion:

```python
- assert "id" in result.column_names
```

This appears unintentional; the assertion belongs to the previous test and should be preserved.

2. Documentation mismatch

The doc comment disagrees with the actual schema, which uses:

```rust
DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8))
```

The doc comment should be updated to match the schema.

Otherwise, the implementation looks solid with good test coverage, efficient batching, and proper error propagation through the channel pattern.
There will be some follow-ups to make this useful:
Adds new `tracked_files()` and `all_files()` methods that return data about files in a table. Both return as Arrow data.

`tracked_files()` outputs a row for every file referenced by each version. Files that are referenced by multiple versions (such as a data file) have a row for each version. This has columns for `base_uri`, `version`, `path`, and `file_type`.

`all_files()` outputs a row for every file in the dataset root directory, whether or not it is part of the table. This has columns for `base_uri`, `path`, `file_size`, and `last_modified`.

These two data streams can be used in combination to do deeper analysis on the file structure of a table. They can answer questions like: How much of the storage space is taken up by untracked files? When were untracked files created? Which files are taking up the most space? How big is version X?
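A rough sketch of the kind of analysis these two streams enable, using plain dicts as stand-ins for the Arrow batches; all paths and sizes are made up.

```python
# Hypothetical rows as they might come out of tracked_files() / all_files().
tracked = [
    {"version": 1, "path": "data/a.lance"},
    {"version": 2, "path": "data/a.lance"},  # same file, two versions
    {"version": 2, "path": "data/b.lance"},
]
all_files = [
    {"path": "data/a.lance", "file_size": 100},
    {"path": "data/b.lance", "file_size": 250},
    {"path": "data/orphan.lance", "file_size": 999},  # not in any manifest
]

# "How much storage is taken up by untracked files?"
tracked_paths = {row["path"] for row in tracked}
untracked = [f for f in all_files if f["path"] not in tracked_paths]
untracked_bytes = sum(f["file_size"] for f in untracked)

# "How big is version 2?" -- join tracked rows for one version against sizes.
sizes = {f["path"]: f["file_size"] for f in all_files}
version_2_bytes = sum(
    sizes[row["path"]] for row in tracked if row["version"] == 2
)
```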