Skip to content

feat(rust): support version archive#6049

Open
majin1102 wants to merge 8 commits intolance-format:mainfrom
majin1102:archive_versions
Open

feat(rust): support version archive#6049
majin1102 wants to merge 8 commits intolance-format:mainfrom
majin1102:archive_versions

Conversation

@majin1102
Copy link
Contributor

Desgin: #5998

@github-actions github-actions bot added the enhancement New feature or request label Feb 27, 2026
@github-actions
Copy link
Contributor

Review: feat(rust): support version archive

P0: inspection.clone() is an expensive full copy of all referenced file paths

In cleanup.rs, the run() method now clones the entire CleanupInspection before passing it to delete_unreferenced_files:

let stats = self.delete_unreferenced_files(inspection.clone()).await?;
...
self.archive_versions(&mut inspection).await?;

CleanupInspection contains referenced_files and verified_files, which hold HashSet<Path> for every data file, delete file, index, and transaction path in the dataset. For a large dataset with hundreds of thousands of files, this clone is a significant and unnecessary memory/CPU cost during cleanup.

Suggested fix: Extract the archive and old_manifests before consuming inspection. For example, wrap version_archive in Option<VersionArchive> and .take() it, or restructure to call archive_versions with the extracted data rather than the full inspection.

P1: read_transaction_by_version called for every manifest unconditionally

In process_manifest_file, the transaction is read for every manifest file before checking whether the version is newer than the archive's latest:

let transaction = self
    .dataset
    .read_transaction_by_version(manifest.version)
    .await
    .ok()
    .flatten();

let mut inspection = inspection.lock().unwrap();

if manifest.version > inspection.version_archive.latest_version_number {
    // ... use transaction
}

This adds an extra I/O round-trip per manifest during cleanup, even for versions already archived. Consider taking the lock briefly to check the version, then reading the transaction only when needed:

let needs_transaction = {
    let inspection = inspection.lock().unwrap();
    manifest.version > inspection.version_archive.latest_version_number
};

let transaction = if needs_transaction {
    self.dataset.read_transaction_by_version(manifest.version).await.ok().flatten()
} else {
    None
};

P1: is_cleaned_up is set redundantly in two places

is_cleaned_up is set on version_summaries in run(), then the summaries are added to the archive, and then archive_versions() iterates over version_archive.versions again to set the same flag. The logic works but is confusing - consider consolidating to one location (just archive_versions).

@codecov
Copy link

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 96.73025% with 24 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/archive.rs 96.39% 6 Missing and 11 partials ⚠️
rust/lance/src/dataset/cleanup.rs 97.84% 3 Missing and 2 partials ⚠️
rust/lance/src/dataset.rs 93.54% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant