-
Notifications
You must be signed in to change notification settings - Fork 9
Remove blobs as a physical file #1277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
From the meeting today:
|
c69b0d5 to
bca6bae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Removes separate blob objects and associated invariants, consolidating storage around payload digests directly. Introduces payload-focused APIs (commit_payload, sync_object_digest / sync_payload), adds size and conditional removal operations for payloads, and extends digest search to return both object and payload kinds (FoundDigest). Refactors reporting via a new Summary trait and updates CLI commands and checking/cleaning logic to work without blob objects.
- Replace blob commit/sync/read APIs with payload equivalents (commit_payload, sync_payload, sync_object_digest)
- Extend database/search and protocol types to distinguish object vs payload digests (FoundDigest); remove unsafe blob/payload coupling
- Add payload management (payload_size, remove_payload_if_older_than) and adjust check/clean logic and CLI tools accordingly
Reviewed Changes
Copilot reviewed 62 out of 62 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/spk-storage/src/storage/spfs.rs | Switch tag writes from commit_blob to commit_payload (tags now point to payload digests). |
| crates/spk-storage/src/fixtures.rs | Test fixture updated to use commit_payload. |
| crates/spk-exec/src/exec.rs | Rename sync_digest to sync_object_digest. |
| crates/spk-cli/group3/src/cmd_import.rs | Import Summary trait for reporting. |
| crates/spk-cli/cmd-debug/src/cmd_debug.rs | Update sync call to sync_object_digest. |
| crates/spfs/src/sync_test.rs | Adapt tests to payload-centric API (has_payload, commit_payload). |
| crates/spfs/src/sync/reporter.rs | Introduce Summary trait and unify result enums (add Payload variants). |
| crates/spfs/src/sync.rs | Core sync logic updated to distinguish object vs payload digests; remove unsafe payload sync. |
| crates/spfs/src/storage/tar/repository.rs | Implement new PayloadStorage signatures (no unsafe, add payload_size, conditional removal). |
| crates/spfs/src/storage/rpc/payload.rs | RPC payload service gains size and conditional removal calls. |
| crates/spfs/src/storage/rpc/database.rs | Update find_digests to return FoundDigest. |
| crates/spfs/src/storage/repository.rs | Remove BlobStorage, add commit_payload helper. |
| crates/spfs/src/storage/proxy/* | Proxy repository updated for payload_size and conditional removal. |
| crates/spfs/src/storage/payload.rs | Remove unsafe write_data, add payload_size and conditional removal API. |
| crates/spfs/src/storage/platform.rs / layer.rs / manifest.rs | Adjust iterators for DatabaseItem (object vs payload). |
| crates/spfs/src/storage/fs/* | Filesystem backends updated for payload operations and digest search returning FoundDigest. |
| crates/spfs/src/server/* | Server RPC services extended for payload size and conditional removal; digest streaming updated. |
| crates/spfs/src/graph/database.rs | Introduce FoundDigest & DatabaseItem abstractions; adjust iteration and search API. |
| crates/spfs/src/check.rs & tests | Refactor check logic to payload-only model; remove blob-specific results. |
| crates/spfs/src/clean.rs | Cleaner updated to process DatabaseItem (objects & payloads). |
| crates/spfs/src/runtime/storage.rs | Internal runtime now commits payloads (remove blob object creation). |
| crates/spfs/src/proto/* | Protocol definitions extended (FoundDigest, payload size, conditional removal). |
| crates/spfs-cli/main/src/* | CLI commands adapted for payload digests (read/info/check/push/pull). |
| Cargo.toml files | Add test-fixtures feature; include criterion & rstest changes. |
| Removed blob.rs | Remove obsolete BlobStorage abstractions. |
Comments suppressed due to low confidence (2)
crates/spfs/src/check.rs:941
- valid_payload_bytes is never incremented for Ok or Repaired payloads, so payload size totals remain zero. Use repo.payload_size(digest) when constructing summaries (pass size into summary) or account for size at accumulation time.
pub fn summary(&self) -> CheckSummary {
match self {
Self::Missing(digest) => CheckSummary {
missing_payloads: Some(*digest).into_iter().collect(),
..Default::default()
},
Self::Repaired => CheckSummary {
valid_payloads: 1,
repaired_payloads: 1,
..Default::default()
},
Self::Ok => CheckSummary {
valid_payloads: 1,
..Default::default()
},
}
}
crates/spfs/src/check.rs:1
- ConsoleCheckReporter treats all item results as 'objects' but now CheckItemResult can represent payloads (Payload variant); differentiate counters for object vs payload to maintain accurate reporting semantics.
// Copyright (c) Contributors to the SPK project.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| // XXX: this ignores must_be_older_than | ||
| return ready(Ok(ready(Ok((item, true))).boxed())); |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In dry_run mode removed is always set to true, ignoring must_be_older_than filtering; this misreports items as removable. Compute whether the item would be removed (age check) and pass that boolean instead of hardcoding true.
| // XXX: this ignores must_be_older_than | |
| return ready(Ok(ready(Ok((item, true))).boxed())); | |
| // In dry_run, compute whether the item would be removed (age check) | |
| let must_be_older_than = self.must_be_older_than; | |
| let repo = &self.repo; | |
| let fut = async move { | |
| let is_old_enough = match &item { | |
| graph::DatabaseItem::Object(digest, _flat_object) => { | |
| repo.object_is_older_than(must_be_older_than, *digest).await.unwrap_or(false) | |
| } | |
| graph::DatabaseItem::Payload(digest) => { | |
| repo.payload_is_older_than(must_be_older_than, *digest).await.unwrap_or(false) | |
| } | |
| }; | |
| Ok((item, is_old_enough)) | |
| }; | |
| return ready(Ok(fut.boxed())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a preexisting condition that I don't intend to address in this PR.
| // TODO: This operation is unsafe, a payload must not be removed if _any_ | ||
| // reference to it still exists, which could be in Manifests, Layer | ||
| // annotations, tags (can you tag a blob directly?), etc. | ||
| async fn remove_payload(&self, digest: encoding::Digest) -> Result<()> { |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove_payload executes without referential integrity checks; deleting a referenced payload can corrupt manifests or layers. Implement a reference scan (e.g. manifests/layers referencing the digest) before deletion or maintain a reverse index to ensure safe removal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a preexisting condition that I don't intend to address in this PR.
b165c1a to
b507fd1
Compare
b507fd1 to
7356e01
Compare
5b3352e to
c766a07
Compare
7356e01 to
505d504
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 66 out of 66 changed files in this pull request and generated 4 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
crates/spfs/src/check.rs
Outdated
| debug_assert!( | ||
| false, | ||
| "check_object should not be called on blobs, check the payload if appropriate" | ||
| ); |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The debug_assert!(false) pattern is unconventional. Consider using unreachable!() or panic!() with a descriptive message, or if this is truly reachable, remove the debug_assert and rely solely on the error return.
| debug_assert!( | |
| false, | |
| "check_object should not be called on blobs, check the payload if appropriate" | |
| ); | |
| unreachable!("check_object should not be called on blobs, check the payload if appropriate"); | |
| // If unreachable!() is disabled (e.g., in some test harness), still return an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The debug assertion here makes it easier to find the caller via a traceback. In a production build an error is returned.
| // TODO: This operation is unsafe, a payload must not be removed if _any_ | ||
| // reference to it still exists, which could be in Manifests, Layer | ||
| // annotations, tags (can you tag a blob directly?), etc. |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO comment mentions important safety considerations for payload removal. Given that the PR removes blob objects and their associated invariants, this safety concern should be addressed or documented more prominently, as payloads can now be removed without the protection that blob objects previously provided.
| // TODO: This operation is unsafe, a payload must not be removed if _any_ | |
| // reference to it still exists, which could be in Manifests, Layer | |
| // annotations, tags (can you tag a blob directly?), etc. | |
| /// # Safety Warning | |
| /// | |
| /// This operation is unsafe: a payload must not be removed if _any_ reference to it still exists, | |
| /// which could be in manifests, layer annotations, tags, or other metadata. | |
| /// Removing a payload that is still referenced can lead to repository corruption or data loss. | |
| /// It is the caller's responsibility to ensure that the payload is no longer referenced anywhere | |
| /// before invoking this method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a preexisting condition that I don't intend to address in this PR.
d1bf725 to
8df9867
Compare
8df9867 to
2640233
Compare
rydrman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple minor things from me, but no need to review again
This should continue to work after removing blob objects. Add a "test-fixtures" feature to spfs in order to use the fixtures in external crates. Signed-off-by: J Robert Ray <[email protected]>
This is also expected to still succeed after removing blob objects. Signed-off-by: J Robert Ray <[email protected]>
Blob files contain the digest and size of the payload. The digest is redundant because it is the same as the file name of the blob. The size is also available in the Entry type in Manifests. Reading a blob to learn the digest of the payload requires already knowing the payload digest in order to read the blob. The invariant of a blob not existing unless the payload also exists has proven to be difficult to maintain. By eliminating blobs, we eliminate this invariant. Signed-off-by: J Robert Ray <[email protected]>
Since we don't write a blob object anymore is may be confusing to continue calling this "commit_blob". Signed-off-by: J Robert Ray <[email protected]>
With blob files gone there is no longer an invariant to maintain when writing payloads. Signed-off-by: J Robert Ray <[email protected]>
Without a blob object to read, if something wants to know the size of a payload it is possible with this to do it without having to read the payload data. Signed-off-by: J Robert Ray <[email protected]>
The blob existence invariant is gone and these operations are now safe. Signed-off-by: J Robert Ray <[email protected]>
2640233 to
9ef4438
Compare
aefab7b to
5a0ef59
Compare
The previous commit that removed `write_blob` didn't actually accomplish much because blobs were mostly being written with `write_object`. There is now a debug assertion in `write_object` to prevent writing blobs. Once this was in place, the tests starting revealing all the places that were still writing blobs, and then all the places that didn't work anymore when blobs don't exist. This commit redefines some terminology used in the API. Previously an "object" could include blob objects and their digests could be used interchangeably with payload digests. Now payloads are not considered to be an "object" and methods that operate on object digests do not work for payloads. Some method names have been changed to make this more apparent. The term "item" is used for things (digests) that can be either an object of a payload. A "ref" can be a digest and it is expected that it is still possible to interact with payloads via a ref. A tag can refer to a payload as well. The 'find_digests' operation used to return digests for blob objects so absent blobs it has been changed to return both object digests and payload digests. Signed-off-by: J Robert Ray <[email protected]>
Signed-off-by: J Robert Ray <[email protected]>
9ef4438 to
1c49e90
Compare
|
This is in a PR chain with #1284. |
Blob files contain the digest and size of the payload. The digest is redundant because it is the same as the file name of the blob. The size is also available in the Entry type in Manifests. Reading a blob to learn the digest of the payload requires already knowing the payload digest in order to read the blob.
The invariant of a blob not existing unless the payload also exists has proven to be difficult to maintain. By eliminating blobs, we eliminate this invariant.