Refactor BlobManager tracking, support getPendingBlobs and abort for payload pending blobs #25642

ChumpChief · 2025-10-08T01:29:11Z

This PR is a major refactor of BlobManager's tracking of pending blobs. This results in unified tracking between legacy/payloadPending flows. Overall it's simpler and more airtight, e.g. fixing a bug that we were resubmitting BlobAttach ops for aborted blobs. Abort support is added for payloadPending blobs. getPendingBlobs() is now fully supported. Loading with pending blobs for read-only purposes is now supported (e.g. for frozen container scenarios), as well as then proceeding to resume sharing those blobs via the new sharePendingBlobs() method (e.g. for restoring after a tab close).

I've added lots of comments in the code that should explain everything, but at a high level the meat of it is uploadAndAttach() and the localBlobCache structure. It is called from both legacy/payloadPending flows, and in both normal creation and sharePendingBlobs() flows. It has two subroutines ensureUploaded and tryAttach which advance pending blobs' state in a semi-linear fashion (progress can move backwards in the case of TTL expiration). Errors and aborts throw out of the function to either be handled directly by the customer in legacy flows, or to update BlobHandle state in payloadPending flows. It leans heavily on internal eventing to detect external state changes that affect the create flow, so that the main path can be simple and well-contained in uploadAndAttach().

This change updates the already-enabled tests to reflect these changes and add coverage for abort, but does not add coverage for the new getPendingBlobs scenarios. Open to opinions on whether we think that additional coverage should be a prerequisite for checkin or not (since it is still an experimental feature).

…rypoint

Copilot

Pull Request Overview

This PR refactors BlobManager's pending blob tracking system to unify the handling of legacy and payload-pending blob creation flows. The refactoring introduces a state-machine approach for tracking blobs through their lifecycle from creation to attachment, enabling better support for features like getPendingBlobs and abort functionality across both flows.

Replaces the pendingBlobs map with localBlobCache using explicit state tracking (localOnly, uploading, uploaded, attaching, attached)
Consolidates blob upload/attach logic into a single uploadAndAttach() method with retry capabilities for TTL expiration
Adds support for abort signals in payload-pending flows and fixes a bug where aborted blobs would still be resubmitted

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`blobManager.ts`	Core refactoring with new state machine and unified `uploadAndAttach()` method
`containerRuntime.ts`	Updates parameter name from `stashedBlobs` to `pendingBlobs`
`blobManager.spec.ts`	Updates abort tests to work with both legacy and payload-pending flows
`blobHandles.spec.ts`	Parameter name update and test behavior adjustment for local blob caching
`getPendingBlobs.spec.ts`	Implements actual test assertions for `getPendingBlobs` functionality
`blobTestUtils.ts`	Parameter name update in test utilities

packages/runtime/container-runtime/src/blobManager/blobManager.ts

dannimad · 2025-10-08T16:30:34Z

packages/runtime/container-runtime/src/blobManager/blobManager.ts

+					localBlobRecord.state === "attaching",
+				"Pending blob must be in uploading, uploaded, or attaching state",
+			);
+			pendingBlobs[localId] =


what happens with attaching blobs?

Blobs in "uploaded" or "attaching" state are treated the same - we only trust that we have uploaded it to storage and make no assumption about whether a BlobAttach op was sent for them (we're just ready to handle a BlobAttach ack if it shows up). I'll add a code comment here too.

dannimad · 2025-10-08T16:34:28Z

packages/runtime/container-runtime/src/blobManager/blobManager.ts

+	 * @returns A promise that resolves when all the uploads and attaches have completed, or rejects
+	 * if any of them fail.
+	 */
+	public readonly sharePendingBlobs = async (): Promise<void> => {


this is not being used anywhere?

Correct - this is new functionality that doesn't yet have an entrypoint. It should be conditionally called only for non-frozen containers (not sure if we have a way to distinguish these from frozen containers currently, but that will be required before we can call it safely). There's one test that calls it, and I'll be adding more tests for it in my next change.

anthony-murphy · 2025-10-08T22:56:06Z

packages/runtime/container-runtime/src/blobManager/blobManager.ts

+/**
+ * Blobs that were created locally are tracked, and may be in one of these states. When first
+ * created, they are in localOnly state. The process of sharing has two steps, upload and attach.
+ * Progress through the stages may regress back to localOnly if we determine the storage may have
+ * deleted the blob before we could finish attaching it.


i'm confused by this, don't we attach first, and then upload in the payload pending case?

There are two things called "attach" here - attaching the handle, and the BlobAttach op. In pretty much every case here I'll be referring to the blob attach rather than the handle attach.

it seems like you don't want to change the terminology. if you decided to keep calling the op lifecycle attach can you can at least document it, as i think it will be confusing for anyone used to how attach is used in other contexts

Oh I'd love to change it lol - unfortunately I didn't name the BlobAttach op or SummaryType.Attachment and those are persisted. So I decided to go along with the existing naming rather than go on a renaming campaign. Happy to make some changes here though if you like.

In the meantime I'll add some more documentation.

Looks like maybe we've never loved the name "attachment":
#3974 (review)

Other early (~5 years ago) PRs have comments that sound confused between the concept of handle attachment vs. sending the BlobAttach op. So agree would be nice to rename if we get a chance.

anthony-murphy · 2025-10-08T23:07:16Z

packages/runtime/container-runtime/src/blobManager/blobManager.ts

+					...serializableBlobRecord,
+					blob: stringToBuffer(serializableBlobRecord.blob, "base64"),
+				};
+				this.localBlobCache.set(localId, localBlobRecord);


don't these need to go into pendingBlobsWithAttachedHandles? it's not really clear to me what you mean by attach. attach means the object is part of the persisted object graph via a stored reference to its handle. for payload pending i wouldn't consider the blob and object to have separate attach states, both are attached when the handle is stored, the object just might not be available yet

Good call, yes they need to go into pendingBlobsWithAttachedHandles to round trip them back out to pending state again if we call getPendingBlobs() before the pending blobs finish sharing. Adding in next.

Both legacy and payloadPending still have to complete the BlobAttach op, which happens separately from handle attach. So they are handle-attached when the handle is stored, and it's guaranteed that the legacy flow will also be blob-attached at that time, but payloadPending blobs will not be blob-attached at that time.

can you add a test for this case?

Yeah my next PR will be fleshing out all the getPendingBlobs tests

anthony-murphy · 2025-10-08T23:21:26Z

packages/runtime/container-runtime/src/blobManager/blobManager.ts

-		if (this.pendingBlobs.has(localId)) {
-			this.sendBlobAttachOp(localId, storageId);
+		const { localId, blobId: remoteId } = metadata;
+		// Any blob that we're actively trying to advance to attached state must be in attaching state.


i might be starting to understand, are you calling the blob op round trip attach? that feels pretty off from how its normally used. the blob op is really just sent to the server know about the blob, and so the summarizer references it in the next summary.

this is even more confusing for blob that are not payload pending, as just uploading and sending the op don't attach them to that object graph, only storing the handle does.

That's right re: terminology, but the BlobAttach op has more responsibilities than that - it's how remote clients learn about the existence of the blob and add it to their redirectTables.

Its ack is also when we know the TTL is no longer a threat and can exit monitoring/retry loop.

Handle attach is actually relatively unimportant (it just determines when things kick off, and which pending blobs go into getPendingBlobs()), whereas blob attach is super central the the main function of BlobManager.

anthony-murphy · 2025-10-08T23:49:09Z

packages/runtime/container-runtime/src/blobManager/blobManager.ts

-				};
-				this.pendingBlobs.set(localId, pendingEntry);
+				this.pendingBlobsWithAttachedHandles.add(localId);
+				const uploadAndAttachP = this.uploadAndAttach(localId, signal);


i don't think we should support aborting a blob upload after the handle is attached

this is especially true as the abort is kind of a lie anyway, we can't stop the upload. we only added abort for the case where closeAndGetPendingState was call, such that the app could signal which blobs not to wait for attachment, with payload pending that's not a problem.

I don't think it really hurts anything, as presumably anyone who would want to call abort is also deleting the handle from wherever they stored it. It's deterministic behavior that a customer can reason about, whereas "abort sometimes doesn't work when I call it" would just lead to mystery bugs on the customer side.

Really we should just add driver-level support for abort so we can actually abort the upload, which would simplify the logic here as well.

the possible harm is someone reusing an abort controller, and aborting after the handle is returned, and not realizing it also invalidates the handle. usually abort signals only abort things under the context of the methods they are passed to. whereas in this case the method will return, but the abort signal continues to be bound to internal processes that consumer isn't and shouldn't be aware of

The consumer is already aware of the handle's failed share status via payloadShareError and the relevant events. Aborting the signal updates that and emits in the same predictable way as for any other reason the share might fail.

ChumpChief added 23 commits October 7, 2025 16:40

Port and update prototype

1a3eb1a

Restore a couple things

84f4177

Restore storageId identity entry

169590b

Ensure we try to hit storage for readBlob error test case

f801c84

Drop extra props from uploadBlob abort error

f322563

Remove stale comments and unhelpful satisfies

f178d89

Support TTL, make upload/attach flow linear regardless of failure ent…

47b4ebd

…rypoint

Support abort

e01d120

Better promise/error handling

349ef35

Remove extra abort check

e2af35a

Support calling with uploaded blob

dc36218

Unify upload/attach flow

c9115f2

Reorganize using subroutines

37bf9c3

Rearrange abort checking, add comments

f60f11c

Pending handle cleanup, initial getPendingBlobs, add TODOs

1ba4174

Tiny rename

23757aa

Support attach ops arriving at surprising times

2aa7674

Fix some TODOs

1374bfa

Add sharePendingBlobs

1b63b94

Get first test working

1c7eae1

Bunch of comments, fix up abort flows

a226ff2

More comments

b099f82

Enable abort tests for payload pending

1dc3a02

Copilot AI review requested due to automatic review settings October 8, 2025 01:29

github-actions bot added area: runtime Runtime related issues base: main PRs targeted against main branch labels Oct 8, 2025

Copilot AI reviewed Oct 8, 2025

View reviewed changes

ChumpChief requested review from anthony-murphy and dannimad October 8, 2025 16:09

Remove stale comment

fe56e90

Remove attachAndGetPendingBlobs()

57c94fe

dannimad reviewed Oct 8, 2025

View reviewed changes

Add comment about treating uploaded and attaching blobs the same

b9bc987

anthony-murphy reviewed Oct 8, 2025

View reviewed changes

Add incoming pending blobs to pendingBlobsWithAttachedHandles

7379161

anthony-murphy approved these changes Oct 9, 2025

View reviewed changes

ChumpChief added 2 commits October 9, 2025 10:19

More docs, remoteId -> storageId

37b9c9b

Merge remote-tracking branch 'upstream/main' into BlobManagerInternals

9e25282

ChumpChief enabled auto-merge (squash) October 9, 2025 18:11

anthony-murphy approved these changes Oct 9, 2025

View reviewed changes

ChumpChief merged commit 16306cf into microsoft:main Oct 9, 2025
33 checks passed

ChumpChief deleted the BlobManagerInternals branch October 9, 2025 23:24

Refactor BlobManager tracking, support getPendingBlobs and abort for payload pending blobs #25642

Refactor BlobManager tracking, support getPendingBlobs and abort for payload pending blobs #25642

Uh oh!

Conversation

ChumpChief commented Oct 8, 2025 • edited by azure-boards bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anthony-murphy Oct 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anthony-murphy Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anthony-murphy Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ChumpChief commented Oct 8, 2025 •

edited by azure-boards bot

Loading

anthony-murphy Oct 9, 2025 •

edited

Loading

anthony-murphy Oct 8, 2025 •

edited

Loading

anthony-murphy Oct 8, 2025 •

edited

Loading