Skip to content

.NET: Fix issue with resuming checkpoint after package version upgrade#6636

Open
peibekwe wants to merge 1 commit into
mainfrom
peibekwe/workflow-version-resume
Open

.NET: Fix issue with resuming checkpoint after package version upgrade#6636
peibekwe wants to merge 1 commit into
mainfrom
peibekwe/workflow-version-resume

Conversation

@peibekwe

Copy link
Copy Markdown
Contributor

Motivation & Context

When you upgrade the  Microsoft.Agents.AI.Workflows package and redeploy, any saved workflow checkpoint from the previous version becomes unusable. The framework refuses to load it with a generic "checkpoint is not compatible" error, even though nothing about the workflow actually changed; only the package's version number did.

Hosts that persist checkpoints across deploys lose all in-flight work on every upgrade.

Description & Review Guide

With this fix, the framework no longer cares about version, culture, or strong-name signing info when matching types. It compares the type's name and the simple assembly's name (like Microsoft.Agents.AI.Workflows), and ignores everything else.

This fix is applied in four places that all had the same underlying bug:

  1. Checkpoint validation. The "is this checkpoint compatible with this workflow?" check at restore time.
  2. Type-equality lookups. Lots of internal dictionaries and sets are keyed by these type identifiers - message routing, executor-protocol checks, value identity. All of them now match across versions too.
  3. Generic types. Things like List<ChatMessage> embed the version of  Microsoft.Extensions.AI.Abstractions  inside the type name itself. We strip that version info before comparing, so an upgrade of that dependency doesn't break things either.
  4. Envelope type lookup. A separate spot that resolves a stored type name back into a Type object (for unwrapping declarative external requests) had its own copy of the same bug. Fixed and made consistent with the other path.

Fixes #6466

Contribution Checklist

  • The code builds clean without any errors or warnings
  • All unit tests pass, and I have added new tests where possible
  • The PR follows the Contribution Guidelines
  • This PR is linked to an issue and there is no other open PR for this issue (see Related Issue above).
  • This is not a breaking change. If it is a breaking change, add the breaking change label (or add "[BREAKING]" to the title prefix, before or after any language prefix) — a workflow keeps the label and title prefix in sync automatically.

@peibekwe peibekwe self-assigned this Jun 19, 2026
Copilot AI review requested due to automatic review settings June 19, 2026 20:37
@moonbox3 moonbox3 added .NET Issues related to the .NET codebase workflows Related to Workflows in agent-framework labels Jun 19, 2026
@github-actions github-actions Bot changed the title Fix issue with resuming checkpoint after package version upgrade .NET: Fix issue with resuming checkpoint after package version upgrade Jun 19, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses workflow checkpoint restore failures across SDK/package upgrades by making type identity comparisons version-tolerant (simple assembly name + type full name) and by adding a lenient TypeId → Type resolution path that ignores assembly version/culture/strong-name details. This improves checkpoint portability across redeploys where only assembly versions change.

Changes:

  • Update TypeId equality, hashing, and matching to ignore version/culture/public key token (including those embedded in generic argument type names).
  • Update WorkflowSession envelope type resolution to bind types by simple assembly name (version-tolerant) and cache results.
  • Add unit tests covering TypeId version tolerance, lenient type resolution, and end-to-end checkpoint resume after version rewrites.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/TypeId.cs Implements version-tolerant TypeId matching and normalization of generic argument assembly qualifiers.
dotnet/src/Microsoft.Agents.AI.Workflows/WorkflowSession.cs Switches envelope type lookup to a lenient resolver and introduces a cache for resolved types.
dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/CheckpointVersionToleranceTests.cs Adds an end-to-end test validating checkpoint restore after mutating persisted Version= strings.
dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/TypeIdVersionToleranceTests.cs Adds focused unit tests for TypeId equality/matching/hash semantics across version changes.
dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/WorkflowSessionResolveTypeLenientTests.cs Adds tests verifying ResolveTypeLenient behavior for version-mutated and generic type ids.

Comment thread dotnet/src/Microsoft.Agents.AI.Workflows/WorkflowSession.cs

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 5 | Confidence: 90%

✓ Correctness

This PR correctly fixes checkpoint compatibility across package version upgrades. The TypeId equality/hashing changes are consistent, the regex for stripping version qualifiers from generic type arguments is correct, the GetSimpleAssemblyName fallback is robust, and ResolveTypeLenient correctly leverages partial-name binding. No correctness issues found.

✓ Security Reliability

This PR relaxes TypeId matching to ignore assembly version, culture, and public key token, allowing checkpoint resumption after package upgrades. The implementation is sound: the regex pattern is linear (no ReDoS risk), the lazy field initialization is thread-safe for deterministic reference-type computations, Type.GetType results are gated by the existing IExternalRequestEnvelope assignability check, and the exception filter in GetSimpleAssemblyName covers the documented throws from AssemblyName parsing. The static ConcurrentDictionary cache in ResolveTypeLenient grows unbounded but is keyed by TypeId equality (simple name + normalized type), which naturally limits cardinality to the set of distinct envelope types in the application. No high-severity security or reliability issues found.

✓ Test Coverage

The test coverage for this PR is comprehensive. Three new test classes cover the four fix areas: TypeIdVersionToleranceTests covers equality, hashing, IsMatch, NormalizeTypeName, and dictionary/set lookup across versions (including generic arguments); WorkflowSessionResolveTypeLenientTests covers the new ResolveTypeLenient method with version mutations and unknown types; CheckpointVersionToleranceTests provides a full integration test verifying end-to-end checkpoint resume after version rewriting. The tests use meaningful assertions (not just 'no exception') and verify both positive and negative cases. The only minor gap is the lack of a direct unit test for GetSimpleAssemblyName's edge cases, though it is thoroughly exercised indirectly.

✓ Failure Modes

The PR correctly implements version-tolerant type matching for checkpoint resumption. The TypeId equality/hash changes are internally consistent (SimpleAssemblyName for hashing, normalized type name comparison). The regex-based normalization handles standard .NET assembly-qualified name formats. GetSimpleAssemblyName has a robust fallback for malformed inputs. The static ConcurrentDictionary cache in ResolveTypeLenient may permanently cache null for unresolvable types, but the caller already handles null gracefully. No concrete silent-failure or data-loss paths identified.

✗ Design Approach

I found one design gap in the new version-tolerance approach: the implementation only normalizes embedded generic assembly qualifiers when they appear as the exact Version + Culture + PublicKeyToken triplet in that order. That leaves the change narrower than its stated contract and means some persisted TypeId shapes will still fail equality/matching/type resolution even though the PR says version, culture, and signing metadata are all ignored.

Flagged Issues

  • TypeId.NormalizeTypeName only strips the exact , Version=..., Culture=..., PublicKeyToken=... sequence, so embedded generic type names that carry only one of those qualifiers or list them in a different order are still treated as different types. Because TypeId.Equals/GetHashCode depend on NormalizedTypeName (TypeId.cs:69-89) and WorkflowSession.ResolveTypeLenient also uses it (WorkflowSession.cs:333-335), the new compatibility story is incomplete for those inputs.

Automated review by peibekwe's agents

@github-actions

Copy link
Copy Markdown
Contributor

Flagged issue

TypeId.NormalizeTypeName only strips the exact , Version=..., Culture=..., PublicKeyToken=... sequence, so embedded generic type names that carry only one of those qualifiers or list them in a different order are still treated as different types. Because TypeId.Equals/GetHashCode depend on NormalizedTypeName (TypeId.cs:69-89) and WorkflowSession.ResolveTypeLenient also uses it (WorkflowSession.cs:333-335), the new compatibility story is incomplete for those inputs.


Source: automated DevFlow PR review

@peibekwe

Copy link
Copy Markdown
Contributor Author

Flagged issue

TypeId.NormalizeTypeName only strips the exact , Version=..., Culture=..., PublicKeyToken=... sequence, so embedded generic type names that carry only one of those qualifiers or list them in a different order are still treated as different types. Because TypeId.Equals/GetHashCode depend on NormalizedTypeName (TypeId.cs:69-89) and WorkflowSession.ResolveTypeLenient also uses it (WorkflowSession.cs:333-335), the new compatibility story is incomplete for those inputs.

Source: automated DevFlow PR review

The CLR maintains strict rule and always produces the format: [Name], Version=[V], Culture=[C], PublicKeyToken=[T]

@peibekwe peibekwe marked this pull request as ready for review June 19, 2026 21:13

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 5 | Confidence: 91% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Failure Modes, Design Approach


Automated review by peibekwe's agents

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

.NET Issues related to the .NET codebase workflows Related to Workflows in agent-framework

Projects

None yet

Development

Successfully merging this pull request may close these issues.

.NET: [Bug]: Workflow checkpoints are not restorable across SDK upgrades — TypeId uses Assembly.FullName (incl. version) for executor type matching

4 participants