Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Error Type Accuracy in TaskFailedException FailureDetails with .NET-Isolated app #3070

Merged
merged 5 commits into from
Mar 25, 2025

Conversation

nytian
Copy link
Collaborator

@nytian nytian commented Mar 19, 2025

Issue describing the changes in this PR

partialy solves #2689

This PR ensures that when orchestrators catch exceptions, the FailureDetails.ErrorType will match the type of the original exception thrown by the activity or sub-orchestration function.

For example, if an activity function throws an InvalidOperationException, the FailureDetails.ErrorType in the resulting TaskFailedException should be "InvalidOperationException" .

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
    • Otherwise: That work is being tracked here: #issue_or_pr_in_each_sdk
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package
    • Otherwise: major or minor version updates are reflected in /src/Worker.Extensions.DurableTask/AssemblyInfo.cs
  • My changes do not add EventIds to our EventSource logs
    • Otherwise: Ensure the EventIds are within the supported range in our existing Windows infrastructure. You may validate this with a deployed app's telemetry. You may also extend the range by completing a PR such as this one.
  • My changes should be added to v2.x branch.
    • Otherwise: This change applies exclusively to WebJobs.Extensions.DurableTask v3.x. It will be retained only in the dev and main branches and will not be merged into the v2.x branch.

@nytian nytian requested review from cgillum and andystaples March 19, 2025 17:14
Copy link
Contributor

@andystaples andystaples left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments

@@ -11,14 +11,20 @@ internal class DurableSerializationException : Exception
// We set the base class properties of this exception to the same as the parent,
// so that methods in the worker after this can still (typically) access the same information vs w/o
// this exception type.
internal DurableSerializationException(Exception fromException) : base(fromException.Message, fromException.InnerException)
internal DurableSerializationException(Exception fromException) : base(CreateExceptionMessage(fromException), fromException.InnerException)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to understand why this change was made - the reason that I only did the JSON serialization in ToString() originally was because I didn't want to alter the actual properties of the exception - before this change, if the functions worker needed to access the properties of the user code exception, they were functionally unmodified, and it wasn't until the exception was converted to a string for gRPC transmission that the serialization happened.
Can you help me understand what was the negative effect of this decision and what does serializing the message property improve?

Copy link
Collaborator Author

@nytian nytian Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Customers want more detailed information about exceptions within the FailureDetails of a TaskFailedEvent, including the specific error type. Currently, this detail is missing, and thus cx will always get unknown type inside FailureDetails.

Why this is missing, is because we only send the exception message at DurableSerializationException. For some reason, the ToString() is only called by grpc log, while the grpc worker only pass exception message and inner Failure. (Check class here)

Thus, when DurableSerializationException went to the OutOfProcMiddleware GetFailureDetails, we will fail for TryExtractSerializedFailureDetailsFromException and TrySplitExceptionTypeFromMessage, and cause the type becomes unknown.

I don't believe I've altered the original user exception; I'm only updating DurableSerializationException, which I assume is used for passing errors via gRPC, if I understand your question correctly. My main concern is that serializing the entire failure details might be excessive and redundant—for example, the error trace and inner exception will be included in the message while also being part of the full exception. However, since I suppose TryExtractSerializedFailureDetailsFromException should be used, I'll leave it as is.

Copy link
Contributor

@andystaples andystaples Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood.
I think what I was missing was this:

For some reason, the ToString() is only called by grpc log, while the grpc worker only pass exception message and inner Failure

Very surprising, TBH.

When I was talking about the "properties of the user code exception" I meant the properties of the exception exposed to the ooproc worker - before my change that added DurableSerializationException, the native user code exception type was thrown all the way up until the gRPC logging step. I was trying to be extra careful not to change behavior of the worker, especially for "special case" exception types like OutOfMemoryException.

Anyway, based on the test case (which looks great, btw) and your own experimentation, I'm good to approve this PR. Thanks for investigating this!

@nytian nytian requested a review from andystaples March 19, 2025 22:29
Copy link
Contributor

@andystaples andystaples left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

@@ -11,14 +11,20 @@ internal class DurableSerializationException : Exception
// We set the base class properties of this exception to the same as the parent,
// so that methods in the worker after this can still (typically) access the same information vs w/o
// this exception type.
internal DurableSerializationException(Exception fromException) : base(fromException.Message, fromException.InnerException)
internal DurableSerializationException(Exception fromException) : base(CreateExceptionMessage(fromException), fromException.InnerException)
Copy link
Contributor

@andystaples andystaples Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood.
I think what I was missing was this:

For some reason, the ToString() is only called by grpc log, while the grpc worker only pass exception message and inner Failure

Very surprising, TBH.

When I was talking about the "properties of the user code exception" I meant the properties of the exception exposed to the ooproc worker - before my change that added DurableSerializationException, the native user code exception type was thrown all the way up until the gRPC logging step. I was trying to be extra careful not to change behavior of the worker, especially for "special case" exception types like OutOfMemoryException.

Anyway, based on the test case (which looks great, btw) and your own experimentation, I'm good to approve this PR. Thanks for investigating this!

@andystaples
Copy link
Contributor

BTW - I see the failure in mssql E2E tests - NBD for me. I'll need to investigate this at some point, but definitely not related to these changes.

Copy link
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -64,6 +64,7 @@
<PackageVersion Include="Microsoft.CodeAnalysis" Version="3.9.0" />
<PackageVersion Include="Microsoft.CodeAnalysis.Workspaces.MSBuild" Version="3.9.0" />
<PackageVersion Include="Microsoft.Diagnostics.Tracing.TraceEvent" Version="2.0.65" />
<PackageVersion Include="Microsoft.DurableTask.Abstractions" Version="1.8.0" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 1.8.1, which is the latest version?

@nytian nytian merged commit 21153b4 into dev Mar 25, 2025
14 of 15 checks passed
@nytian nytian deleted the nytian/isolated-activity-exception-type branch March 25, 2025 01:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants