Update Nexus error model to use Temporal failures #8799
bergundy left a comment:
I don't think I will have time to fully review this before I go on PTO but the overall direction looks great.
When you plan on marking this ready for review, I would suggest adding the following tests:
- Old server caller is compatible with a new server handler for start, cancel and callback requests
- Encoded attributes returned from the SDK are passed through
bergundy left a comment:
I spent about half an hour reviewing this and I still don't think I caught everything. Before we can merge this, we need:
- Definitions and tests for what happens at each boundary for new and old components on either side.
- End to end tests that cover the behavior with old SDKs.
- End to end tests that cover the behavior with a new SDK after we have an implementation for the new paths.
common/nexus/failure.go
Outdated
    }
    apiFailure.FailureInfo = &failurepb.Failure_ApplicationFailureInfo{
        ApplicationFailureInfo: &failurepb.ApplicationFailureInfo{
            // Make up a type here, it's not part of the Nexus Failure spec.
We have NexusSDKFailureErrorFailureInfo in the API PR.
NexusSDKFailureErrorFailureInfo doesn't have a Details field. We only get to this point if we get an unexpected error type, so I thought it was better to capture the full information.
common/nexus/failure.go
Outdated
    }

    - func OperationErrorToTemporalFailure(opErr *nexus.OperationError) (*failurepb.Failure, error) {
    + func OperationErrorToTemporalFailure(opErr *nexus.OperationError, retryable bool) (*failurepb.Failure, error) {
You shouldn't need the retryable flag here; operation errors are non-retryable by definition, and the resulting failure object should be a NexusOperationFailureInfo. This function doesn't seem necessary anymore: you should already have the original Nexus failure on the operation error, so all you need to do is convert it to a Temporal failure.
I also thought it would be unnecessary, but I had some trouble removing it, as this is the only place where we have the specific handling for CanceledFailureInfo.
common/nexus/nexusrpc/client.go
Outdated
    }
    return &nexus.HandlerError{
        Type: errorType,
        Message: response.Status,
This includes the HTTP status code; we need to trim that.
            RetryBehavior: retryBehavior,
        },
    },
    nf, err := nexus.DefaultFailureConverter().ErrorToFailure(handlerErr)
You should use the original failure here and then convert it to a temporal failure.
    require.Equal(t, enumsspb.NEXUS_OPERATION_STATE_BACKING_OFF, op.State())
    require.NotNil(t, op.LastAttemptFailure.GetNexusHandlerFailureInfo())
    - require.Equal(t, "handler error (INTERNAL): internal server error", op.LastAttemptFailure.Message)
    + require.Equal(t, "internal server error", op.LastAttemptFailure.Message)
Please also check the handler error type in all of these tests.
    var failureErr *nexus.FailureError
    var operationErr *nexus.OperationError
    switch {
    case errors.As(r.Error, &failureErr):
You should always get an operation error here.
service/frontend/nexus_handler.go
Outdated
    switch t := response.GetOutcome().(type) {
    case *matchingservice.DispatchNexusTaskResponse_Failure:
        oc.metricsHandler = oc.metricsHandler.WithTags(metrics.OutcomeTag("handler_error:" + t.Failure.GetNexusHandlerFailureInfo().GetType()))
        nf, err := commonnexus.APIFailureToNexusFailure(t.Failure)
You want to create a properly structured Nexus failure here, with the metadata type set to nexus.HandlerError or nexus.OperationError, and make sure the cause chain is populated correctly.
9a80fc7 to 699b0ff
**What changed?** Use Temporal failures for sending Nexus response errors.
**Why?** Consistency with other Temporal APIs and compatibility with encryption proxies.
**Breaking changes** Not explicitly, but server and SDK will need to be able to handle both error formats based on `capabilities` field.
**Server PR** temporalio/temporal#8799
Co-authored-by: Roey Berman <roey.berman@gmail.com>
Closed in favor of #9290
What changed?
Many changes to support the Temporal SDK sending Temporal Failures instead of errors.
Depends on temporalio/api#682 and nexus-rpc/sdk-go#69.
Why?
Consistency with other APIs and richer failure information.
How did you test it?