-
Notifications
You must be signed in to change notification settings - Fork 239
Implement Nexus operation cancellation types #1917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
const ( | ||
// Nexus operation cancellation type is unknown. | ||
NexusOperationCancellationTypeUnspecified NexusOperationCancellationType = iota |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably could just make this TryCancel
and not have an Unspecified
(this is not a proto enum, don't have to follow proto enum rules). But if we think we may ever need to differentiate between unset and explicitly-set default, can leave here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion here but I thought having an explicit unset value might be useful if we decide to add more types or change the default in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it's unlikely that we'll do either, but I could see a situation where we want to have a worker-level default someday or something and so knowing unspecified vs explicit try cancel works well there. So I'm ok leaving it as is.
internal/workflow.go
Outdated
// CancellationType - Indicates what action should be taken when the caller is cancelled. | ||
// | ||
// Optional: defaults to NexusOperationCancellationTypeTryCancel. | ||
CancellationType NexusOperationCancellationType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you confirm for current released SDK without this PR, the behavior is the same as TryCancel
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is TryCancel
unless the operation has started, then it is WaitCompleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, so are we changing default cancel behavior for post-started operations? If so, this may require a replay test or an SDK flag (our approach to safe behavior changes) or something unless I'm thinking about it wrong...
So say in the current-before-this-PR SDK I have a workflow that starts a Nexus operation, waits for it to start, issues a cancel, and then the operation eventually completed successfully (maybe several tasks later). And all of this is in history. Now take the same code and same history and run it through a replayer on this-PR SDK. Won't the Nexus operation now fail as canceled where it succeeded before and therefore it give a non-deterministic exception because the this-PR SDK code applies try-cancel post-start and the current-before-this-PR SDK code applies wait-completed post-start?
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I guess that might be an issue? The history that is generated should be the same, but the operation future would get unblocked sooner with a canceled error. As I noted in another comment, it looked to me like it was a bug that we were waiting for completion after the operation had started. I'm not 100% certain what the intended behavior was before this PR but I could change the default to WaitCompleted
which I think would cover that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the default has to be backwards compatible so it has to be WaitCompleted
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We lack history compatibility tests and probably should add them before we do any changes around here, but yes WaitCompleted
should be backwards compatible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes WaitCompleted should be backwards compatible
I was under the impression that WaitCompleted
means if it's canceled before start it still waits until completed which may succeed whereas current code before the PR will fail if you cancel before start.
Is it true the behavior today before this PR is "TryCancel-pre-start" and "WaitCompleted-post-start"? If so, that is not the same as "WaitCompleted-pre-and-post-start" there is no enum to represent that, so we have to add a flag or have a way to represent that I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WaitCompleted means if it's canceled before start it still waits until completed
Yes that is true, and that should be true today as well. The nexus operation can also be cancelled before the command was ever sent to the server, that will cancel the future instantly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, @pdoerner mentioned above that it was "TryCancel-pre-start" and "WaitCompleted-post-start", but you're saying it was "WaitCompleted" always? @pdoerner - can you confirm that it is not "TryCancel-pre-start" as mentioned at #1917 (comment)? Or maybe by pre-start you meant pre-command but not necessarily pre-nexus-start?
Regardless, yes if today's behavior pre-nexus-start (not to be confused with pre-command) and post-nexus-start is WaitCompleted
, we should change this default to be WaitCompleted
too. Concur with adding tests to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Quinn is correct. I was confusing pre-command with pre-start. I believe WaitCompleted should always be backwards compatible and I have added a replay test to confirm.
switch event.EventType { | ||
case enumspb.EVENT_TYPE_NEXUS_OPERATION_CANCEL_REQUEST_COMPLETED: | ||
attrs := event.GetNexusOperationCancelRequestCompletedEventAttributes() | ||
scheduledEventID = attrs.GetScheduledEventId() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this to be safe for use, it is imperative that temporalio/api#564 not be released before temporalio/api#572. If there is an API version that has the first and not the second, this code can have issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like unfortunately that did happen. I will update this PR to handle that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is the pain of frequently releasing things before we know how they will be used, we get into a situation where we now have 1.47.0 that we will have to support.
@pdoerner are you able to add replay tests before or along with this PR? Happy to help or provide instruction here as well |
Yes, I will add them. Aiming to have this PR ready for re-review later today or tomorrow. |
New functional tests I added will fail in CI (I tested them locally) until we have a newer CLI version that supports the new history events. To that effect I opened temporalio/cli#798 so we can either merge that and upgrade the version in the functional tests or I can disable the tests for now. |
internal/workflow.go
Outdated
const ( | ||
// Nexus operation cancellation type is unknown. | ||
// | ||
// Exposed as: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing link to where it is exposed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the doclink tool keeps removing them when I run it with -fix
because it thinks they are stale. not too sure what's going on there but it looks like it is blocking the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I will remove them for now to allow the tests to run but LMK how I should resolve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuandrew can you take a look, seems like a bug in the tool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this is by design, the doclink tool only maps type aliases, but in this case these are 2 separate definitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH we don't map these enum type values?
test/replaytests/workflows.go
Outdated
}, | ||
) | ||
|
||
func CancelNexusOperationWorkflow(ctx workflow.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to cover all possible previous state transitions we could have taken here basically:
- cancel before the operation start command is sent to the server
- cancel after the command is sent but before nexus operation is started
- cancel after the operation is started but before finished
- cancel after finished
We also need to make sure the workflow generates different commands on these different code paths so the SDK can detect the non determinism
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM though agree with @Quinn-With-Two-Ns about the other replay scenarios. I know it's a bit annoying to add these replay tests (have to generate history first on older server, capture it, etc). Thanks for sticking with it!
test/replaytests/replay_test.go
Outdated
replayer := worker.NewWorkflowReplayer() | ||
replayer.RegisterWorkflow(CancelNexusOperationWorkflow) | ||
err := replayer.ReplayWorkflowHistoryFromJSONFile(ilog.NewDefaultLogger(), "nexus-operation-cancelled.json") | ||
s.NoError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own understanding, can you confirm that if we had the default as TryCancel
in the code in this PR this test would fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would, I did test that manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's wait for @Quinn-With-Two-Ns approval though
Looks like CI is failing because it can't download the test server, you may want to follow up that the release was done properly. |
Yeah I've tried it a few times. I am able to download it locally so I don't think it is an issue with the release process. Probably some Github cache or something. |
@pdoerner just FYI you need to disable the new tests you added for the docker-compose tests since it uses the latests released OSS server. |
What was changed
Implemented cancellation types for Nexus operations which specify what action to take when the caller context is cancelled.
ABANDON
-> Do not request cancellation of the operation.TRY_CANCEL
-> Initiate a cancellation request and immediately report cancellation to the caller.WAIT_REQUESTED
-> Request cancellation of the operation and wait for confirmation that the request was received.WAIT_COMPLETED
-> Wait for operation completion. Default.These options are set on
workflow.NexusOperationOptions
which are passed in when starting an operation.Depends on temporalio/api#572
Why?
To give users more control over what happens to their handler workflows when their caller workflows are cancelled.
Checklist
Closes
How was this tested:
Updated functional tests
Any docs updates needed?