Skip to content

Implement Nexus operation cancellation types #1917

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
May 6, 2025
Merged

Conversation

pdoerner
Copy link
Contributor

@pdoerner pdoerner commented Apr 16, 2025

What was changed

Implemented cancellation types for Nexus operations which specify what action to take when the caller context is cancelled.

  • ABANDON -> Do not request cancellation of the operation.
  • TRY_CANCEL -> Initiate a cancellation request and immediately report cancellation to the caller.
  • WAIT_REQUESTED -> Request cancellation of the operation and wait for confirmation that the request was received.
  • WAIT_COMPLETED -> Wait for operation completion. Default.
    These options are set on workflow.NexusOperationOptions which are passed in when starting an operation.

Depends on temporalio/api#572

Why?

To give users more control over what happens to their handler workflows when their caller workflows are cancelled.

Checklist

  1. Closes

  2. How was this tested:
    Updated functional tests

  3. Any docs updates needed?

@pdoerner pdoerner requested a review from a team as a code owner April 16, 2025 12:46

const (
// Nexus operation cancellation type is unknown.
NexusOperationCancellationTypeUnspecified NexusOperationCancellationType = iota
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably could just make this TryCancel and not have an Unspecified (this is not a proto enum, don't have to follow proto enum rules). But if we think we may ever need to differentiate between unset and explicitly-set default, can leave here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion here but I thought having an explicit unset value might be useful if we decide to add more types or change the default in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's unlikely that we'll do either, but I could see a situation where we want to have a worker-level default someday or something and so knowing unspecified vs explicit try cancel works well there. So I'm ok leaving it as is.

Comment on lines 2667 to 2670
// CancellationType - Indicates what action should be taken when the caller is cancelled.
//
// Optional: defaults to NexusOperationCancellationTypeTryCancel.
CancellationType NexusOperationCancellationType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm for current released SDK without this PR, the behavior is the same as TryCancel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is TryCancel unless the operation has started, then it is WaitCompleted

Copy link
Member

@cretz cretz Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, so are we changing default cancel behavior for post-started operations? If so, this may require a replay test or an SDK flag (our approach to safe behavior changes) or something unless I'm thinking about it wrong...

So say in the current-before-this-PR SDK I have a workflow that starts a Nexus operation, waits for it to start, issues a cancel, and then the operation eventually completed successfully (maybe several tasks later). And all of this is in history. Now take the same code and same history and run it through a replayer on this-PR SDK. Won't the Nexus operation now fail as canceled where it succeeded before and therefore it give a non-deterministic exception because the this-PR SDK code applies try-cancel post-start and the current-before-this-PR SDK code applies wait-completed post-start?

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I guess that might be an issue? The history that is generated should be the same, but the operation future would get unblocked sooner with a canceled error. As I noted in another comment, it looked to me like it was a bug that we were waiting for completion after the operation had started. I'm not 100% certain what the intended behavior was before this PR but I could change the default to WaitCompleted which I think would cover that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the default has to be backwards compatible so it has to be WaitCompleted .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We lack history compatibility tests and probably should add them before we do any changes around here, but yes WaitCompleted should be backwards compatible

Copy link
Member

@cretz cretz Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes WaitCompleted should be backwards compatible

I was under the impression that WaitCompleted means if it's canceled before start it still waits until completed which may succeed whereas current code before the PR will fail if you cancel before start.

Is it true the behavior today before this PR is "TryCancel-pre-start" and "WaitCompleted-post-start"? If so, that is not the same as "WaitCompleted-pre-and-post-start" there is no enum to represent that, so we have to add a flag or have a way to represent that I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WaitCompleted means if it's canceled before start it still waits until completed

Yes that is true, and that should be true today as well. The nexus operation can also be cancelled before the command was ever sent to the server, that will cancel the future instantly

Copy link
Member

@cretz cretz Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, @pdoerner mentioned above that it was "TryCancel-pre-start" and "WaitCompleted-post-start", but you're saying it was "WaitCompleted" always? @pdoerner - can you confirm that it is not "TryCancel-pre-start" as mentioned at #1917 (comment)? Or maybe by pre-start you meant pre-command but not necessarily pre-nexus-start?

Regardless, yes if today's behavior pre-nexus-start (not to be confused with pre-command) and post-nexus-start is WaitCompleted, we should change this default to be WaitCompleted too. Concur with adding tests to confirm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Quinn is correct. I was confusing pre-command with pre-start. I believe WaitCompleted should always be backwards compatible and I have added a replay test to confirm.

switch event.EventType {
case enumspb.EVENT_TYPE_NEXUS_OPERATION_CANCEL_REQUEST_COMPLETED:
attrs := event.GetNexusOperationCancelRequestCompletedEventAttributes()
scheduledEventID = attrs.GetScheduledEventId()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this to be safe for use, it is imperative that temporalio/api#564 not be released before temporalio/api#572. If there is an API version that has the first and not the second, this code can have issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like unfortunately that did happen. I will update this PR to handle that case.

Copy link
Member

@cretz cretz Apr 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is the pain of frequently releasing things before we know how they will be used, we get into a situation where we now have 1.47.0 that we will have to support.

@Quinn-With-Two-Ns
Copy link
Contributor

@pdoerner are you able to add replay tests before or along with this PR? Happy to help or provide instruction here as well

@pdoerner
Copy link
Contributor Author

@pdoerner are you able to add replay tests before or along with this PR? Happy to help or provide instruction here as well

Yes, I will add them. Aiming to have this PR ready for re-review later today or tomorrow.

@pdoerner
Copy link
Contributor Author

pdoerner commented May 1, 2025

New functional tests I added will fail in CI (I tested them locally) until we have a newer CLI version that supports the new history events. To that effect I opened temporalio/cli#798 so we can either merge that and upgrade the version in the functional tests or I can disable the tests for now.

const (
// Nexus operation cancellation type is unknown.
//
// Exposed as:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing link to where it is exposed

Copy link
Contributor Author

@pdoerner pdoerner May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the doclink tool keeps removing them when I run it with -fix because it thinks they are stale. not too sure what's going on there but it looks like it is blocking the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I will remove them for now to allow the tests to run but LMK how I should resolve this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuandrew can you take a look, seems like a bug in the tool

Copy link
Contributor

@yuandrew yuandrew May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this is by design, the doclink tool only maps type aliases, but in this case these are 2 separate definitions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH we don't map these enum type values?

},
)

func CancelNexusOperationWorkflow(ctx workflow.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to cover all possible previous state transitions we could have taken here basically:

  • cancel before the operation start command is sent to the server
  • cancel after the command is sent but before nexus operation is started
  • cancel after the operation is started but before finished
  • cancel after finished

We also need to make sure the workflow generates different commands on these different code paths so the SDK can detect the non determinism

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM though agree with @Quinn-With-Two-Ns about the other replay scenarios. I know it's a bit annoying to add these replay tests (have to generate history first on older server, capture it, etc). Thanks for sticking with it!

replayer := worker.NewWorkflowReplayer()
replayer.RegisterWorkflow(CancelNexusOperationWorkflow)
err := replayer.ReplayWorkflowHistoryFromJSONFile(ilog.NewDefaultLogger(), "nexus-operation-cancelled.json")
s.NoError(err)
Copy link
Member

@cretz cretz May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own understanding, can you confirm that if we had the default as TryCancel in the code in this PR this test would fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would, I did test that manually.

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's wait for @Quinn-With-Two-Ns approval though

@Quinn-With-Two-Ns
Copy link
Contributor

Looks like CI is failing because it can't download the test server, you may want to follow up that the release was done properly.

@pdoerner
Copy link
Contributor Author

pdoerner commented May 1, 2025

Looks like CI is failing because it can't download the test server, you may want to follow up that the release was done properly.

Yeah I've tried it a few times. I am able to download it locally so I don't think it is an issue with the release process. Probably some Github cache or something.

@pdoerner pdoerner enabled auto-merge (squash) May 1, 2025 22:13
@pdoerner pdoerner disabled auto-merge May 2, 2025 17:19
@Quinn-With-Two-Ns
Copy link
Contributor

@pdoerner just FYI you need to disable the new tests you added for the docker-compose tests since it uses the latests released OSS server.

@pdoerner pdoerner merged commit 6d2cb6a into master May 6, 2025
18 checks passed
@pdoerner pdoerner deleted the nexus-cancellation-types branch May 6, 2025 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants