Input lost on unhandled exception thrown in durable entities #1751

Open
asos-sarahwood opened this issue Mar 26, 2021 · 7 comments

Comments

@asos-sarahwood

Description

When we throw an unhandled exception from within a durable entity, we see a failure but lose the input. No input message can be found on the queues in the storage account.

We are using the function-based syntax.

Expected behavior

As we have not overridden the default value for RollbackOnException, we expected the state not to be persisted and the input to be retried.

Actual behavior

We lose the input message.

@ghost ghost added the Needs: Triage 🔍 label Mar 26, 2021
@sebastianburckhardt
Collaborator

From your description, it appears that the implementation is working as designed: there is no automatic retry for entity operations if the application code throws an exception.

If retries are important in your situation, your best option at the moment is to implement your own retries. You can use standard exception handling to catch the exception. To retry it, you can use a loop, or you can have the entity send a signal to itself (or perhaps a delayed signal).
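
The loop-based retry suggested above can be sketched as follows. This is a minimal illustration in plain Python, not Durable Functions code: `TransientError`, `run_with_retries`, and the `max_attempts` / `delay_seconds` parameters are hypothetical names chosen for the example, and deciding which exceptions count as retryable is left to the application.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for faults the application considers retryable."""


def run_with_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Only TransientError is retried; any other exception propagates at once.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as error:
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # optional pause between attempts
    raise last_error
```

Because entity operations run one at a time, a retry loop that waits inside the operation holds up later operations on the same entity; the delayed self-signal mentioned above avoids that at the cost of more moving parts.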

@olitomlinson
Contributor

@sebastianburckhardt

I wonder if the docs need to be more explicit about there not being any retry by default on an Entity Operation? I wrongly assumed there was a retry...

Also, given that entity operations are dequeued in batches, does that mean they are ACKd in batches too?

So, given a batch of 50 entity operations (which may span N entities in the partition), if the 25th throws an exception, does that roll back all the state for entities 1-25?

@sebastianburckhardt
Collaborator

> wonder if the docs need to be more explicit about there not being any retry by default on an Entity Operation?

I agree that more documentation on exception handling is needed. Note that the design is consistent with how we handle exceptions in activities. The runtime considers an operation that throws an exception as completed (just as for activities). The exception is considered the "result" of that operation and is propagated back to the caller (if the operation was a call and not a signal). If it is a signal, the exception cannot be handled by the signaller, but it is still visible in the traces.

This distinction between runtime-internal faults and exceptions thrown by application code is very important for the programming model in general. Runtime-internal faults are retried transparently, but exceptions thrown by the application are not. The rationale is that the runtime does not really know the meaning of exceptions thrown from application code: it cannot know whether the exception reflects a transient fault that should be retried, or an issue that cannot be resolved by retrying and should therefore be handled by an exception handler in the calling orchestration.

> So, given a batch of 50 entity operations (which may span N entities in the partition) if the 25th throws an exception, does that rollback all the state for entities 1-25?

Depends on the setting of RollbackOnException.

  • RollbackOnException=true (default)
    Only operation 25 is rolled back. All other operations (1-24, 26-50) are processed normally.
    The rollback resets the entity state to what it was before starting operation 25, and suppresses any signals that were sent or orchestrations that were started by operation 25.

  • RollbackOnException=false
    There is no rollback: everything operation 25 does before throwing the exception remains in effect (all entity state updates, signals sent, and orchestrations started). Operations 1-24 and 26-50 are processed as usual.
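
The two settings can be compared with a toy model. The sketch below is plain Python and does not use or reproduce the Durable Functions runtime: `process_batch`, the operation signature `(state, send_signal)`, and the example operations are all hypothetical, and a single state dictionary stands in for one entity.

```python
import copy


def process_batch(state, operations, rollback_on_exception=True):
    """Toy model of per-operation rollback within a batch.

    Each operation is a callable taking (state, send_signal).
    Returns (final_state, delivered_signals, failed_indices).
    """
    delivered = []
    failed = []
    for index, operation in enumerate(operations, start=1):
        snapshot = copy.deepcopy(state)  # state before this operation
        outbox = []                      # signals sent by this operation
        try:
            operation(state, outbox.append)
        except Exception:
            failed.append(index)
            if rollback_on_exception:
                state = snapshot  # undo only this operation's state changes
                outbox = []       # suppress only this operation's signals
        delivered.extend(outbox)
    return state, delivered, failed


def add_one(state, send_signal):
    state["count"] += 1
    send_signal("added")


def add_then_fail(state, send_signal):
    state["count"] += 100
    send_signal("about-to-fail")
    raise ValueError("application error")


operations = [add_one, add_then_fail, add_one]
with_rollback = process_batch({"count": 0}, operations, True)
without_rollback = process_batch({"count": 0}, operations, False)
```

In this model, `with_rollback` ends with a count of 2 and two "added" signals, while `without_rollback` ends with a count of 102 and also delivers "about-to-fail". In both cases the operation after the failing one still runs.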

One thing I realize as I am writing this is that RollbackOnException is probably not supported for out-of-proc languages like JavaScript or Python. @ConnorMcMahon or @davidmrdavid can probably clarify this. We may need separate docs for those; I have not had a close look at how they implement this.

@olitomlinson
Contributor

@sebastianburckhardt awesome thanks!

Follow up...

> RollbackOnException=true (default)
> Only operation 25 is rolled back. All other operations (1-24, 26-50) are processed normally.
> The rollback resets the entity state to what it was before starting operation 25, and suppresses any signals that were sent or orchestrations that were started by operation 25.

Are signals which may be destined for entities in a different partition also rolled back? [No caveats/nuances etc]

@davidmrdavid
Contributor

Thanks for the ping @sebastianburckhardt. I have personally not validated the RollbackOnException scenario in OOProc yet. I'm tracking that item in these 2 tickets:

@sebastianburckhardt
Collaborator

> Are signals which may be destined for entities in a different partition also rolled back? [No caveats/nuances etc]

The rollback includes all the effects that the operation performs via the IDurableEntityContext. The partition does not matter.

Any effects performed in other ways (e.g. direct I/O calls, or using an injected IDurableClient) are not visible to the runtime and are not rolled back.
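
The difference between effects that go through the context and effects that bypass it can be illustrated with another toy model. This is plain Python with hypothetical names (`ToyContext`, `run_operation`, `external_log`); it only mimics the idea that buffered, context-mediated effects can be discarded, and it is not the IDurableEntityContext API.

```python
import copy

# Stands in for direct I/O or an injected client: invisible to the toy runtime.
external_log = []


class ToyContext:
    """Buffers state changes and outgoing signals until the operation completes."""

    def __init__(self, state):
        self.state = copy.deepcopy(state)
        self.signals = []

    def signal_entity(self, target, operation_name):
        self.signals.append((target, operation_name))


def run_operation(state, operation):
    """Run one operation; on an exception, discard everything buffered in the context."""
    context = ToyContext(state)
    try:
        operation(context)
    except Exception:
        return state, []  # rollback: original state, no signals
    return context.state, context.signals


def failing_operation(context):
    context.state["count"] += 1                             # buffered, rolled back
    context.signal_entity("entity-elsewhere", "notify")     # buffered, rolled back
    external_log.append("wrote to external store")          # not rolled back
    raise RuntimeError("application error")


final_state, sent_signals = run_operation({"count": 0}, failing_operation)
```

After this runs, `final_state` and `sent_signals` show no trace of the failed operation, but `external_log` still contains the entry written outside the context. Where the signal's target lives makes no difference to the toy rollback, which matches the answer above.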

@mpaul31
mpaul31 commented Mar 31, 2021

Has anyone tried the new Azure Functions retry policies with entities?

https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages?tabs=csharp#retry-policies-preview
