-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Access] Add draft of access api error framework #7106
base: peter/refactor-access-subscription
Are you sure you want to change the base?
[Access] Add draft of access api error framework #7106
Conversation
|
||
// ErrorToStatusError converts an Access API error into a grpc status error. The input may either | ||
// be a status.Error already, or an access sentinel error. | ||
func ErrorToStatusError(err error) StatusError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
each API has unique error reporting protocols (grpc vs rest vs websockets). Create API specific converters that return the appropriate mapping from access sentinels.
// it is possible for a client to request a finalized block from us | ||
// containing some collection, then get a not found error when requesting | ||
// that collection. These clients should retry. | ||
err = rpc.ConvertStorageError(fmt.Errorf("please retry for collection in finalized block: %w", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backend
should stop returning grpc status codes, and instead return access sentinel errors
if err != nil { | ||
return fmt.Errorf("could not ping collection node: %w", err) | ||
if _, err := b.staticCollectionRPC.Ping(ctx, &accessproto.PingRequest{}); err != nil { | ||
return status.Errorf(status.Code(err), "could not ping collection node: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt this is actually used by anyone. I'm tempted to remove it
// Otherwise, it throws an irrecoverable exception | ||
// Note: this method will not unwrap the error. the passed error must be an instance of a sentinel | ||
// error. | ||
func RequireAccessError(ctx context.Context, err error) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be used for more complex endpoints that call into other methods to ensure all unexpected errors are handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could add your comment as context to the goDoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good job, I like this approach a lot and I think it makes sense considering the fact we are dealing with access API. My only comment would be that we need this clearly documented and defined somewhere.
It seems we agree on the fact that when dealing with protocol layer components we need to ensure that any error returned from the protocol layer is an expected error, everything else is considered as an unexpected behavior and leads to an irrecoverable exception. Using your approach we are doing this check at latest at the "backend" level and all higher level abstractions like the access API is dealing only with benign errors. As I said, I support that as long as we have a particular place where such error handling takes place. Creating a guideline will result in improved clarity and better code structure in general since it will be easier for engineers to argue about safety of the logic.
Correct me I am wrong but eventually each function from "backend" that implements some endpoint must include RequireNoError/RequireErrorIs/RequireAccessError
, this way we ensure that only benign errors are propagated to the caller.
Yes, exactly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
preliminary set of minor comments
func NewOutOfRangeError(err error) OutOfRangeError { | ||
return OutOfRangeError{err: err} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[optional] since you are mentioning it:
Lines 139 to 140 in df07206
// This is a more specific version of DataNotFoundError, where the data is known to eventually exist, but | |
// currently is not known. |
we could also implement it accordingly
func NewOutOfRangeError(err error) OutOfRangeError { | |
return OutOfRangeError{err: err} | |
} | |
func NewOutOfRangeError(err error) error { | |
return NewDataNotFoundError("OutOfRange", OutOfRangeError{err: err}) | |
} |
similarly for the PreconditionFailedError
below, which you are identifying as special sub-cases of InvalidRequestError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach very much. It would be great, if you could add minimal documentation to the Access API interface and the corresponding protocol-level implementation where the critical error checks have to happen.
suggestion for the Access API interface
// API provides all public-facing functionality of the Flow Access API.
// CAUTION: SIMPLIFIED ERROR HANDLING
// - All errors returned by this API are guaranteed to be benign. The
// node can continue normal operations after such errors.
// - To prevent delivering incorrect results to clients, in case of an
// error, all other return values should be discarded.
type API interface {
Essentially, we are documenting the requirement for all implementations of the interface:
all implementations have to crash on anything except known benign errors
I am thinking about implementations like
func (b *backendExecutionResults) GetExecutionResultForBlockID(ctx context.Context, blockID flow.Identifier) (*flow.ExecutionResult, error) { flow-go/engine/access/rpc/backend/backend_execution_results.go
Lines 25 to 26 in df07206
// GetExecutionResultByID gets an execution result by its ID. func (b *backendExecutionResults) GetExecutionResultByID(ctx context.Context, id flow.Identifier) (*flow.ExecutionResult, error) {
This being documented only at the interface level bears the risk of this requirement accidentally being overlooked while updating the implementations. I think these risks are reasonably mitigated by just copying this documentation to all protocol-level backend implementations.
GetExecutionResultForBlockID
could read something like:// GetExecutionResultForBlockID gets an execution result by ID of the executed block // // CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention // As documented in the [access.API], which we partially implement with this function // - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors. // - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context! func (b *backendExecutionResults) GetExecutionResultForBlockID(ctx context.Context, blockID flow.Identifier) (*flow.ExecutionResult, error) {
I tried to make the wording quite generic, so you can just copy the "caution"-block, e.g.
GetExecutionResultByID
's goDoc would extend to:// GetExecutionResultByID gets an execution result by its ID. // // CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention // As documented in the [access.API], which we partially implement with this function // - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors. // - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context! func (b *backendExecutionResults) GetExecutionResultByID(ctx context.Context, id flow.Identifier) (*flow.ExecutionResult, error) {
Suggestion:
-
Unfortunately, I have to recommend to copy the "caution"-block into the implementations of all methods of the Access API interface
- Though, I would only add this doc to the method implementations in package
engine/access/rpc/backend
for example, because here we are interfacing with the protocol-layer code that might throw irrecoverable or unclassified fatal errors. - Yet, another implementation of
GetExecutionResultByID
exists in theengine/access/rest/apiproxy
package. This code does not interface with protocol-level components. Since that is pretty clear from context, I would suggest to not invest the time adding "caution"-blocks to the Access API implementations here. Neither do I think that the corresponding functions in theengine/access/rest/http/routes
package require explicitly documentation.
I hope these examples help to clarify where I think extra documentation would be beneficial vs where not.
- Though, I would only add this doc to the method implementations in package
Sorry, that this is boring work .. hopefully doesn't take long. I think the time is well-spent future-proofing the codebase by cleanly documenting the boundary where the error-handling convention changes.
// No errors are expected during normal operation. | ||
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | ||
// not propagated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// No errors are expected during normal operation. | |
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | |
// not propagated. | |
// | |
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention | |
// As documented in the [access.API], which we partially implement with this function | |
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors. | |
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context! |
// No errors are expected during normal operation. | ||
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | ||
// not propagated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// No errors are expected during normal operation. | |
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | |
// not propagated. | |
// | |
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention | |
// As documented in the [access.API], which we partially implement with this function | |
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors. | |
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context! |
// Expected errors: | ||
// - access.DataNotFoundError if the collection is not found. | ||
// | ||
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | ||
// not propagated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Expected errors: | |
// - access.DataNotFoundError if the collection is not found. | |
// | |
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | |
// not propagated. | |
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention | |
// As documented in the [access.API], which we partially implement with this function | |
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors. | |
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context! | |
// | |
// Expected sentinel errors providing details to clients about failed requests: | |
// - access.DataNotFoundError if the collection is not found. |
// Expected errors: | ||
// - access.DataNotFoundError if the collection is not found. | ||
// | ||
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | ||
// not propagated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Expected errors: | |
// - access.DataNotFoundError if the collection is not found. | |
// | |
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | |
// not propagated. | |
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention | |
// As documented in the [access.API], which we partially implement with this function | |
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors. | |
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context! | |
// | |
// Expected sentinel errors providing details to clients about failed requests: | |
// - access.DataNotFoundError if the collection is not found. |
) error { | ||
if !errors.Is(genericErr, storage.ErrNotFound) { | ||
return genericErr | ||
// Will return the original error, possibly wrapped with additional context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓⚠️ ❓ I am wondering if we need to consider the next layer of code here
Please confirm that my understanding here is correct:
// Will return the original error, possibly wrapped with additional context. | |
// Will return the original error, possibly wrapped with additional context. | |
// CAUTION: this function might return irrecoverable errors or generic fatal from the lower protocol layers |
Lets take a look at a couple of places that are calling resolveHeightError
:
-
flow-go/engine/access/rpc/backend/backend_accounts.go
Lines 64 to 72 in 95bbc46
// GetAccountAtBlockHeight returns the account details at the given block height. func (b *backendAccounts) GetAccountAtBlockHeight( ctx context.Context, address flow.Address, height uint64, ) (*flow.Account, error) { blockID, err := b.headers.BlockIDByHeight(height) if err != nil { return nil, rpc.ConvertStorageError(resolveHeightError(b.state.Params(), height, err)) resolveHeightError
will just pass the error through; we'll wrap it as anInternal
server error and return it to the client. To summarize: there was a fatal error and we have not recognized it as such.Instead, I think lines 70-73 in
GetAccountAtBlockHeight
should read:blockID, err := b.headers.BlockIDByHeight(height) if err != nil { err = access.RequireErrorIs(ctx, err, storage.ErrNotFound) // fatal for any error other than storage.ErrNotFound return nil, rpc.ConvertStorageError(resolveHeightError(b.state.Params(), height, err)) }
-
analogous change requests I have for functions
GetAccountBalanceAtBlockHeight
GetAccountKeyAtBlockHeight
GetAccountKeysAtBlockHeight
GetBlockByHeight
GetBlockHeaderByHeight
GetEventsForHeightRange
. In this function, I noticed additional places where we call into the protocol logic:flow-go/engine/access/rpc/backend/backend_events.go
Lines 70 to 77 in 95bbc46
// get the latest sealed block header sealed, err := b.state.Sealed().Head() if err != nil { // sealed block must be in the store, so throw an exception for any error err := irrecoverable.NewExceptionf("failed to lookup sealed header: %w", err) irrecoverable.Throw(ctx, err) return nil, err } ExecuteScriptAtBlockHeight
Essentially, I am worried about any place you are calling into the low-level protocol. For all of those method, I would suggest to include the "caution"-goDoc-block as in my comment above and crash on any error except the expected error.
-
For
getAccountFromLocalStorage
I am less sure, because I am not sure how rigorous the execution team is doing their error handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for functions in this file, I would suggest to include the "caution"-block in their goDocs:
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
// As documented in the [access.API], which we partially implement with this function
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!
// No errors are expected during normal operation. | ||
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | ||
// not propagated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// No errors are expected during normal operation. | |
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | |
// not propagated. | |
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention | |
// As documented in the [access.API], which we partially implement with this function | |
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors. | |
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context! |
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | ||
// not propagated. | ||
// | ||
// The block may or may not be finalized in the future; the client can retry later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are | |
// not propagated. | |
// | |
// The block may or may not be finalized in the future; the client can retry later. | |
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention | |
// As documented in the [access.API], which we partially implement with this function | |
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors. | |
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context! |
} | ||
|
||
return data, nil | ||
} | ||
|
||
// GetProtocolStateSnapshotByBlockID returns serializable Snapshot for a block, by blockID. | ||
// The requested block must be finalized, otherwise an error is returned. | ||
// | ||
// Expected errors during normal operation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Expected errors during normal operation: | |
// Dedicated sentinel errors providing details to clients about failed requests: |
} | ||
return data, nil | ||
} | ||
|
||
// GetProtocolStateSnapshotByHeight returns serializable Snapshot by block height. | ||
// The block must be finalized (otherwise the by-height query is ambiguous). | ||
// | ||
// Expected errors during normal operation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Expected errors during normal operation: | |
// Dedicated sentinel errors providing details to clients about failed requests: |
snapshot := b.state.Final() | ||
data, err := convert.SnapshotToBytes(snapshot) | ||
if err != nil { | ||
return nil, status.Errorf(codes.Internal, "failed to convert snapshot to bytes: %v", err) | ||
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at the moment, we do not guarantee that a snapshot can always be constructed for any finalized block (in other words, I think this call can fail
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err)) | |
err = access.RequireErrorIs(ctx, err, protocol.ErrSealingSegmentBelowRootBlock, protocol.NewUnfinalizedSealingSegmentErrorf("")) | |
return nil, fmt.Errorf("snapshots might not be possible for every block: %w", err) |
I think @jordanschalm knows if we would always be able to generate a snapshot for any block (including the root block when the node is freshly bootstrapped). Only if that is the case, we can use access.RequireNoError
here .
// The block exists, but no block has been finalized at its height. Therefore, this block | ||
// may be finalized in the future, and the client can retry. | ||
return nil, access.NewPreconditionFailedError( | ||
fmt.Errorf("failed to retrieve snapshot for block with height %d: block not finalized and is above finalized height", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmt.Errorf("failed to retrieve snapshot for block with height %d: block not finalized and is above finalized height", | |
fmt.Errorf("failed to retrieve snapshot: block still pending finalization", |
// storage.ErrNotFound is specifically NOT allowed since the snapshot's reference block must exist | ||
// within the snapshot. | ||
err = access.RequireErrorIs(ctx, err, state.ErrUnknownSnapshotReference) | ||
return nil, access.NewDataNotFoundError("snapshot", fmt.Errorf("failed to retrieve a valid snapshot: block not found")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil, access.NewDataNotFoundError("snapshot", fmt.Errorf("failed to retrieve a valid snapshot: block not found")) | |
return nil, access.NewDataNotFoundError("snapshot", fmt.Errorf("failed to retrieve snapshot: block not found")) |
} | ||
|
||
if blockIDFinalizedAtHeight != blockID { | ||
// A different block than what was queried has been finalized at this height. | ||
// Therefore, the queried block will never be finalized. | ||
return nil, status.Errorf(codes.InvalidArgument, | ||
"failed to retrieve snapshot for block: block not finalized and is below finalized height") | ||
return nil, access.NewInvalidRequestError(fmt.Errorf("failed to retrieve snapshot for block: block not finalized and is below finalized height")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil, access.NewInvalidRequestError(fmt.Errorf("failed to retrieve snapshot for block: block not finalized and is below finalized height")) | |
return nil, access.NewInvalidRequestError(fmt.Errorf("failed to retrieve snapshot: block orphaned")) |
} | ||
|
||
data, err := convert.SnapshotToBytes(snapshot) | ||
if err != nil { | ||
return nil, status.Errorf(codes.Internal, "failed to convert snapshot to bytes: %v", err) | ||
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if we can guarantee that we are always be able to generate a snapshot for any block (including the root block when the node is freshly bootstrapped) we can use access.RequireNoError
here (same as comment above)
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err)) | |
err = access.RequireErrorIs(ctx, err, protocol.ErrSealingSegmentBelowRootBlock, protocol.NewUnfinalizedSealingSegmentErrorf("")) | |
return nil, fmt.Errorf("snapshots might not be possible for every block: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some more functions that I noticed probably needing handling of exceptions
} | ||
|
||
data, err := convert.SnapshotToBytes(snapshot) | ||
if err != nil { | ||
return nil, status.Errorf(codes.Internal, "failed to convert snapshot to bytes: %v", err) | ||
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if we can guarantee that we are always be able to generate a snapshot for any block (including the root block when the node is freshly bootstrapped) we can use access.RequireNoError
here (same as comment above)
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err)) | |
err = access.RequireErrorIs(ctx, err, protocol.ErrSealingSegmentBelowRootBlock, protocol.NewUnfinalizedSealingSegmentErrorf("")) | |
return nil, fmt.Errorf("snapshots might not be possible for every block: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we only interact with a single protocol-level component here:
transactions storage.Transactions |
in addition to TransactionsLocalDataProvider
, which I commented on below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not using any protocol components, so its probably fine as is (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors returned by these protocol components should ideally be checked for unexpected exceptions:
flow-go/engine/access/rpc/backend/transactions_local_data_provider.go
Lines 48 to 50 in e176fe3
state protocol.State | |
collections storage.Collections | |
blocks storage.Blocks |
in many cases, we just convert an exception and send a message to the client, e.g.
return nil, rpc.ConvertStorageError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is another component, where we ideally would like to check errors returned by these lower-level protocol components 👇
flow-go/engine/common/rpc/execution_node_identities_provider.go
Lines 30 to 31 in e176fe3
executionReceipts storage.ExecutionReceipts | |
state protocol.State |
and crash in case of unexpected exceptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from Yurii's comment
clearly documented and defined somewhere.
eventually each function from "backend" that implements some endpoint must include RequireNoError/RequireErrorIs/RequireAccessError, this way we ensure that only benign errors are propagated to the caller.
Most of my comments are aimed at achieving this. Though, my comments are likely not an exhaustive list of all places where such error checks should be added.
Experimenting with a potential framework we could use within the access APIs to improve error handling and make it more consistent across the different api types.
Some general comments about changes:
backend
api interface should handle all irrecoverable exceptions and only return benign errors. Callers into the backend can treat any unclassified errors as "internal errors" that abort request processing.backend
should map errors into one of the access sentinel errors, which in turn have mappings into protocol specific (rest/grpc/websockets/etc) representations.