Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Access] Add draft of access api error framework #7106

Open
wants to merge 9 commits into
base: peter/refactor-access-subscription
Choose a base branch
from

Conversation

peterargue
Copy link
Contributor

@peterargue peterargue commented Feb 28, 2025

Experimenting with a potential framework we could use within the access APIs to improve error handling and make it more consistent across the different api types.

Some general comments about changes:

  • The backend api interface should handle all irrecoverable exceptions and only return benign errors. Callers into the backend can treat any unclassified errors as "internal errors" that abort request processing.
  • The backend should map errors into one of the access sentinel errors, which in turn have mappings into protocol specific (rest/grpc/websockets/etc) representations.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@peterargue peterargue requested a review from a team as a code owner February 28, 2025 20:43

// ErrorToStatusError converts an Access API error into a grpc status error. The input may either
// be a status.Error already, or an access sentinel error.
func ErrorToStatusError(err error) StatusError {
Copy link
Contributor Author

@peterargue peterargue Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each API has unique error reporting protocols (grpc vs rest vs websockets). Create API specific converters that return the appropriate mapping from access sentinels.

// it is possible for a client to request a finalized block from us
// containing some collection, then get a not found error when requesting
// that collection. These clients should retry.
err = rpc.ConvertStorageError(fmt.Errorf("please retry for collection in finalized block: %w", err))
Copy link
Contributor Author

@peterargue peterargue Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backend should stop returning grpc status codes, and instead return access sentinel errors

if err != nil {
return fmt.Errorf("could not ping collection node: %w", err)
if _, err := b.staticCollectionRPC.Ping(ctx, &accessproto.PingRequest{}); err != nil {
return status.Errorf(status.Code(err), "could not ping collection node: %v", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt this is actually used by anyone. I'm tempted to remove it

// Otherwise, it throws an irrecoverable exception
// Note: this method will not unwrap the error. the passed error must be an instance of a sentinel
// error.
func RequireAccessError(ctx context.Context, err error) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be used for more complex endpoints that call into other methods to ensure all unexpected errors are handled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could add your comment as context to the goDoc.

Copy link
Member

@durkmurder durkmurder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good job, I like this approach a lot and I think it makes sense considering the fact we are dealing with access API. My only comment would be that we need this clearly documented and defined somewhere.

It seems we agree on the fact that when dealing with protocol layer components we need to ensure that any error returned from the protocol layer is an expected error, everything else is considered as an unexpected behavior and leads to an irrecoverable exception. Using your approach we are doing this check at latest at the "backend" level and all higher level abstractions like the access API is dealing only with benign errors. As I said, I support that as long as we have a particular place where such error handling takes place. Creating a guideline will result in improved clarity and better code structure in general since it will be easier for engineers to argue about safety of the logic.

Correct me I am wrong but eventually each function from "backend" that implements some endpoint must include RequireNoError/RequireErrorIs/RequireAccessError, this way we ensure that only benign errors are propagated to the caller.

@peterargue
Copy link
Contributor Author

Correct me I am wrong but eventually each function from "backend" that implements some endpoint must include RequireNoError/RequireErrorIs/RequireAccessError, this way we ensure that only benign errors are propagated to the caller.

Yes, exactly.

Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preliminary set of minor comments

Comment on lines +146 to +148
func NewOutOfRangeError(err error) OutOfRangeError {
return OutOfRangeError{err: err}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[optional] since you are mentioning it:

flow-go/access/errors.go

Lines 139 to 140 in df07206

// This is a more specific version of DataNotFoundError, where the data is known to eventually exist, but
// currently is not known.

we could also implement it accordingly

Suggested change
func NewOutOfRangeError(err error) OutOfRangeError {
return OutOfRangeError{err: err}
}
func NewOutOfRangeError(err error) error {
return NewDataNotFoundError("OutOfRange", OutOfRangeError{err: err})
}

similarly for the PreconditionFailedError below, which you are identifying as special sub-cases of InvalidRequestError

Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach very much. It would be great, if you could add minimal documentation to the Access API interface and the corresponding protocol-level implementation where the critical error checks have to happen.

suggestion for the Access API interface

// API provides all public-facing functionality of the Flow Access API.
// CAUTION: SIMPLIFIED ERROR HANDLING
//   - All errors returned by this API are guaranteed to be benign. The
//     node can continue normal operations after such errors.
//   - To prevent delivering incorrect results to clients, in case of an
//     error, all other return values should be discarded.
type API interface {

Essentially, we are documenting the requirement for all implementations of the interface:

all implementations have to crash on anything except known benign errors

I am thinking about implementations like

This being documented only at the interface level bears the risk of this requirement accidentally being overlooked while updating the implementations. I think these risks are reasonably mitigated by just copying this documentation to all protocol-level backend implementations.

  • GetExecutionResultForBlockID could read something like:
     // GetExecutionResultForBlockID gets an execution result by ID of the executed block
     //
     // CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
     // As documented in the [access.API], which we partially implement with this function
     //   - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
     //   - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!
     func (b *backendExecutionResults) GetExecutionResultForBlockID(ctx context.Context, blockID flow.Identifier) (*flow.ExecutionResult, error) {

I tried to make the wording quite generic, so you can just copy the "caution"-block, e.g.

  • GetExecutionResultByID's goDoc would extend to:
     // GetExecutionResultByID gets an execution result by its ID.
     //
     // CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
     // As documented in the [access.API], which we partially implement with this function
     //   - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
     //   - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!
     func (b *backendExecutionResults) GetExecutionResultByID(ctx context.Context, id flow.Identifier) (*flow.ExecutionResult, error) {

Suggestion:

  • Unfortunately, I have to recommend to copy the "caution"-block into the implementations of all methods of the Access API interface

    • Though, I would only add this doc to the method implementations in package engine/access/rpc/backend for example, because here we are interfacing with the protocol-layer code that might throw irrecoverable or unclassified fatal errors.
    • Yet, another implementation of GetExecutionResultByID exists in the engine/access/rest/apiproxy package. This code does not interface with protocol-level components. Since that is pretty clear from context, I would suggest to not invest the time adding "caution"-blocks to the Access API implementations here. Neither do I think that the corresponding functions in the engine/access/rest/http/routes package require explicitly documentation.

    I hope these examples help to clarify where I think extra documentation would be beneficial vs where not.

Sorry, that this is boring work .. hopefully doesn't take long. I think the time is well-spent future-proofing the codebase by cleanly documenting the boundary where the error-handling convention changes.

Comment on lines +303 to +305
// No errors are expected during normal operation.
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// No errors are expected during normal operation.
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
//
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
// As documented in the [access.API], which we partially implement with this function
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!

Comment on lines +319 to +321
// No errors are expected during normal operation.
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// No errors are expected during normal operation.
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
//
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
// As documented in the [access.API], which we partially implement with this function
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!

Comment on lines +357 to +361
// Expected errors:
// - access.DataNotFoundError if the collection is not found.
//
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Expected errors:
// - access.DataNotFoundError if the collection is not found.
//
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
// As documented in the [access.API], which we partially implement with this function
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!
//
// Expected sentinel errors providing details to clients about failed requests:
// - access.DataNotFoundError if the collection is not found.

Comment on lines +376 to +380
// Expected errors:
// - access.DataNotFoundError if the collection is not found.
//
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Expected errors:
// - access.DataNotFoundError if the collection is not found.
//
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
// As documented in the [access.API], which we partially implement with this function
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!
//
// Expected sentinel errors providing details to clients about failed requests:
// - access.DataNotFoundError if the collection is not found.

) error {
if !errors.Is(genericErr, storage.ErrNotFound) {
return genericErr
// Will return the original error, possibly wrapped with additional context.
Copy link
Member

@AlexHentschel AlexHentschel Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️❓ I am wondering if we need to consider the next layer of code here

Please confirm that my understanding here is correct:

Suggested change
// Will return the original error, possibly wrapped with additional context.
// Will return the original error, possibly wrapped with additional context.
// CAUTION: this function might return irrecoverable errors or generic fatal from the lower protocol layers

Lets take a look at a couple of places that are calling resolveHeightError:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for functions in this file, I would suggest to include the "caution"-block in their goDocs:

// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
// As documented in the [access.API], which we partially implement with this function
//   - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
//   - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!

Comment on lines +32 to +34
// No errors are expected during normal operation.
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// No errors are expected during normal operation.
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
// As documented in the [access.API], which we partially implement with this function
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!

Comment on lines +53 to +56
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
//
// The block may or may not be finalized in the future; the client can retry later.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// All errors can be considered benign. Exceptions are handled explicitly within the backend and are
// not propagated.
//
// The block may or may not be finalized in the future; the client can retry later.
// CAUTION: this layer SIMPLIFIES the ERROR HANDLING convention
// As documented in the [access.API], which we partially implement with this function
// - All errors returned by this API are guaranteed to be benign. The node can continue normal operations after such errors.
// - Hence, we MUST check here and crash on all errors *except* for those known to be benign in the present context!

}

return data, nil
}

// GetProtocolStateSnapshotByBlockID returns serializable Snapshot for a block, by blockID.
// The requested block must be finalized, otherwise an error is returned.
//
// Expected errors during normal operation:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Expected errors during normal operation:
// Dedicated sentinel errors providing details to clients about failed requests:

}
return data, nil
}

// GetProtocolStateSnapshotByHeight returns serializable Snapshot by block height.
// The block must be finalized (otherwise the by-height query is ambiguous).
//
// Expected errors during normal operation:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Expected errors during normal operation:
// Dedicated sentinel errors providing details to clients about failed requests:

snapshot := b.state.Final()
data, err := convert.SnapshotToBytes(snapshot)
if err != nil {
return nil, status.Errorf(codes.Internal, "failed to convert snapshot to bytes: %v", err)
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at the moment, we do not guarantee that a snapshot can always be constructed for any finalized block (in other words, I think this call can fail

Suggested change
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err))
err = access.RequireErrorIs(ctx, err, protocol.ErrSealingSegmentBelowRootBlock, protocol.NewUnfinalizedSealingSegmentErrorf(""))
return nil, fmt.Errorf("snapshots might not be possible for every block: %w", err)

I think @jordanschalm knows if we would always be able to generate a snapshot for any block (including the root block when the node is freshly bootstrapped). Only if that is the case, we can use access.RequireNoError here .

// The block exists, but no block has been finalized at its height. Therefore, this block
// may be finalized in the future, and the client can retry.
return nil, access.NewPreconditionFailedError(
fmt.Errorf("failed to retrieve snapshot for block with height %d: block not finalized and is above finalized height",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fmt.Errorf("failed to retrieve snapshot for block with height %d: block not finalized and is above finalized height",
fmt.Errorf("failed to retrieve snapshot: block still pending finalization",

// storage.ErrNotFound is specifically NOT allowed since the snapshot's reference block must exist
// within the snapshot.
err = access.RequireErrorIs(ctx, err, state.ErrUnknownSnapshotReference)
return nil, access.NewDataNotFoundError("snapshot", fmt.Errorf("failed to retrieve a valid snapshot: block not found"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, access.NewDataNotFoundError("snapshot", fmt.Errorf("failed to retrieve a valid snapshot: block not found"))
return nil, access.NewDataNotFoundError("snapshot", fmt.Errorf("failed to retrieve snapshot: block not found"))

}

if blockIDFinalizedAtHeight != blockID {
// A different block than what was queried has been finalized at this height.
// Therefore, the queried block will never be finalized.
return nil, status.Errorf(codes.InvalidArgument,
"failed to retrieve snapshot for block: block not finalized and is below finalized height")
return nil, access.NewInvalidRequestError(fmt.Errorf("failed to retrieve snapshot for block: block not finalized and is below finalized height"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, access.NewInvalidRequestError(fmt.Errorf("failed to retrieve snapshot for block: block not finalized and is below finalized height"))
return nil, access.NewInvalidRequestError(fmt.Errorf("failed to retrieve snapshot: block orphaned"))

}

data, err := convert.SnapshotToBytes(snapshot)
if err != nil {
return nil, status.Errorf(codes.Internal, "failed to convert snapshot to bytes: %v", err)
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err))
Copy link
Member

@AlexHentschel AlexHentschel Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if we can guarantee that we are always be able to generate a snapshot for any block (including the root block when the node is freshly bootstrapped) we can use access.RequireNoError here (same as comment above)

Suggested change
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err))
err = access.RequireErrorIs(ctx, err, protocol.ErrSealingSegmentBelowRootBlock, protocol.NewUnfinalizedSealingSegmentErrorf(""))
return nil, fmt.Errorf("snapshots might not be possible for every block: %w", err)

Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more functions that I noticed probably needing handling of exceptions

}

data, err := convert.SnapshotToBytes(snapshot)
if err != nil {
return nil, status.Errorf(codes.Internal, "failed to convert snapshot to bytes: %v", err)
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if we can guarantee that we are always be able to generate a snapshot for any block (including the root block when the node is freshly bootstrapped) we can use access.RequireNoError here (same as comment above)

Suggested change
return nil, access.RequireNoError(ctx, fmt.Errorf("failed to convert snapshot to bytes: %w", err))
err = access.RequireErrorIs(ctx, err, protocol.ErrSealingSegmentBelowRootBlock, protocol.NewUnfinalizedSealingSegmentErrorf(""))
return nil, fmt.Errorf("snapshots might not be possible for every block: %w", err)

Copy link
Member

@AlexHentschel AlexHentschel Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only interact with a single protocol-level component here:

transactions storage.Transactions

in addition to TransactionsLocalDataProvider, which I commented on below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not using any protocol components, so its probably fine as is (?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors returned by these protocol components should ideally be checked for unexpected exceptions:

state protocol.State
collections storage.Collections
blocks storage.Blocks

in many cases, we just convert an exception and send a message to the client, e.g.

return nil, rpc.ConvertStorageError(err)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is another component, where we ideally would like to check errors returned by these lower-level protocol components 👇

executionReceipts storage.ExecutionReceipts
state protocol.State

and crash in case of unexpected exceptions

Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from Yurii's comment

clearly documented and defined somewhere.

eventually each function from "backend" that implements some endpoint must include RequireNoError/RequireErrorIs/RequireAccessError, this way we ensure that only benign errors are propagated to the caller.

Most of my comments are aimed at achieving this. Though, my comments are likely not an exhaustive list of all places where such error checks should be added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants