[Access] Properly handle subscription errors in data providers #7046

Open
wants to merge 17 commits into
base: master

Conversation

illia-malachyn
Contributor

@illia-malachyn illia-malachyn commented Feb 17, 2025

Distinguish between context.Canceled errors originating from the streamer and those triggered by the DataProvider’s Close() method.

  1. I added closeChan, which the DataProvider.Close() method uses to signal to DataProvider.Run() that the users of data providers (the WebSocket controller in our case) want to stop receiving data.
  2. The HandleSubscription() function is replaced by a run() function that is aware of closeChan. I made a new function because HandleSubscription() is widely used in the access package (HandleRPCSubscription has 22 usages at the moment).
  3. run() returns nil if closeChan was closed. HandleSubscription() returned ctx.Canceled, which led to confusion because ctx.Canceled could come from two sources (the streamer and the WebSocket controller).
  4. I did a little refactoring of the sendResponse() functions to make them more readable.

Closes #7040 #7047

Distinguish between `context.Canceled` errors originating from the
streamer and those triggered by the DataProvider’s `Close()` method.
Use `wasClosedByClient()` to suppress expected cancellations while
propagating unexpected ones

codecov-commenter commented Feb 17, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 54 lines in your changes missing coverage. Please review.

Project coverage is 41.27%. Comparing base (4cabd39) to head (7351588).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| engine/access/rest/common/models/block.go | 47.27% | 26 Missing and 3 partials ⚠️ |
| ...ss/rest/websockets/data_providers/base_provider.go | 71.79% | 8 Missing and 3 partials ⚠️ |
| ...ckets/data_providers/mock/data_provider_factory.go | 25.00% | 3 Missing and 3 partials ⚠️ |
| ...ockets/data_providers/account_statuses_provider.go | 94.11% | 2 Missing and 1 partial ⚠️ |
| ...st/websockets/data_providers/mock/data_provider.go | 50.00% | 1 Missing and 1 partial ⚠️ |
| .../rest/websockets/data_providers/args_validation.go | 80.00% | 1 Missing ⚠️ |
| .../rest/websockets/data_providers/blocks_provider.go | 96.87% | 1 Missing ⚠️ |
| .../rest/websockets/data_providers/events_provider.go | 97.95% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #7046    +/-   ##
========================================
  Coverage   41.27%   41.27%            
========================================
  Files        2170     2170            
  Lines      190047   190154   +107     
========================================
+ Hits        78438    78484    +46     
- Misses     105070   105122    +52     
- Partials     6539     6548     +9     
Flag Coverage Δ
unittests 41.27% <85.71%> (+<0.01%) ⬆️

Guitarheroua
Guitarheroua previously approved these changes Feb 17, 2025
Contributor

@Guitarheroua Guitarheroua left a comment

Looks good to me!

@illia-malachyn
Contributor Author

illia-malachyn commented Feb 17, 2025

After discussing with Yurii, we agreed to use a different approach here.

@illia-malachyn illia-malachyn marked this pull request as draft February 17, 2025 15:02
@illia-malachyn illia-malachyn changed the title Properly handle subscription errors in data providers [DRAFT] [Access] Properly handle subscription errors in data providers Feb 17, 2025
@Guitarheroua Guitarheroua dismissed their stale review February 18, 2025 13:43

As this will change, I will re-approve the final implementation.

We use it to distinguish where the ctx.Canceled
error comes from.

Also, I refactored each data provider's Run()
function. It is now more readable and clear.
@illia-malachyn illia-malachyn marked this pull request as ready for review February 18, 2025 18:15
@illia-malachyn illia-malachyn changed the title [DRAFT] [Access] Properly handle subscription errors in data providers [Access] Properly handle subscription errors in data providers Feb 19, 2025
Contributor

@peterargue peterargue left a comment

why is this refactor needed? can you point out the important changes, it's not clear to me

@illia-malachyn
Contributor Author

why is this refactor needed? can you point out the important changes, it's not clear to me

Hey. I updated the PR's description and added context on what has been done and why.

@illia-malachyn
Contributor Author

@peterargue I also pointed out the most important lines of code

Contributor

@Guitarheroua Guitarheroua left a comment

The first round of review - DONE! I have a few small comments.

Contributor

@Guitarheroua Guitarheroua left a comment

Looks good!

@illia-malachyn illia-malachyn marked this pull request as draft February 27, 2025 18:31
@illia-malachyn illia-malachyn marked this pull request as ready for review February 27, 2025 18:31
@@ -56,4 +63,63 @@ func (b *baseDataProvider) Arguments() models.Arguments {
// No errors are expected during normal operations.
func (b *baseDataProvider) Close() {
b.cancel()
b.closedFlag.Do(func() {
Contributor

why was this needed? this type of pattern often points to a design problem.

Contributor

could we just use the context's Done() channel for closedChan?

Contributor Author

The use of closedFlag ensures that the done channel is closed exactly once, even if Close() is called multiple times.

We expect clients to call Run() and Close() from different goroutines—this makes sense because Run() is blocking, and clients need to call Close() concurrently to stop receiving data.

Since a client might have multiple exit conditions triggering Close(), we handle the idempotency internally for simplicity rather than requiring clients to manage it themselves. This prevents potential panics from closing an already-closed channel.
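
A minimal sketch of the idempotent close described above, assuming a provider that owns a `done` channel guarded by a `sync.Once`. The `closer` type and the `newCloser` and `isClosed` helpers are hypothetical names for illustration; only `done` and `doneOnce` appear in this PR's diff.

```go
package main

import (
	"fmt"
	"sync"
)

// closer is a hypothetical type that owns a done channel.
type closer struct {
	done     chan struct{}
	doneOnce sync.Once
}

func newCloser() *closer {
	return &closer{done: make(chan struct{})}
}

// Close may be called any number of times, from any goroutine.
// sync.Once guarantees close(c.done) executes exactly once, so a second
// call cannot panic with "close of closed channel".
func (c *closer) Close() {
	c.doneOnce.Do(func() { close(c.done) })
}

// isClosed reports whether Close has been called, without blocking.
func (c *closer) isClosed() bool {
	select {
	case <-c.done:
		return true
	default:
		return false
	}
}

func main() {
	c := newCloser()

	// Several exit conditions trigger Close concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Close()
		}()
	}
	wg.Wait()

	fmt.Println("closed:", c.isClosed())
}
```

Without the `sync.Once`, two concurrent `Close()` calls could both reach `close(c.done)`, and the second one would panic.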

Contributor Author

Indeed, we can use a ctx instead of a done channel here. However, I'm not sure how to do so correctly without storing ctx in the struct. I will ping you once it is ready.

Contributor Author

@illia-malachyn illia-malachyn Mar 18, 2025

Actually, we cannot use a ctx safely. Read a PR & linked issue description for it.

As discussed at the meeting, using context.Context is possible but has disadvantages. If we use it, we may run into a case where we suppress a real error from an access node and silently shut down without notifying the user.

@peterargue I thought maybe it is still better to leave the done channel? In this case, we fix the potential issue and we don't have to worry about it. It is basically a couple of lines of code. I think it is worth it for such a 'complex' problem. I pushed a refactored version with a done channel. What do you think?

Contributor

Read a PR & linked issue description for it.

@illia-malachyn can you add a pointer to that PR/issue here.

In this case, we fix the potential issue and we don't have to worry about it.

which issue are you referring to?

I'm not clear how using a separate done channel is different from a context's done channel wrt lost errors. It seems like it's possible in both cases.

Contributor Author

I mean this PR's description and the related issue. I'll describe it once again.

Let's assume we're in this place where we got the context.Canceled error from a subscription.
https://github.com/onflow/flow-go/pull/7046/files#diff-5cbca3503bb00261318db4a8b8b1714447f7348a4d87e11aaf8b37b43b36e2bbR120-R128

We cannot know exactly what happened - is it a controller who initiated a shutdown or has something happened in the subscription? Depending on who initiated the shutdown, we have to behave differently. If it is a streamer's/subscription's error, we have to propagate it and react to it. If it is the controller, we want to suppress it and return nil.

As we cannot tell who created this error, I suggest using a done channel for the graceful shutdown of a provider. In that case, we treat every ctx.Canceled error as something bad and react appropriately. If done is closed, we return nil.

Contributor Author

I'm not sure the above link was created well. I'm speaking of this place in the run() function

```go
case value, ok := <-subscription.Channel():
	if !ok {
		err := subscription.Err()
		if err != nil {
			return fmt.Errorf("subscription finished with error: %w", err) // who cancelled the context??
		}

		return nil
	}
```

Contributor Author

@illia-malachyn illia-malachyn Mar 19, 2025

We still may run into a race condition if 2 events happen concurrently (controller shutdowns provider and error happening in subscription/streamer) but it is fine as it is a "natural" concurrency.

With a done channel, we disambiguate the error at this point and can handle a shutdown request gracefully in the normal flow, when the two events do not happen concurrently.

Contributor

We cannot know exactly what happened - is it a controller who initiated a shutdown or has something happened in the subscription?

Looking over the code, I think we're overthinking this.

The only way the subscription's error is context.Canceled is if the context passed to the streamer was canceled. That context is passed by the data provider, which originally comes from the controller. This means that if the error is context canceled, either:

  1. The controller called the provider's Close() method.
  2. The controller's parent context was cancelled, signalling the node is shutting down.

I think a simple check if the error is context canceled in the conditional should be all that's needed.

I think it's OK for now to handle server shutdowns by gracefully shutting down the connection. Ideally, we'd signal the shutdown with an error to the user, but that can come later. One note from looking over the code, when the controller's context is canceled (server shutdown), we start dropping all messages, so any error response would be dropped anyway:

func (c *Controller) writeResponse(ctx context.Context, response interface{}) {
	select {
	case <-ctx.Done():
		return
	case c.multiplexedStream <- response:
	}
}

We still may run into a race condition if 2 events happen concurrently (controller shutdowns provider and error happening in subscription/streamer)

I don't think it's possible for there to be 2 errors happening simultaneously. The streamer is single threaded and handles each message serially. If the context is canceled, it will detect that before trying to process more data. If the subscription encountered an error at the same time as the context was canceled, the streamer would handle that error before checking the context at the beginning of the loop. If there are any codepaths that violate this, we should fix them.

my proposal is to simply use

value, ok := <-subscription.Channel()
if !ok {
	err := subscription.Err()
	if err != nil && !errors.Is(err, context.Canceled) {
		return fmt.Errorf("subscription finished with error: %w", err)
	}

	return nil
}
...

this removes the need for the extra done channel, and we can simply call cancel in the Close() method

- Moved creation and start of subscription and streamer
to Run() function instead of having it in constructor

- Got rid of ctx in constructor. Moved it to Run()
function

- Refactored the structure of the base provider
Comment on lines +89 to +91
// set to nils in case Run() called for the second time
p.messageIndex = counters.NewMonotonicCounter(0)
p.blocksSinceLastMessage = 0
Contributor

These seem like defensive checks against a developer incorrectly calling Run multiple times. If that's the concern, I think we should just ensure Run is only called once and skip reinitializing here.

You could use an atomic bool with a CompareAndSwap check done at the beginning of the function.

Contributor Author

@illia-malachyn illia-malachyn Mar 19, 2025

So, this is the question of whether we allow clients to reuse a provider after a Run()/Close() pair. With the current code, we allow it and it works just fine. Do you want to restrict it?

Contributor

I think we should not allow it.

messageIndex: counters.NewMonotonicCounter(0),
blocksSinceLastMessage: 0,
stateStreamApi: stateStreamApi,
}, nil
}

// Run starts processing the subscription for events and handles responses.
Contributor

Suggested change
- // Run starts processing the subscription for events and handles responses.
+ // Run starts processing the subscription for events and handles responses.
+ // Must only be called once

b.doneOnce.Do(func() {
close(b.done)
})
b.subscriptionState.cancelSubscriptionContext()
Contributor

from #7046 (comment)

We expect clients to call Run() and Close() from different goroutines—this makes sense because Run() is blocking, and clients need to call Close() concurrently to stop receiving data.

Since Close() is called in a separate goroutine, it's possible for it to be called before subscriptionState is set in Run(), e.g. if the client disconnects while subscribing. This is likely to cause occasional crashes.

I think we need to go back to passing the context during initialization so we can initialize the cancel before returning the object.

Successfully merging this pull request may close these issues.

[Access] Data providers should wrap context.Canceled error
4 participants