Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove anti-pattern of ignoring irrecoverable error channel #7155

Merged
merged 6 commits into from
Mar 19, 2025

Conversation

peterargue
Copy link
Contributor

This PR removes all cases where we ignore the error channel for an irrecoverable context. There was only one case in application code in the vote aggregator. The component that received the context never threw using it, so there was no risk of suppressing errors. However, it risks issues in the future if we extended the logic and makes it more likely for this pattern to be repeated elsewhere.

All of the other changes are in tests. I updated them to use the mock instance that asserts no errors were thrown.

@peterargue peterargue requested a review from a team as a code owner March 14, 2025 03:24
if err := util.WaitError(errCh, ctx.Done()); err != nil {
parentCtx.Throw(err)
}
}()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the main change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if instead of spawning a go routine we replace <-collectors.Ready with:

if err := util.WaitError(errCh, collectors.Done()); err != nil {
	parentCtx.Throw(err)
}

This way it will be easier to reason when the errors are delivered and of order of operations in general.
Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where would you put the call to cancel in that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you thinking of switching it to something like this?

componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	signalerCtx, errCh := irrecoverable.WithSignaler(ctx)

	collectors.Start(signalerCtx)

	go func() {
		<-collectors.Ready()
		ready()
	
		wg.Wait()
		cancel()
	}()
	
	if err := util.WaitError(errCh, collectors.Done()); err != nil {
		parentCtx.Throw(err)
	}
})

I think it's OK to switch the logic, but in either case the WaitError is sitting in a different goroutine than the logic that throws the errors (by design) so order of operation is up to the scheduler

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this:

componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
	// create new context which is not connected to parent
	// we need to ensure that our internal workers stop before asking
	// vote collectors to stop. We want to avoid delivering events to already stopped vote collectors
	ctx, cancel := context.WithCancel(context.Background())
	signalerCtx, errCh := irrecoverable.WithSignaler(ctx)
	// start vote collectors
	collectors.Start(signalerCtx)
	<-collectors.Ready()

	ready()

	// wait for internal workers to stop
	wg.Wait()
	// signal vote collectors to stop
	cancel()
	// wait for it to stop
	if err := util.WaitError(errCh, collectors.Done()); err != nil {
		parentCtx.Throw(err)
	}
})

Why? I think its hard to argue what will happen in case of shutdown to this code:

	go func() {
		if err := util.WaitError(errCh, ctx.Done()); err != nil {
			parentCtx.Throw(err)
		}
	}()

Basically thinking about a case where the scheduler doesn't give an opportunity to the detached goroutine to report the error while the parent goroutine will perform the shutdown. I've tried to sequence this code but it's also not ideal since in my version we get guarantee that each error will be reported but it will get reported only on shutdown, there is no way to signal that an error has happened before the actual shutdown has been requested.

What if we add a wait group into the mix to provide a guarantee that error reporting goroutine exists before commencing shutdown of worker. This way we establish happes-before relation between shutdown and error reporting.

componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
		// create new context which is not connected to parent
		// we need to ensure that our internal workers stop before asking
		// vote collectors to stop. We want to avoid delivering events to already stopped vote collectors
		ctx, cancel := context.WithCancel(context.Background())
		signalerCtx, errCh := irrecoverable.WithSignaler(ctx)

		// since we are breaking the connection between parentCtx and signalerCtx, we need to
		// explicitly rethrow any errors from signalerCtx to parentCtx, otherwise they are dropped.
		var errReportingDone sync.WaitGroup
		errReportingDone.Add(1)
		go func() {
			defer errReportingDone.Done()
			if err := util.WaitError(errCh, ctx.Done()); err != nil {
				parentCtx.Throw(err)
			}
		}()

		// start vote collectors
		collectors.Start(signalerCtx)
		<-collectors.Ready()

		ready()

		// wait for internal workers to stop
		wg.Wait()
		// signal vote collectors to stop
		cancel()
		// wait for it to stop
		collectors.Done()
		// ensure that error reporting is done
		errReportingDone.Wait()
	})

Copy link
Contributor Author

@peterargue peterargue Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the first example, an irrecoverable could be thrown during initialization, so we'd end up blocking waiting for the collectors to be ready and never handle the errors. I think we need to keep the error handling and blocking in separate goroutines.

The second example is OK. I'd rather avoid an additional waitgroup so maybe we could just break the blocking logic out into its own gorountine. How about the changes I pushed to the PR? Do they address your concern?

Copy link
Member

@durkmurder durkmurder Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, how it looks now it's great can you just add more doc to explain what is going on? A more broader explanation so it's easier to follow for anyone who hasn't seen this conversation.

@codecov-commenter
Copy link

codecov-commenter commented Mar 14, 2025

Codecov Report

Attention: Patch coverage is 70.58824% with 5 lines in your changes missing coverage. Please review.

Project coverage is 41.26%. Comparing base (4cabd39) to head (9222c10).

Files with missing lines Patch % Lines
...nsensus/hotstuff/voteaggregator/vote_aggregator.go 80.00% 2 Missing and 1 partial ⚠️
engine/testutil/mock/nodes.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7155      +/-   ##
==========================================
- Coverage   41.27%   41.26%   -0.02%     
==========================================
  Files        2170     2170              
  Lines      190047   190050       +3     
==========================================
- Hits        78438    78420      -18     
- Misses     105070   105086      +16     
- Partials     6539     6544       +5     
Flag Coverage Δ
unittests 41.26% <70.58%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@durkmurder durkmurder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, how it looks now it's great can you just add more doc to explain what is going on? A more broader explanation so it's easier to follow for anyone who hasn't seen this conversation.

@peterargue peterargue enabled auto-merge March 19, 2025 23:21
@peterargue peterargue added this pull request to the merge queue Mar 19, 2025
Merged via the queue into master with commit 82d7f1d Mar 19, 2025
56 checks passed
@peterargue peterargue deleted the peter/update-vote-agg-irrecoverable branch March 19, 2025 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants