Remove anti-pattern of ignoring irrecoverable error channel #7155

peterargue · 2025-03-14T03:24:17Z

This PR removes all cases where we ignore the error channel of an irrecoverable context. There was only one such case in application code, in the vote aggregator. The component that received the context never threw on it, so no errors were actually being suppressed. However, leaving the pattern in place risks dropped errors if the logic is extended in the future, and makes it more likely that the pattern is repeated elsewhere.

All of the other changes are in tests. I updated them to use the mock instance that asserts no errors were thrown.

peterargue · 2025-03-14T03:25:05Z

consensus/hotstuff/voteaggregator/vote_aggregator.go

+			if err := util.WaitError(errCh, ctx.Done()); err != nil {
+				parentCtx.Throw(err)
+			}
+		}()


this is the main change.

What if, instead of spawning a goroutine, we replace <-collectors.Ready() with:

    if err := util.WaitError(errCh, collectors.Done()); err != nil {
        parentCtx.Throw(err)
    }

This way it will be easier to reason about when the errors are delivered and about the order of operations in general.
Thoughts?

where would you put the call to cancel in that case?

are you thinking of switching it to something like this?

    componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
        ctx, cancel := context.WithCancel(context.Background())
        signalerCtx, errCh := irrecoverable.WithSignaler(ctx)

        collectors.Start(signalerCtx)

        go func() {
            <-collectors.Ready()
            ready()
            wg.Wait()
            cancel()
        }()

        if err := util.WaitError(errCh, collectors.Done()); err != nil {
            parentCtx.Throw(err)
        }
    })

I think it's OK to switch the logic, but in either case the WaitError call sits in a different goroutine than the logic that throws the errors (by design), so the order of operations is up to the scheduler.

I was thinking about this:

    componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
        // create new context which is not connected to parent
        // we need to ensure that our internal workers stop before asking
        // vote collectors to stop. We want to avoid delivering events to already stopped vote collectors
        ctx, cancel := context.WithCancel(context.Background())
        signalerCtx, errCh := irrecoverable.WithSignaler(ctx)

        // start vote collectors
        collectors.Start(signalerCtx)

        <-collectors.Ready()
        ready()

        // wait for internal workers to stop
        wg.Wait()
        // signal vote collectors to stop
        cancel()
        // wait for it to stop
        if err := util.WaitError(errCh, collectors.Done()); err != nil {
            parentCtx.Throw(err)
        }
    })

Why? I think it's hard to reason about what happens on shutdown with this code:

    go func() {
        if err := util.WaitError(errCh, ctx.Done()); err != nil {
            parentCtx.Throw(err)
        }
    }()

Basically, I'm thinking about a case where the scheduler doesn't give the detached goroutine an opportunity to report the error while the parent goroutine performs the shutdown. I've tried to sequence this code, but that's not ideal either: my version guarantees that every error is reported, but only on shutdown. There is no way to signal that an error has happened before shutdown has actually been requested.

What if we add a wait group into the mix to guarantee that the error-reporting goroutine has exited before the worker's shutdown completes? This way we establish a happens-before relation between shutdown and error reporting.

    componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
        // create new context which is not connected to parent
        // we need to ensure that our internal workers stop before asking
        // vote collectors to stop. We want to avoid delivering events to already stopped vote collectors
        ctx, cancel := context.WithCancel(context.Background())
        signalerCtx, errCh := irrecoverable.WithSignaler(ctx)

        // since we are breaking the connection between parentCtx and signalerCtx, we need to
        // explicitly rethrow any errors from signalerCtx to parentCtx, otherwise they are dropped.
        var errReportingDone sync.WaitGroup
        errReportingDone.Add(1)
        go func() {
            defer errReportingDone.Done()
            if err := util.WaitError(errCh, ctx.Done()); err != nil {
                parentCtx.Throw(err)
            }
        }()

        // start vote collectors
        collectors.Start(signalerCtx)

        <-collectors.Ready()
        ready()

        // wait for internal workers to stop
        wg.Wait()
        // signal vote collectors to stop
        cancel()
        // wait for it to stop
        <-collectors.Done()
        // ensure that error reporting is done
        errReportingDone.Wait()
    })

In the first example, an irrecoverable error could be thrown during initialization, so we'd end up blocked waiting for the collectors to be ready and would never handle the error. I think we need to keep the error handling and the blocking logic in separate goroutines.

The second example is OK. I'd rather avoid an additional wait group, so maybe we could just break the blocking logic out into its own goroutine. How about the changes I pushed to the PR? Do they address your concern?

Thanks, how it looks now is great. Can you just add more documentation to explain what is going on? A broader explanation would make it easier to follow for anyone who hasn't seen this conversation.

codecov-commenter · 2025-03-14T03:27:35Z

Codecov Report

Attention: Patch coverage is 70.58824% with 5 lines in your changes missing coverage. Please review.

Project coverage is 41.26%. Comparing base (4cabd39) to head (9222c10).

Files with missing lines	Patch %	Lines
...nsensus/hotstuff/voteaggregator/vote_aggregator.go	80.00%	2 Missing and 1 partial ⚠️
engine/testutil/mock/nodes.go	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7155      +/-   ##
==========================================
- Coverage   41.27%   41.26%   -0.02%     
==========================================
  Files        2170     2170              
  Lines      190047   190050       +3     
==========================================
- Hits        78438    78420      -18     
- Misses     105070   105086      +16     
- Partials     6539     6544       +5

Flag	Coverage Δ
unittests	`41.26% <70.58%> (-0.02%)`	⬇️

Remove anti-pattern of ignoring irrecoverable error channel

c0059c8

peterargue requested a review from a team as a code owner March 14, 2025 03:24

peterargue commented Mar 14, 2025

jordanschalm approved these changes Mar 14, 2025

peterargue added 3 commits March 18, 2025 12:19

switch to keep error handling in main gorountine

6bf6a6c

Merge branch 'master' into peter/update-vote-agg-irrecoverable

ba720f5

handle ready when parent context is canceled

7098e40

durkmurder approved these changes Mar 19, 2025

peterargue added 2 commits March 19, 2025 16:21

expand comments about error handling and concurrency

552d1e6

Merge branch 'master' into peter/update-vote-agg-irrecoverable

9222c10

peterargue enabled auto-merge March 19, 2025 23:21

peterargue added this pull request to the merge queue Mar 19, 2025

Merged via the queue into master with commit 82d7f1d Mar 19, 2025
56 checks passed

peterargue deleted the peter/update-vote-agg-irrecoverable branch March 19, 2025 23:55
