-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove anti-pattern of ignoring irrecoverable error channel #7155
Conversation
if err := util.WaitError(errCh, ctx.Done()); err != nil { | ||
parentCtx.Throw(err) | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the main change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if instead of spawning a go routine we replace <-collectors.Ready
with:
if err := util.WaitError(errCh, collectors.Done()); err != nil {
parentCtx.Throw(err)
}
This way it will be easier to reason when the errors are delivered and of order of operations in general.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where would you put the call to cancel in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you thinking of switching it to something like this?
componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
ctx, cancel := context.WithCancel(context.Background())
signalerCtx, errCh := irrecoverable.WithSignaler(ctx)
collectors.Start(signalerCtx)
go func() {
<-collectors.Ready()
ready()
wg.Wait()
cancel()
}()
if err := util.WaitError(errCh, collectors.Done()); err != nil {
parentCtx.Throw(err)
}
})
I think it's OK to switch the logic, but in either case the WaitError
is sitting in a different goroutine than the logic that throws the errors (by design) so order of operation is up to the scheduler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this:
componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
// create new context which is not connected to parent
// we need to ensure that our internal workers stop before asking
// vote collectors to stop. We want to avoid delivering events to already stopped vote collectors
ctx, cancel := context.WithCancel(context.Background())
signalerCtx, errCh := irrecoverable.WithSignaler(ctx)
// start vote collectors
collectors.Start(signalerCtx)
<-collectors.Ready()
ready()
// wait for internal workers to stop
wg.Wait()
// signal vote collectors to stop
cancel()
// wait for it to stop
if err := util.WaitError(errCh, collectors.Done()); err != nil {
parentCtx.Throw(err)
}
})
Why? I think its hard to argue what will happen in case of shutdown to this code:
go func() {
if err := util.WaitError(errCh, ctx.Done()); err != nil {
parentCtx.Throw(err)
}
}()
Basically thinking about a case where the scheduler doesn't give an opportunity to the detached goroutine to report the error while the parent goroutine will perform the shutdown. I've tried to sequence this code but it's also not ideal since in my version we get guarantee that each error will be reported but it will get reported only on shutdown, there is no way to signal that an error has happened before the actual shutdown has been requested.
What if we add a wait group into the mix to provide a guarantee that error reporting goroutine exists before commencing shutdown of worker. This way we establish happes-before relation between shutdown and error reporting.
componentBuilder.AddWorker(func(parentCtx irrecoverable.SignalerContext, ready component.ReadyFunc) {
// create new context which is not connected to parent
// we need to ensure that our internal workers stop before asking
// vote collectors to stop. We want to avoid delivering events to already stopped vote collectors
ctx, cancel := context.WithCancel(context.Background())
signalerCtx, errCh := irrecoverable.WithSignaler(ctx)
// since we are breaking the connection between parentCtx and signalerCtx, we need to
// explicitly rethrow any errors from signalerCtx to parentCtx, otherwise they are dropped.
var errReportingDone sync.WaitGroup
errReportingDone.Add(1)
go func() {
defer errReportingDone.Done()
if err := util.WaitError(errCh, ctx.Done()); err != nil {
parentCtx.Throw(err)
}
}()
// start vote collectors
collectors.Start(signalerCtx)
<-collectors.Ready()
ready()
// wait for internal workers to stop
wg.Wait()
// signal vote collectors to stop
cancel()
// wait for it to stop
collectors.Done()
// ensure that error reporting is done
errReportingDone.Wait()
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the first example, an irrecoverable could be thrown during initialization, so we'd end up blocking waiting for the collectors to be ready and never handle the errors. I think we need to keep the error handling and blocking in separate goroutines.
The second example is OK. I'd rather avoid an additional waitgroup so maybe we could just break the blocking logic out into its own gorountine. How about the changes I pushed to the PR? Do they address your concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, how it looks now it's great can you just add more doc to explain what is going on? A more broader explanation so it's easier to follow for anyone who hasn't seen this conversation.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7155 +/- ##
==========================================
- Coverage 41.27% 41.26% -0.02%
==========================================
Files 2170 2170
Lines 190047 190050 +3
==========================================
- Hits 78438 78420 -18
- Misses 105070 105086 +16
- Partials 6539 6544 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, how it looks now it's great can you just add more doc to explain what is going on? A more broader explanation so it's easier to follow for anyone who hasn't seen this conversation.
This PR removes all cases where we ignore the error channel for an
irrecoverable
context. There was only one case in application code in the vote aggregator. The component that received the context never threw using it, so there was no risk of suppressing errors. However, it risks issues in the future if we extended the logic and makes it more likely for this pattern to be repeated elsewhere.All of the other changes are in tests. I updated them to use the mock instance that asserts no errors were thrown.