trace: treat OTel export failures as non-fatal #6730
ndeepak-baseten wants to merge 1 commit into moby:master
Conversation
When OTEL_EXPORTER_OTLP_ENDPOINT is configured but the collector is
unreachable, buildctl and buildkitd stall for up to 30 seconds (the
SDK batch span processor's default ExportTimeout) during shutdown,
and Controller.Export blocks the gRPC handler for the same duration
on every client trace forward.
Three call sites pass context.TODO() (unbounded) into operations that
block on a dead OTLP endpoint:
- cmd/buildctl/common/trace.go: tp.Shutdown in app.After
- cmd/buildkitd/main.go: closer loop (TracerProvider, MeterProvider)
in app.After
- control/control.go: Controller.Export's synchronous ExportSpans
Cap each path with a 5-second deadline (context.WithTimeoutCause) and
treat errors as non-fatal: discard them in the shutdown paths, log at
Debug in Controller.Export. Trace export is best-effort, and a missing
or slow collector should never block shutdown or fail a build.
Controller.Export also stops returning Unavailable when no collector is
configured, since clients that always emit traces would otherwise see a
spurious error.
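For illustration, a minimal sketch of the shutdown-path shape this describes, assuming a `*sdktrace.TracerProvider`; the function name and cause message are illustrative, not the exact buildctl/buildkitd code:

```go
package main

import (
	"context"
	"errors"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// shutdownTracer flushes pending spans with a hard 5s cap, so an
// unreachable collector cannot stall process exit for the SDK's
// default 30s ExportTimeout.
func shutdownTracer(tp *sdktrace.TracerProvider) {
	ctx, cancel := context.WithTimeoutCause(context.Background(), 5*time.Second,
		errors.New("trace export timed out: collector unreachable?"))
	defer cancel()

	// Best-effort: discard the error rather than failing shutdown.
	_ = tp.Shutdown(ctx)
}
```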
Add a regression test in util/tracing/detect that reproduces the stall
with context.TODO() and verifies a bounded context returns promptly.
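A hedged sketch of that regression-test idea (the actual shutdown_test.go may differ): a span exporter that blocks until its context is cancelled stands in for the dead collector, and the assertion is that a bounded context makes Shutdown return promptly:

```go
package detect_test

import (
	"context"
	"testing"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// blockingExporter hangs until the context is done, mimicking an
// unreachable OTLP endpoint.
type blockingExporter struct{}

func (blockingExporter) ExportSpans(ctx context.Context, _ []sdktrace.ReadOnlySpan) error {
	<-ctx.Done()
	return ctx.Err()
}

func (blockingExporter) Shutdown(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func TestBoundedShutdownReturnsPromptly(t *testing.T) {
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(blockingExporter{}))
	_, span := tp.Tracer("test").Start(context.Background(), "op")
	span.End()

	// With context.TODO() this would hang for the full ExportTimeout;
	// a bounded context returns as soon as the deadline fires.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	start := time.Now()
	_ = tp.Shutdown(ctx)
	if elapsed := time.Since(start); elapsed > 2*time.Second {
		t.Fatalf("Shutdown blocked %v despite bounded context", elapsed)
	}
}
```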
Fixes moby#4616
Signed-off-by: Deepak Nagaraj <deepak.nagaraj@baseten.co>
Made-with: Cursor
Hi @ndeepak-baseten, thanks for the PR, but I'm not sure this is the correct solution. Before we get into implementation or fixing, I'd prefer if you were able to file a bug with instructions you can use to reproduce the issue. This reminds me of one issue I attempted to fix but dropped because I didn't have an easy way to test it and I wasn't sure how much of an issue this still was: #5912.

I also noticed that this PR seems to be largely AI generated, with the pull request description also being AI generated. AI tends to create very verbose code changes, and they are a bit hard to verify from a reviewer's standpoint. In particular, the PR description is very wordy, and it took me a decently long time to read and understand it. Please try to filter the outputs or write the description yourself in the future. It's much more courteous to my time if I'm able to concentrate on the actual pull request and the reasoning behind it, rather than on a list of testing methodology or a note that the DCO is signed; that last one is something the pull request status already tells me.
@jsternberg Thanks for the feedback. Based on your comment, I filed #6747 with a self-contained docker repro and trimmed the PR description. Let me know if there's anything else you'd like changed!
@jsternberg Missed this bit in my earlier comment: Re #5912: I see your closed PR took an async BatchSpanProcessor approach for Controller.Export, which arguably handles that path more cleanly than the deadline cap here. Meanwhile, the new repro should also give you the testbed you mentioned was missing. Happy to drop the Controller.Export portion of this PR in favor of reviving #5912 if you'd prefer.
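For context, a rough sketch of that async shape, assuming the downstream collector is an `sdktrace.SpanExporter`; the exact approach in #5912 may differ:

```go
package main

import (
	"context"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newAsyncForwarder wraps the downstream collector in a BatchSpanProcessor:
// Controller.Export would enqueue spans via OnEnd and return immediately,
// leaving a dead collector to delay only the background export goroutine.
func newAsyncForwarder(collector sdktrace.SpanExporter) sdktrace.SpanProcessor {
	return sdktrace.NewBatchSpanProcessor(collector)
}

// flushOnShutdown drains the queue with a bounded context so daemon
// shutdown still cannot stall on an unreachable endpoint.
func flushOnShutdown(bsp sdktrace.SpanProcessor) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = bsp.Shutdown(ctx)
}
```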
Addresses #6747.
Three call sites pass `context.TODO()` (unbounded) into OTel operations that block on a dead collector:
- `cmd/buildctl/common/trace.go`: `tp.Shutdown` in `app.After`
- `cmd/buildkitd/main.go`: tracer/meter provider closers in `app.After`
- `control/control.go`: `Controller.Export` synchronous span forward

Cap each with a 5-second deadline (`context.WithTimeoutCause`). Errors are non-fatal: discarded in shutdown paths, logged at `Debug` in `Controller.Export`. `Controller.Export` also stops returning `Unavailable` when no downstream collector is configured, since clients that always emit traces would otherwise see a spurious gRPC error (see the sketch below).

Regression test in `util/tracing/detect/shutdown_test.go` reproduces the unbounded-context stall and verifies a bounded context returns promptly.
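A minimal sketch of the `Controller.Export` shape this describes; the `Controller` stub, `spansFromRequest`, and `logDebug` are hypothetical stand-ins for BuildKit's real controller type, proto-to-SDK span conversion, and bklog debug logging:

```go
package main

import (
	"context"
	"errors"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	tracev1 "go.opentelemetry.io/proto/otlp/collector/trace/v1"
)

// Controller is a stand-in; only the field this sketch needs is shown.
type Controller struct {
	TraceCollector sdktrace.SpanExporter // nil when no collector is configured
}

func (c *Controller) Export(ctx context.Context, req *tracev1.ExportTraceServiceRequest) (*tracev1.ExportTraceServiceResponse, error) {
	if c.TraceCollector == nil {
		// Previously a gRPC Unavailable error; now a no-op, so clients
		// that always emit traces don't see a spurious failure.
		return &tracev1.ExportTraceServiceResponse{}, nil
	}

	// Cap the synchronous forward: a dead collector must not hold the
	// gRPC handler for the SDK's full default export timeout.
	ctx, cancel := context.WithTimeoutCause(ctx, 5*time.Second,
		errors.New("trace collector export timed out"))
	defer cancel()

	if err := c.TraceCollector.ExportSpans(ctx, spansFromRequest(req)); err != nil {
		logDebug(ctx, err) // best-effort: log at Debug, never fail the build
	}
	return &tracev1.ExportTraceServiceResponse{}, nil
}

// Hypothetical helpers standing in for BuildKit's span transform and logger.
func spansFromRequest(*tracev1.ExportTraceServiceRequest) []sdktrace.ReadOnlySpan { return nil }
func logDebug(context.Context, error)                                             {}
```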