trace: treat OTel export failures as non-fatal #6730

Open
ndeepak-baseten wants to merge 1 commit into moby:master from basetenlabs:ndeepak/fix-otel-shutdown-stall-upstream

Conversation


ndeepak-baseten commented Apr 28, 2026

Addresses #6747.

Three call sites pass context.TODO() (unbounded) into OTel operations that block on a dead collector:

  • cmd/buildctl/common/trace.go: tp.Shutdown in app.After
  • cmd/buildkitd/main.go: tracer/meter provider closers in app.After
  • control/control.go: Controller.Export synchronous span forward

Cap each with a 5-second deadline (context.WithTimeoutCause). Errors are non-fatal: discarded in shutdown paths, logged at Debug in Controller.Export. Controller.Export also stops returning Unavailable when no downstream collector is configured, since clients that always emit traces would otherwise see a spurious gRPC error.

Regression test in util/tracing/detect/shutdown_test.go reproduces the unbounded-context stall and verifies a bounded context returns promptly.

When OTEL_EXPORTER_OTLP_ENDPOINT is configured but the collector is
unreachable, buildctl and buildkitd stall for up to 30 seconds (the
SDK batch span processor's default ExportTimeout) during shutdown,
and Controller.Export blocks the gRPC handler for the same duration
on every client trace forward.

Three call sites pass context.TODO() (unbounded) into operations that
block on a dead OTLP endpoint:

  - cmd/buildctl/common/trace.go: tp.Shutdown in app.After
  - cmd/buildkitd/main.go: closer loop (TracerProvider, MeterProvider)
    in app.After
  - control/control.go: Controller.Export's synchronous ExportSpans

Cap each path with a 5-second deadline (context.WithTimeoutCause) and
treat errors as non-fatal: discard them in the shutdown paths, log at
Debug in Controller.Export. Trace export is best-effort, and a missing
or slow collector should never block shutdown or fail a build. The
gRPC client itself stops returning Unavailable when the collector is
absent, since clients that always emit traces would otherwise see a
spurious error.

Add a regression test in util/tracing/detect that reproduces the stall
with context.TODO() and verifies a bounded context returns promptly.

Fixes moby#4616

Signed-off-by: Deepak Nagaraj <deepak.nagaraj@baseten.co>
Made-with: Cursor
@jsternberg
Collaborator

Hi @ndeepak-baseten,

Thanks for the PR, but I'm not sure this is the correct solution. Before we get into implementation or fixing, I'd prefer if you created a bug report with the instructions you use to reproduce the issue.

This reminds me of one issue I attempted to fix but dropped because I didn't have an easy way to test it and I wasn't sure how much of an issue this still was here: #5912.

I also noticed that this PR seems to be largely AI-generated, with the pull request description also being AI-generated. AI tends to create very verbose code changes, and they are a bit hard to verify from a reviewer's standpoint. In particular, the PR description is very wordy, and it took me a decently long time to read and understand it. Please filter the outputs or write the description yourself in the future. It's much more respectful of my time if I can concentrate on the actual pull request and the reasoning behind it, rather than on a list of testing methodology or a note that the DCO is signed. That last one is something the pull request status already tells me.

@ndeepak-baseten
Author

@jsternberg Thanks for the feedback. Based on your comment, I filed #6747 with a self-contained Docker repro and trimmed the PR description. Let me know if there's anything else you'd like changed!

@ndeepak-baseten
Author

@jsternberg One thing I missed in my earlier comment, re #5912: I see your closed PR took an async BatchSpanProcessor approach for Controller.Export, which arguably handles that path more cleanly than the deadline cap here. Meanwhile, the new repro should also give you the testbed you mentioned was missing. Happy to drop the Controller.Export portion of this PR in favor of reviving #5912 if you'd prefer.
