trace: treat OTel export failures as non-fatal #6730

Open
ndeepak-baseten wants to merge 1 commit into moby:master from basetenlabs:ndeepak/fix-otel-shutdown-stall-upstream

Conversation


ndeepak-baseten commented Apr 28, 2026

Addresses #6747.

Three call sites pass context.TODO() (unbounded) into OTel operations that block on a dead collector:

  • cmd/buildctl/common/trace.go: tp.Shutdown in app.After
  • cmd/buildkitd/main.go: tracer/meter provider closers in app.After
  • control/control.go: Controller.Export synchronous span forward

Cap each with a 5-second deadline (context.WithTimeoutCause). Errors are non-fatal: discarded in shutdown paths, logged at Debug in Controller.Export. Controller.Export also stops returning Unavailable when no downstream collector is configured, since clients that always emit traces would otherwise see a spurious gRPC error.

Regression test in util/tracing/detect/shutdown_test.go reproduces the unbounded-context stall and verifies a bounded context returns promptly.

When OTEL_EXPORTER_OTLP_ENDPOINT is configured but the collector is
unreachable, buildctl and buildkitd stall for up to 30 seconds (the
SDK batch span processor's default ExportTimeout) during shutdown,
and Controller.Export blocks the gRPC handler for the same duration
on every client trace forward.

Three call sites pass context.TODO() (unbounded) into operations that
block on a dead OTLP endpoint:

  - cmd/buildctl/common/trace.go: tp.Shutdown in app.After
  - cmd/buildkitd/main.go: closer loop (TracerProvider, MeterProvider)
    in app.After
  - control/control.go: Controller.Export's synchronous ExportSpans

Cap each path with a 5-second deadline (context.WithTimeoutCause) and
treat errors as non-fatal: discard them in the shutdown paths, log at
Debug in Controller.Export. Trace export is best-effort, and a missing
or slow collector should never block shutdown or fail a build. The
gRPC client itself stops returning Unavailable when the collector is
absent, since clients that always emit traces would otherwise see a
spurious error.

Add a regression test in util/tracing/detect that reproduces the stall
with context.TODO() and verifies a bounded context returns promptly.

Fixes moby#4616

Signed-off-by: Deepak Nagaraj <deepak.nagaraj@baseten.co>
Made-with: Cursor
@jsternberg
Collaborator

Hi @ndeepak-baseten,

Thanks for the PR, but I'm not sure this is the correct solution. Before we get into implementation or fixing, I'd prefer if you created a bug report with the instructions you use to reproduce the issue.

This reminds me of one issue I attempted to fix but dropped because I didn't have an easy way to test it and I wasn't sure how much of an issue this still was here: #5912.

I also noticed that this PR seems to be largely AI-generated, with the pull request description also being AI-generated. AI tends to create very verbose code changes, and they are a bit hard to verify from a reviewer's standpoint. In particular, the PR description is very wordy, and it took me a decently long time to read and understand it. Please filter the outputs or write the description yourself in the future. It's much more respectful of my time if I can concentrate on the actual pull request and the reasoning behind it, rather than on a list of testing methodology or a note that the DCO is signed. That last one is something the pull request status already tells me.

@ndeepak-baseten
Author

@jsternberg Thanks for the feedback. Based on your comment, I filed #6747 with a self-contained Docker repro and trimmed the PR description. Let me know if there's anything else you'd like changed!

@ndeepak-baseten
Author

@jsternberg One thing I missed in my earlier comment, re #5912: I see your closed PR took an async BatchSpanProcessor approach for Controller.Export, which arguably handles that path more cleanly than the deadline cap here. Meanwhile, the new repro should also give you the testbed you mentioned was missing. Happy to drop the Controller.Export portion of this PR in favor of reviving #5912 if you'd prefer.
