solver/llbsolver: emit native build-completion metrics #6736

Open
asbarron wants to merge 2 commits into moby:master from asbarron:asbarron/buildkit-build-metrics

@asbarron asbarron commented May 3, 2026

Threads the existing OTEL MeterProvider through llbsolver.Opt and emits three build-event instruments from the recordBuildHistory finalizer:

  • buildkit.builds (counter; labels: status, error_code)
  • buildkit.builds.steps (counter; labels: kind)
  • buildkit.build.duration (Base2 exponential histogram; labels: status)

The duration histogram uses an exponential aggregation, which the existing exporter renders as a Prometheus native histogram, to avoid the "tens of millions of series" cardinality blow-up reported in #5777.
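
To illustrate why an exponential aggregation keeps the bucket count small, here is a minimal sketch of the base-2 bucket mapping described in the OpenTelemetry metrics data model. It is not the SDK's implementation and not code from this PR; `bucketIndex` and `bucketsSpanned` are hypothetical helper names, and the sample durations are made up.

```go
package main

import (
	"fmt"
	"math"
)

// bucketIndex returns the base-2 exponential histogram bucket for a positive
// value at the given scale. Bucket i covers (base^i, base^(i+1)], where
// base = 2^(2^-scale). Hypothetical helper, for illustration only.
func bucketIndex(value float64, scale int) int {
	// log2(value) * 2^scale, rounded up, minus one (upper bound inclusive).
	return int(math.Ceil(math.Ldexp(math.Log2(value), scale))) - 1
}

// bucketsSpanned counts the buckets needed to cover every value in [lo, hi].
func bucketsSpanned(lo, hi float64, scale int) int {
	return bucketIndex(hi, scale) - bucketIndex(lo, scale) + 1
}

func main() {
	// Made-up build durations in seconds.
	for _, d := range []float64{0.1, 1.5, 60, 3600} {
		fmt.Printf("%gs -> bucket %d at scale 0\n", d, bucketIndex(d, 0))
	}
	// The bucket count grows with log(hi/lo), not with the number of
	// distinct durations observed.
	fmt.Println("buckets for 0.1s..1h at scale 3:", bucketsSpanned(0.1, 3600, 3))
}
```

The point of the sketch is that the number of buckets depends only on the scale and the ratio between the largest and smallest observation, so a wide range of build durations fits in a compact, sparse structure.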

MeterProvider is passed explicitly through the constructor; buildkit policy (from the #4957 review) prohibits relying on the OTel global provider in library packages.
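
The explicit-injection pattern can be sketched with stand-in types. Everything below is an assumption for illustration: `Counter` and `MeterProvider` are simplified stand-ins for the OTel interfaces, `Opt` only mirrors the idea of `llbsolver.Opt`, `finishBuild` plays the role of the finalizer, and the status values are invented. The in-memory provider exists only so the sketch runs without external modules.

```go
package main

import "fmt"

// Counter and MeterProvider are hypothetical, simplified stand-ins for the
// OTel metric.Int64Counter and metric.MeterProvider interfaces.
type Counter interface {
	Add(n int64, status string)
}

type MeterProvider interface {
	Counter(name string) Counter
}

// Opt carries the provider explicitly instead of reading a global.
type Opt struct {
	MeterProvider MeterProvider
}

type Solver struct {
	builds Counter
}

// NewSolver creates its instruments from the injected provider only.
func NewSolver(opt Opt) *Solver {
	return &Solver{builds: opt.MeterProvider.Counter("buildkit.builds")}
}

// finishBuild stands in for the build-history finalizer.
func (s *Solver) finishBuild(status string) {
	s.builds.Add(1, status)
}

// memCounter and memProvider are an in-memory implementation for the demo.
type memCounter struct{ byStatus map[string]int64 }

func (c *memCounter) Add(n int64, status string) { c.byStatus[status] += n }

type memProvider struct{ counters map[string]*memCounter }

func newMemProvider() *memProvider {
	return &memProvider{counters: map[string]*memCounter{}}
}

func (p *memProvider) Counter(name string) Counter {
	c, ok := p.counters[name]
	if !ok {
		c = &memCounter{byStatus: map[string]int64{}}
		p.counters[name] = c
	}
	return c
}

// total reads back a recorded value; zero if nothing was recorded.
func (p *memProvider) total(name, status string) int64 {
	c, ok := p.counters[name]
	if !ok {
		return 0
	}
	return c.byStatus[status]
}

func main() {
	p := newMemProvider()
	s := NewSolver(Opt{MeterProvider: p})
	s.finishBuild("completed")
	s.finishBuild("error")
	fmt.Println(p.total("buildkit.builds", "completed"), p.total("buildkit.builds", "error"))
}
```

Because the solver never touches package-level state, a test can hand it its own provider and read the recorded values back directly, which is one practical benefit of avoiding the global.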

error_code uses gRPC codes.Code.String() for a bounded set; rec.Error.Message is intentionally never used as a label.
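
A rough sketch of the bounded-label idea, using only the standard library. The name table copies the strings that gRPC's `codes.Code.String()` returns for the 17 defined codes; `BuildError` and `errorCodeLabel` are hypothetical, and both the "nil maps to OK" and "out-of-range maps to Unknown" choices are assumptions of this sketch rather than behavior confirmed by the PR.

```go
package main

import "fmt"

// codeNames lists the names gRPC's codes.Code.String() returns for codes 0..16.
var codeNames = [...]string{
	"OK", "Canceled", "Unknown", "InvalidArgument", "DeadlineExceeded",
	"NotFound", "AlreadyExists", "PermissionDenied", "ResourceExhausted",
	"FailedPrecondition", "Aborted", "OutOfRange", "Unimplemented",
	"Internal", "Unavailable", "DataLoss", "Unauthenticated",
}

// BuildError is a hypothetical stand-in for the error stored on a build record.
type BuildError struct {
	Code    int32
	Message string // free-form text; never used as a label value
}

// errorCodeLabel maps an error to one of a fixed set of label values.
// Assumption: a nil error is labeled "OK", and codes outside the defined
// range collapse to "Unknown" so the label set stays bounded.
func errorCodeLabel(e *BuildError) string {
	if e == nil {
		return codeNames[0]
	}
	if e.Code < 0 || int(e.Code) >= len(codeNames) {
		return "Unknown"
	}
	return codeNames[e.Code]
}

func main() {
	errs := []*BuildError{
		nil,
		{Code: 1, Message: "context canceled while exporting layer sha256:..."},
		{Code: 4, Message: "timeout after 10m0s"},
		{Code: 99, Message: "unexpected"},
	}
	for _, e := range errs {
		fmt.Println(errorCodeLabel(e))
	}
}
```

However varied the messages are, the label can take at most 17 values here, whereas using the message text would create a new series for nearly every distinct failure.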

The frontend label is intentionally omitted because client.Build clears req.Frontend on the wire, so the field is empty for every caller that goes through the gateway-client API (buildctl, buildx). The metric is forward-compatible with a future buildkit change that populates rec.Frontend on that path.

A follow-up PR will add observable gauges for worker count and cache state, and an operator guide at docs/metrics.md.

Refs #1544; addresses discussion #5777.

@asbarron asbarron marked this pull request as ready for review May 3, 2026 21:04
@tonistiigi tonistiigi requested a review from jsternberg May 4, 2026 17:38
@github-actions github-actions Bot added the area/dependencies Pull requests that update a dependency file label May 5, 2026
asbarron added 2 commits May 5, 2026 15:10
Threads the existing OTEL MeterProvider through llbsolver.Opt and emits
three build-event instruments from the recordBuildHistory finalizer:

  - buildkit.builds (counter; labels: status, error_code)
  - buildkit.builds.steps (counter; labels: kind)
  - buildkit.build.duration (Base2 exponential histogram; labels: status)

The duration histogram uses an exponential aggregation, rendered as a
Prometheus native histogram by the existing exporter, to avoid the
"tens of millions of series" cardinality blow-up reported in moby#5777.

MeterProvider is passed explicitly through the constructor — buildkit
policy (per the moby#4957 review) prohibits relying on the OTel global
provider in library packages.

error_code uses gRPC codes.Code.String() for a bounded set;
rec.Error.Message is intentionally never used as a label. The frontend
label is intentionally omitted — client.Build clears req.Frontend on
the wire, so the field is empty for every caller that goes through the
gateway-client API (buildctl, buildx). The metric is forward-compatible
with a future buildkit change that populates rec.Frontend on that path.

A follow-up PR will add observable gauges for worker count and cache
state, plus an operator guide at docs/metrics.md.

Refs moby#1544; addresses discussion moby#5777.

Signed-off-by: Ava Barron <abarron@coreweave.com>
@asbarron asbarron force-pushed the asbarron/buildkit-build-metrics branch from 06abd34 to c2a5c1d Compare May 5, 2026 19:10