Add error_message field to bundle deploy telemetry#4793
Add error_message field to bundle deploy telemetry#4793shreyas-goenka wants to merge 1 commit intomainfrom
Conversation
c7ee5c8 to
cb59f1d
Compare
|
Commit: 1a8ff40
18 interesting tests: 9 SKIP, 7 KNOWN, 2 flaky
Top 20 slowest tests (at least 2 minutes):
|
Track the first error diagnostic summary encountered during bundle deploy in telemetry. Move telemetry logging into a defer so it's always captured, even when deploy fails. Co-authored-by: Isaac
cb59f1d to
1a8ff40
Compare
Suggested reviewersBased on git history of the changed files, these people are best suited to review:
Confidence: high Eligible reviewersBased on CODEOWNERS, these people or teams could also review: @andrewnester, @anton-107, @pietern, @simonfaltum Suggestions based on git history of 5 changed files (5 scored). See CODEOWNERS for path-specific ownership rules. |
simonfaltum
left a comment
There was a problem hiding this comment.
Review Swarm Summary (2 independent reviewers + cross-review)
Verdict: REQUEST CHANGES - The defer-based approach is correct in principle, but has 3 issues that need attention before merge.
- [IMPORTANT] Potential panic on auth/config failures - The deferred
LogDeployTelemetry()runs wheneverb != nil, even beforecmdctx.SetConfigUsed()succeeds. On auth/config failures, telemetry upload will callcmdctx.ConfigUsed(ctx)and panic. - [IMPORTANT] Double deploy events on failure - Failed deploys will emit two
BundleDeployEvents: one from the new defer and one from the root-level fallback incmd/root/root.go:185. - [IMPORTANT] PII risk in error messages -
ErrorMessageis sent verbatim with no sanitization or size bound. Many deploy errors interpolate local filesystem paths, resource names, and workspace URLs.
| // Log deploy telemetry on all exit paths. This is a defer to ensure | ||
| // telemetry is logged even when the deploy command fails, for both | ||
| // diagnostic errors and regular Go errors. | ||
| if opts.Deploy { | ||
| defer func() { | ||
| if b == nil { | ||
| return | ||
| } | ||
| errMsg := logdiag.GetFirstErrorSummary(ctx) | ||
| if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) { | ||
| errMsg = retErr.Error() | ||
| } | ||
| phases.LogDeployTelemetry(ctx, b, errMsg) | ||
| }() |
There was a problem hiding this comment.
[IMPORTANT] This defer runs whenever b != nil, but cmdctx.SetConfigUsed() may not have been called yet (e.g., if configureBundle fails on auth/profile errors before reaching SetConfigUsed in cmd/root/bundle.go:187). When telemetry upload later calls cmdctx.ConfigUsed(ctx) in libs/telemetry/logger.go, it will panic.
Fix: guard with if !cmdctx.HasConfigUsed(ctx) { return } inside the defer, or move the defer setup to after SetConfigUsed() has succeeded.
| // Log deploy telemetry on all exit paths. This is a defer to ensure | ||
| // telemetry is logged even when the deploy command fails, for both | ||
| // diagnostic errors and regular Go errors. | ||
| if opts.Deploy { | ||
| defer func() { | ||
| if b == nil { | ||
| return | ||
| } | ||
| errMsg := logdiag.GetFirstErrorSummary(ctx) | ||
| if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) { | ||
| errMsg = retErr.Error() | ||
| } | ||
| phases.LogDeployTelemetry(ctx, b, errMsg) | ||
| }() |
There was a problem hiding this comment.
[IMPORTANT] This defer will now log a full BundleDeployEvent on deploy failure. But cmd/root/root.go:182-189 still appends a legacy empty BundleDeployEvent on every nonzero bundle_deploy exit. Result: two deploy events per failed deploy, which will skew failure counts and error-rate dashboards.
Fix: remove the root-level failure fallback, or gate it on "no deploy event was already logged".
| BundleDeployEvent: &protos.BundleDeployEvent{ | ||
| BundleUuid: bundleUuid, | ||
| DeploymentId: b.Metrics.DeploymentId.String(), | ||
| ErrorMessage: errMsg, |
There was a problem hiding this comment.
[IMPORTANT] ErrorMessage is sent verbatim from logdiag.GetFirstErrorSummary() / retErr.Error(). Many deploy errors interpolate local filesystem paths or user-controlled config values (e.g., from statemgmt/state_pull.go, config/mutator/translate_paths.go). This starts shipping raw PII/workspace details to telemetry with no sanitization or size bound.
Fix: emit a sanitized error code/category, or at least scrub paths and cap length (e.g., 500 chars).
| if b == nil { | ||
| return | ||
| } | ||
| errMsg := logdiag.GetFirstErrorSummary(ctx) | ||
| if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) { | ||
| errMsg = retErr.Error() |
There was a problem hiding this comment.
[SUGGESTION] No new unit tests for the defer + error capture logic. The core behavior change (telemetry always fires, error message captured from logdiag or retErr) should have test coverage. Consider testing:
- Telemetry fires on deploy failure with error message
ErrAlreadyPrintederrors fall through toGetFirstErrorSummary- Successful deploy passes empty error message
Changes
Adds support for logging error messages encountered during
bundle deployin telemetry. This gives the developer ecosystem team visibility into user-facing errors.What changed:
libs/telemetry/protos/bundle_deploy.go: AddedErrorMessagefield toBundleDeployEventstructlibs/logdiag/logdiag.go: AddedFirstErrorSummaryfield toLogDiagDatato capture the summary of the first error diagnostic. AddedGetFirstErrorSummary()getterbundle/phases/deploy.go: MovedlogDeployTelemetryinto adeferso telemetry is always logged, even when deploy failsbundle/phases/telemetry.go: PopulatesErrorMessagefromlogdiag.GetFirstErrorSummary(ctx)