
Add error_message field to bundle deploy telemetry #4793

Open
shreyas-goenka wants to merge 1 commit into main from bundle-deploy-error-message

shreyas-goenka commented Mar 19, 2026

Changes

Adds support for logging error messages encountered during bundle deploy in telemetry. This gives the developer ecosystem team visibility into user-facing errors.

What changed:

  • libs/telemetry/protos/bundle_deploy.go: Added ErrorMessage field to BundleDeployEvent struct
  • libs/logdiag/logdiag.go: Added FirstErrorSummary field to LogDiagData to capture the summary of the first error diagnostic. Added GetFirstErrorSummary() getter
  • bundle/phases/deploy.go: Moved logDeployTelemetry into a defer so telemetry is always logged, even when deploy fails
  • bundle/phases/telemetry.go: Populates ErrorMessage from logdiag.GetFirstErrorSummary(ctx)

shreyas-goenka force-pushed the bundle-deploy-error-message branch from c7ee5c8 to cb59f1d March 19, 2026 13:01

eng-dev-ecosystem-bot commented Mar 19, 2026

Commit: 1a8ff40

Run: 23296963737

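| Env | 🟨 KNOWN | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|
| 🟨 aws linux | 7 | | 1 | 9 | 268 | 797 | 7:24 |
| 🟨 aws windows | 7 | | 1 | 9 | 270 | 795 | 7:00 |
| 🔄 aws-ucws linux | | 2 | 7 | 9 | 364 | 712 | 7:54 |
| 🔄 aws-ucws windows | | 2 | 7 | 9 | 366 | 710 | 5:47 |
| 💚 azure linux | | | 2 | 11 | 271 | 795 | 6:00 |
| 💚 azure windows | | | 2 | 11 | 273 | 793 | 4:16 |
| 🔄 azure-ucws linux | | 2 | 1 | 11 | 369 | 708 | 7:55 |
| 🔄 azure-ucws windows | | 2 | 1 | 11 | 371 | 706 | 6:57 |
| 💚 gcp linux | | | 2 | 11 | 267 | 798 | 5:51 |
| 💚 gcp windows | | | 2 | 11 | 269 | 796 | 4:26 |
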
18 interesting tests: 9 SKIP, 7 KNOWN, 2 flaky

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨K | 🟨K | 💚R | 🔄f | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R |
| 🙈 TestAccept/bundle/resources/permissions | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨K | 🟨K | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨K | 🟨K | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🔄 TestAccept/ssh/connect-serverless-gpu | 🙈s | 🙈s | 🔄f | 🔄f | 🙈s | 🙈s | 🔄f | 🔄f | 🙈s | 🙈s |
| 🔄 TestAccept/ssh/connection | 💚R | 💚R | 🔄f | 💚R | 💚R | 💚R | 🔄f | 🔄f | 💚R | 💚R |

Top 20 slowest tests (at least 2 minutes):

| Duration | Env | Test name |
|---|---|---|
| 3:34 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:13 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:13 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:08 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:58 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:56 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:53 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:44 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:44 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:41 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:41 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:39 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:36 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:33 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:18 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:15 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:12 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:09 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:08 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:04 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |

Track the first error diagnostic summary encountered during bundle deploy
in telemetry. Move telemetry logging into a defer so it's always captured,
even when deploy fails.

Co-authored-by: Isaac
shreyas-goenka force-pushed the bundle-deploy-error-message branch from cb59f1d to 1a8ff40 March 19, 2026 13:21
shreyas-goenka marked this pull request as ready for review March 19, 2026 13:24
@github-actions
Suggested reviewers

Based on git history of the changed files, these people are best suited to review:

  • @denik -- recent work in bundle/phases/, cmd/bundle/utils/, libs/logdiag/

Confidence: high

Eligible reviewers

Based on CODEOWNERS, these people or teams could also review:

@andrewnester, @anton-107, @pietern, @simonfaltum

Suggestions based on git history of 5 changed files (5 scored). See CODEOWNERS for path-specific ownership rules.

simonfaltum left a comment

Review Swarm Summary (2 independent reviewers + cross-review)

Verdict: REQUEST CHANGES - The defer-based approach is correct in principle, but has 3 issues that need attention before merge.

  1. [IMPORTANT] Potential panic on auth/config failures - The deferred LogDeployTelemetry() runs whenever b != nil, even before cmdctx.SetConfigUsed() succeeds. On auth/config failures, telemetry upload will call cmdctx.ConfigUsed(ctx) and panic.
  2. [IMPORTANT] Double deploy events on failure - Failed deploys will emit two BundleDeployEvents: one from the new defer and one from the root-level fallback in cmd/root/root.go:185.
  3. [IMPORTANT] PII risk in error messages - ErrorMessage is sent verbatim with no sanitization or size bound. Many deploy errors interpolate local filesystem paths, resource names, and workspace URLs.

Comment on lines +98 to +111
// Log deploy telemetry on all exit paths. This is a defer to ensure
// telemetry is logged even when the deploy command fails, for both
// diagnostic errors and regular Go errors.
if opts.Deploy {
    defer func() {
        if b == nil {
            return
        }
        errMsg := logdiag.GetFirstErrorSummary(ctx)
        if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) {
            errMsg = retErr.Error()
        }
        phases.LogDeployTelemetry(ctx, b, errMsg)
    }()

[IMPORTANT] This defer runs whenever b != nil, but cmdctx.SetConfigUsed() may not have been called yet (e.g., if configureBundle fails on auth/profile errors before reaching SetConfigUsed in cmd/root/bundle.go:187). When telemetry upload later calls cmdctx.ConfigUsed(ctx) in libs/telemetry/logger.go, it will panic.

Fix: guard with if !cmdctx.HasConfigUsed(ctx) { return } inside the defer, or move the defer setup to after SetConfigUsed() has succeeded.

Comment on lines +98 to +111
// Log deploy telemetry on all exit paths. This is a defer to ensure
// telemetry is logged even when the deploy command fails, for both
// diagnostic errors and regular Go errors.
if opts.Deploy {
    defer func() {
        if b == nil {
            return
        }
        errMsg := logdiag.GetFirstErrorSummary(ctx)
        if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) {
            errMsg = retErr.Error()
        }
        phases.LogDeployTelemetry(ctx, b, errMsg)
    }()

[IMPORTANT] This defer will now log a full BundleDeployEvent on deploy failure. But cmd/root/root.go:182-189 still appends a legacy empty BundleDeployEvent on every nonzero bundle_deploy exit. Result: two deploy events per failed deploy, which will skew failure counts and error-rate dashboards.

Fix: remove the root-level failure fallback, or gate it on "no deploy event was already logged".
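
One possible shape for the second option (gating the fallback) is sketched below. The `telemetryLog` type and its methods are hypothetical and much simpler than the real telemetry logger; the sketch only shows the "already logged" flag.

```go
package main

import "fmt"

// event is a hypothetical, simplified deploy telemetry event.
type event struct {
	ErrorMessage string
}

// telemetryLog collects events and remembers whether a deploy event
// was already recorded for this command invocation.
type telemetryLog struct {
	events       []event
	deployLogged bool
}

// LogDeploy records the full deploy event emitted by the deploy path.
func (t *telemetryLog) LogDeploy(e event) {
	t.events = append(t.events, e)
	t.deployLogged = true
}

// LogFallback records the legacy empty event only if the deploy path
// did not already log one, so a failed deploy yields a single event.
func (t *telemetryLog) LogFallback() {
	if t.deployLogged {
		return
	}
	t.events = append(t.events, event{})
}

func main() {
	// Deploy failed after the bundle was loaded: the defer logged an event.
	a := &telemetryLog{}
	a.LogDeploy(event{ErrorMessage: "deploy failed"})
	a.LogFallback()
	fmt.Println(len(a.events)) // 1

	// Deploy failed before any event was logged: the fallback still fires.
	b := &telemetryLog{}
	b.LogFallback()
	fmt.Println(len(b.events)) // 1
}
```

Keeping the fallback behind a flag preserves coverage for failures that happen before the bundle is available, which removing the fallback outright would lose.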

BundleDeployEvent: &protos.BundleDeployEvent{
    BundleUuid:   bundleUuid,
    DeploymentId: b.Metrics.DeploymentId.String(),
    ErrorMessage: errMsg,

[IMPORTANT] ErrorMessage is sent verbatim from logdiag.GetFirstErrorSummary() / retErr.Error(). Many deploy errors interpolate local filesystem paths or user-controlled config values (e.g., from statemgmt/state_pull.go, config/mutator/translate_paths.go). This starts shipping raw PII/workspace details to telemetry with no sanitization or size bound.

Fix: emit a sanitized error code/category, or at least scrub paths and cap length (e.g., 500 chars).

Comment on lines +103 to +108
if b == nil {
    return
}
errMsg := logdiag.GetFirstErrorSummary(ctx)
if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) {
    errMsg = retErr.Error()

[SUGGESTION] No new unit tests for the defer + error capture logic. The core behavior change (telemetry always fires, error message captured from logdiag or retErr) should have test coverage. Consider testing:

  • Telemetry fires on deploy failure with error message
  • ErrAlreadyPrinted errors fall through to GetFirstErrorSummary
  • Successful deploy passes empty error message


3 participants