fix: Helm doesn't wait for Custom Resources on new releases using `upgrade...#32268

Open
pratheeknathani wants to merge 1 commit into
helm:mainfrom
pratheeknathani:fix-32066

Conversation

@pratheeknathani

Summary

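The behavior difference between helm install --wait and helm upgrade --install --wait is not caused by wait options being dropped on the way into the install code. It comes from a race in the kstatus-based status watcher when a brand new custom resource is created; this change makes Helm keep waiting until the controller has reconciled such a resource.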

Root cause

The behavior difference between helm install --wait and helm upgrade --install --wait
is not caused by wait options being dropped on the way into the install code. I verified
that pkg/cmd/upgrade.go copies WaitStrategy, WaitForJobs and Timeout onto the
install client, and both commands funnel through the same runInstall and install action,
so the wait configuration is identical.

The real cause is a race in the kstatus-based status watcher (pkg/kube/statuswait.go)
when a brand new custom resource is created:

  1. Helm creates the custom resource (for example a Knative serving.knative.dev/v1
    Service) for a release that does not exist yet.
  2. Helm immediately starts watching it. The informer's initial list returns the resource
    with an empty .status because the controller has not reconciled it yet.
  3. kstatus (status.Compute in github.com/fluxcd/cli-utils) has no type-specific rule
    for the custom kind, finds no status.observedGeneration mismatch and no conditions,
    and falls back to "the absence of any known conditions means the resource is current",
    returning Current.
  4. Helm's observer sees the aggregate status reach Current and stops waiting, reporting
    success before the resource is actually ready.
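The fallback in step 3 can be illustrated with a heavily simplified model. This is a sketch only, not the real status.Compute: the function name computeGeneric, the plain-map object model, and the use of a Ready condition as the only condition check are assumptions made for illustration.

```go
package main

import "fmt"

// Status is a simplified stand-in for the kstatus result values.
type Status string

const (
	InProgress Status = "InProgress"
	Current    Status = "Current"
)

// computeGeneric is a hypothetical model of the generic fallback described
// above. It is NOT the real kstatus implementation.
func computeGeneric(obj map[string]any) Status {
	meta, _ := obj["metadata"].(map[string]any)
	status, _ := obj["status"].(map[string]any)

	// A controller reporting an older observedGeneration is still catching up.
	gen, hasGen := meta["generation"].(int64)
	observed, hasObserved := status["observedGeneration"].(int64)
	if hasGen && hasObserved && observed != gen {
		return InProgress
	}

	// Assumption: a Ready condition that is not "True" means work is in flight.
	conditions, _ := status["conditions"].([]any)
	for _, c := range conditions {
		cond, _ := c.(map[string]any)
		if cond["type"] == "Ready" && cond["status"] != "True" {
			return InProgress
		}
	}

	// No known signal at all: the resource is assumed to be current.
	return Current
}

func main() {
	// A freshly created custom resource: generation set, no .status yet.
	fresh := map[string]any{
		"metadata": map[string]any{"generation": int64(1)},
	}
	fmt.Println(computeGeneric(fresh)) // prints "Current"
}
```

With no .status at all, neither check fires, so the model returns Current, which is the window in which the wait ends early.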

This matches the symptom in the issue (a CR briefly shown as Unknown, then Helm reporting
"all resources achieved desired status" before the underlying Pods are ready). It is
timing-dependent, which is why it reproduces reliably in low-latency environments (such as
CI close to the API server) and on first installs where the CR has no prior .status. When
the release already exists, the CR already has a populated .status from a previous
reconciliation, so kstatus correctly keeps waiting and the bug does not appear.

What changed

pkg/kube/statuswait.go

  • Added customResourceStatusPending, which detects a freshly created custom resource whose
    controller has not yet observed it. It returns true only when there is positive evidence
    that a controller is expected to populate status, to keep the change safe:
      • the kind is not one of the built-in types kstatus already understands
        (status.GetLegacyConditionsFn returns nil), so Services, Deployments, Jobs, Pods,
        StatefulSets, CRDs, etc. are never affected;
      • metadata.generation is set. The API server only maintains metadata.generation for
        resources whose CRD enables the status subresource, which implies a controller is
        expected to report status. Config-style CRDs without a status subresource never get
        a generation, so they are skipped and never block;
      • the controller has not yet reported status.observedGeneration or any
        status.conditions. As soon as either is present we defer to the normal kstatus
        computation, so resources that legitimately omit one of these fields are not blocked
        forever.
  • In statusObserver, when waiting for Current, a resource that is reported Current but
    is still a pending fresh custom resource is treated as InProgress for aggregation, so
    Helm keeps waiting until the controller reconciles it.
  • In wait, the timeout error reporting uses the same adjustment, so a stuck custom resource
    is reported as "not ready. status: InProgress" instead of being silently treated as ready.
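The three conditions and the aggregation adjustment listed above can be sketched roughly as follows. This is not the code from the PR: objects are modeled as plain maps rather than unstructured.Unstructured, the builtinKinds table is a hypothetical stand-in for the status.GetLegacyConditionsFn check, and effectiveStatus is an illustrative name.

```go
package main

import "fmt"

// builtinKinds is a hypothetical stand-in for "kstatus has a type-specific
// rule for this kind"; the PR uses status.GetLegacyConditionsFn instead.
var builtinKinds = map[string]bool{
	"v1/Service":         true,
	"v1/Pod":             true,
	"apps/v1/Deployment": true,
	"batch/v1/Job":       true,
}

// customResourceStatusPending sketches the three conditions from the list.
func customResourceStatusPending(obj map[string]any) bool {
	apiVersion, _ := obj["apiVersion"].(string)
	kind, _ := obj["kind"].(string)
	if builtinKinds[apiVersion+"/"+kind] {
		return false // built-in kinds are left to kstatus
	}
	meta, _ := obj["metadata"].(map[string]any)
	if _, ok := meta["generation"]; !ok {
		return false // no generation: treated as a config-style resource
	}
	status, _ := obj["status"].(map[string]any)
	if _, ok := status["observedGeneration"]; ok {
		return false // controller has reported; defer to kstatus
	}
	if conds, _ := status["conditions"].([]any); len(conds) > 0 {
		return false // controller has reported; defer to kstatus
	}
	return true
}

// effectiveStatus downgrades a reported "Current" to "InProgress" while the
// resource is still a pending fresh custom resource.
func effectiveStatus(reported string, obj map[string]any) string {
	if reported == "Current" && customResourceStatusPending(obj) {
		return "InProgress"
	}
	return reported
}

func main() {
	fresh := map[string]any{
		"apiVersion": "serving.knative.dev/v1",
		"kind":       "Service",
		"metadata":   map[string]any{"generation": int64(1)},
	}
	fmt.Println(effectiveStatus("Current", fresh)) // prints "InProgress"
}
```

Keying the lookup on apiVersion plus kind matters in this sketch because the Knative Service shares its kind name with the core v1 Service.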

pkg/kube/statuswait_test.go

  • Added customResourceNoStatusManifest and customResourceReadyManifest fixtures
    (a Knative-style serving.knative.dev/v1 Service).
  • Added TestStatusWaitCustomResourcePending with two cases:
      • a freshly created custom resource with no status is not reported ready and the wait
        times out with "resource Service/ns/hello-knative not ready. status: InProgress";
      • the same resource becomes ready once a simulated controller writes
        observedGeneration and a Ready condition, and the wait then succeeds.

Issue

Fixes #32066

Issue: #32066

Diffstat

 pkg/kube/statuswait.go      | 52 +++++++++++++++++++++++++-
 pkg/kube/statuswait_test.go | 91 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 141 insertions(+), 2 deletions(-)

Testing

  • Ran the relevant tests and linter for the changed files while developing.

  • Kept the change minimal and focused on this one issue.

AI assistance

I used GitHub Copilot to help write parts of this change. I've reviewed and tested it myself, I understand what it does, and I'll follow up on any review feedback.

Copilot AI review requested due to automatic review settings June 25, 2026 21:53
@pull-request-size pull-request-size Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 25, 2026

Copilot AI left a comment

Contributor

Pull request overview

Fixes Helm’s --wait behavior for newly created custom resources by preventing kstatus’s “empty status ⇒ Current” fallback from prematurely ending the wait during fresh installs (notably helm upgrade --install --wait).

Changes:

  • Adjusts status aggregation and timeout reporting to treat certain “fresh” custom resources as InProgress even if kstatus reports Current.
  • Adds customResourceStatusPending heuristic to detect the “fresh CR with empty status” window.
  • Adds test coverage for the pending→ready transition using a Knative-style CR fixture.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/kube/statuswait.go | Adds heuristic + aggregation/error-reporting adjustments to avoid early success on fresh custom resources with empty status. |
| pkg/kube/statuswait_test.go | Adds fixtures and a new test to validate waiting behavior for a fresh custom resource until status is populated. |

Comment thread pkg/kube/statuswait.go Outdated
Comment on lines +319 to +325
    if _, found, _ := unstructured.NestedInt64(u.Object, "status", "observedGeneration"); found {
        return false
    }
    if conditions, found, _ := unstructured.NestedSlice(u.Object, "status", "conditions"); found && len(conditions) > 0 {
        return false
    }
    return true
Comment thread pkg/kube/statuswait.go Outdated
Comment on lines +299 to +303
    // - The resource is not one of the built-in types kstatus already understands.
    // - metadata.generation is set, which the API server only maintains for
    //   resources whose CRD enables the status subresource.
    // - The controller has not yet reported status.observedGeneration or any
    //   status.conditions.
Comment thread pkg/kube/statuswait_test.go Outdated
Comment on lines +532 to +540
    // Simulate the controller reconciling the resource after a short delay.
    go func() {
        time.Sleep(time.Millisecond * 500)
        ready := getRuntimeObjFromManifests(t, []string{customResourceReadyManifest})[0].(*unstructured.Unstructured)
        err := fakeClient.Tracker().Update(gvr, ready, ready.GetNamespace())
        assert.NoError(t, err)
    }()
    err = statusWaiter.Wait(resourceListFor(objs), time.Second*10)
    assert.NoError(t, err)
…grade --

Fixes helm#32066

Signed-off-by: Pratheek Nathani <181894361+pratheeknathani@users.noreply.github.com>
@pratheeknathani

Author

Thanks for the review! I've addressed all three comments in 166c56b:

  1. Empty-status heuristic (High): customResourceStatusPending now returns false as soon as .status is a non-empty map, instead of only checking status.observedGeneration / status.conditions. This matches the documented "empty status" intent and avoids holding a resource in InProgress when a controller has populated other status fields.

  2. Misleading metadata.generation comment (Medium): Reworded the doc comment — generation is the spec generation the API server tracks for the object; I dropped the incorrect "only maintained when the status subresource is enabled" claim.

  3. Test data races (Medium): Each parallel subtest now builds its own runtime.NewScheme() instead of mutating the global scheme.Scheme (the dominant race under -race). The controller-simulation goroutine no longer calls testify assertions — it reports its error over a channel that the test goroutine drains — and the test now asserts that Wait blocks until the simulated reconcile, so it fails if Wait returns early.
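The revised check from item 1 can be sketched as below. This shows only the "non-empty .status" part under a plain-map object model; statusEmpty is a hypothetical name, and the other preconditions of customResourceStatusPending are omitted.

```go
package main

import "fmt"

// statusEmpty reports whether the object carries no status fields at all.
// Any populated field counts as evidence that a controller has reconciled
// the resource, so the normal kstatus computation can take over.
// Assumption: a missing or non-map .status is treated as empty.
func statusEmpty(obj map[string]any) bool {
	status, ok := obj["status"].(map[string]any)
	return !ok || len(status) == 0
}

func main() {
	noStatus := map[string]any{"kind": "Service"}
	emptyStatus := map[string]any{"status": map[string]any{}}
	// A controller that writes only a custom field, with no conditions
	// and no observedGeneration.
	otherField := map[string]any{"status": map[string]any{"url": "http://example.invalid"}}

	fmt.Println(statusEmpty(noStatus), statusEmpty(emptyStatus), statusEmpty(otherField))
	// prints "true true false"
}
```

The third object is the case the earlier version would have held in InProgress indefinitely, since it has neither status.observedGeneration nor status.conditions.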

Verified locally: go vet ./pkg/kube/ is clean, the target test passes under -race -count=3, and the full pkg/kube package passes under -race -shuffle=on.

Development

Successfully merging this pull request may close these issues.

Helm doesn't wait for Custom Resources on new releases using upgrade --install.
