
Keep canary alive when primary promotion fails #1931

Open
pedrampdd wants to merge 1 commit into fluxcd:main from pedrampdd:fix/1898-keep-canary-on-promotion-failure

Conversation

@pedrampdd pedrampdd commented Jun 11, 2026

Problem

Reported in #1898. When a primary pod fails to initialize after a canary
promotion, Flagger takes the application down instead of preserving the
healthy canary.

Flow:

  1. The canary analysis succeeds and Flagger copies the canary pod spec to the
    primary (Promote), then moves to the Promoting/Finalising phase and
    waits for the primary rollout to finish.
  2. The promoted primary fails to become ready (bad image, failing sidecar,
    slow or repeatedly failing initialization, etc.).
  3. IsPrimaryReady eventually returns a non-retriable error (progress deadline
    exceeded), which triggers the standard analysis rollback().
  4. rollback() routes all traffic to the primary and scales the canary to
    zero.

The problem is that during promotion the primary already runs the new
(failing) spec, while the canary is the only healthy copy of the new revision
still serving traffic. "Rolling back to the primary" therefore sends all
traffic to the broken primary and deletes the only working pods — a full
outage (worst in Recreate mode, where no old primary pod remains).

rollback() is correct for an analysis failure during Progressing (there the
primary still holds the old, good spec), but wrong once promotion has started.
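
To make the difference between the two phases concrete, here is a minimal sketch of the failure mode. It is a toy model, not Flagger code: the `state` type, its fields, the traffic weights, and the `rollback` and `outage` helpers are all hypothetical names chosen for illustration. The only behaviour taken from the description above is that rollback routes all traffic to the primary and scales the canary to zero.

```go
package main

import "fmt"

// state is a hypothetical, simplified view of a canary release.
type state struct {
	primaryHealthy bool
	canaryHealthy  bool
	canaryReplicas int
	primaryWeight  int // percent of traffic routed to the primary
}

// rollback applies the two effects described in the flow above:
// all traffic goes to the primary and the canary is scaled to zero.
func rollback(s state) state {
	s.primaryWeight = 100
	s.canaryReplicas = 0
	return s
}

// outage reports whether no healthy workload is receiving traffic.
func outage(s state) bool {
	primaryServes := s.primaryHealthy && s.primaryWeight > 0
	canaryServes := s.canaryHealthy && s.canaryReplicas > 0 && s.primaryWeight < 100
	return !primaryServes && !canaryServes
}

func main() {
	// Analysis failure during Progressing: the primary still runs the old, good spec.
	progressing := state{primaryHealthy: true, canaryHealthy: false, canaryReplicas: 2, primaryWeight: 50}
	// Readiness failure during Promoting: the primary runs the new, failing spec
	// (the weight of 0 is an illustrative assumption).
	promoting := state{primaryHealthy: false, canaryHealthy: true, canaryReplicas: 2, primaryWeight: 0}

	fmt.Println(outage(rollback(progressing))) // false
	fmt.Println(outage(rollback(promoting)))   // true
}
```

In this model the same rollback is safe in the first case and causes an outage in the second, which is the asymmetry the fix addresses.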

Fix

When IsPrimaryReady returns a non-retriable error and the canary is in the
Promoting or Finalising phase, halt the promotion instead of rolling back:

  • mark the rollout as Failed and emit a warning event + alert, so it stops
    advancing and surfaces the failure;
  • do not route traffic to the unhealthy primary;
  • do not scale the canary to zero.

The canary keeps serving traffic until the primary recovers or a corrected
revision is applied. Behaviour during Progressing (and every other phase) is
unchanged.
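
The decision described above can be sketched as a small pure function. This is an illustration of the rule only, under the assumption that the phase and the retriability of the readiness error are the sole inputs; the `Phase` and `Action` types, their constants, and `onPrimaryNotReady` are hypothetical names and do not reflect how the change is structured in Flagger's scheduler.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the canary phase.
type Phase string

const (
	PhaseProgressing Phase = "Progressing"
	PhasePromoting   Phase = "Promoting"
	PhaseFinalising  Phase = "Finalising"
)

// Action is a hypothetical label for what the controller does next.
type Action string

const (
	ActionRetry    Action = "retry"    // wait and check readiness again
	ActionRollback Action = "rollback" // route traffic to primary, scale canary to zero
	ActionHalt     Action = "halt"     // mark Failed and alert; leave routing and canary untouched
)

// onPrimaryNotReady picks the action for a primary readiness error.
func onPrimaryNotReady(phase Phase, retriable bool) Action {
	if retriable {
		return ActionRetry
	}
	switch phase {
	case PhasePromoting, PhaseFinalising:
		// The primary already runs the new spec, so rolling back to it is unsafe.
		return ActionHalt
	default:
		// Every other phase keeps the existing behaviour.
		return ActionRollback
	}
}

func main() {
	fmt.Println(onPrimaryNotReady(PhaseProgressing, false)) // rollback
	fmt.Println(onPrimaryNotReady(PhasePromoting, false))   // halt
	fmt.Println(onPrimaryNotReady(PhaseFinalising, false))  // halt
	fmt.Println(onPrimaryNotReady(PhasePromoting, true))    // retry
}
```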

This is the minimal, non-destructive safety fix. Follow-up #1932 tracks the
model-correct behaviour. A promotion only starts after the canary passes
analysis, so the canary running the new revision is healthy and only the
primary's separately-rendered copy failed. Whether Flagger should revert the
primary to its last-known-good spec or keep serving the healthy canary is an
open question to settle there.

Tests

Added TestScheduler_DeploymentPromotionPrimaryNotReady, which drives the
canary to Promoting, makes the primary stuck (ProgressDeadlineExceeded),
and asserts the canary is not scaled to zero and traffic is not shifted onto
the broken primary. The full pkg/controller and pkg/canary suites pass
(go test ./pkg/controller/ ./pkg/canary/), and gofmt and go vet are clean.

Fixes #1898

When the canary analysis succeeds, Flagger copies the canary pod spec
to the primary and waits for the primary rollout to finish. If the
primary fails to become ready, the non-retriable readiness error
triggered the standard analysis rollback, which routes all traffic to
the primary and scales the canary to zero.

During promotion the primary already runs the new (failing) spec while
the canary is the only healthy copy of the new revision still serving
traffic. Rolling back therefore sends all traffic to the broken primary
and deletes the working canary, taking the application down.

Halt the promotion instead: when the primary is not ready and the canary
is in the Promoting or Finalising phase, mark the rollout as failed and
alert, but keep the canary running and leave routing untouched until the
primary recovers or a corrected revision is applied.

Fixes fluxcd#1898

Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>
@pedrampdd pedrampdd force-pushed the fix/1898-keep-canary-on-promotion-failure branch from eb69ba1 to 120c187 Compare June 11, 2026 21:00

Development

Successfully merging this pull request may close these issues.

If a primary pod fails to initialize, flagger doesn't always do the right thing
