Keep canary alive when primary promotion fails #1931
Open
pedrampdd wants to merge 1 commit into
When the canary analysis succeeds, Flagger copies the canary pod spec to the primary and waits for the primary rollout to finish. If the primary fails to become ready, the non-retriable readiness error triggers the standard analysis rollback, which routes all traffic to the primary and scales the canary to zero.

During promotion the primary already runs the new (failing) spec while the canary is the only healthy copy of the new revision still serving traffic. Rolling back therefore sends all traffic to the broken primary and deletes the working canary, taking the application down.

Halt the promotion instead: when the primary is not ready and the canary is in the Promoting or Finalising phase, mark the rollout as failed and alert, but keep the canary running and leave routing untouched until the primary recovers or a corrected revision is applied.

Fixes fluxcd#1898

Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>
Problem
Reported in #1898. When a primary pod fails to initialize after a canary
promotion, Flagger takes the application down instead of preserving the
healthy canary.
Flow:
1. Canary analysis succeeds and Flagger copies the canary pod spec to the primary (`Promote`), then moves to the `Promoting`/`Finalising` phase and waits for the primary rollout to finish.
2. The primary fails to become ready (slow/again-failing init, etc.).
3. `IsPrimaryReady` eventually returns a non-retriable error (progress deadline exceeded), which triggers the standard analysis `rollback()`.
4. `rollback()` routes all traffic to the primary and scales the canary to zero.
The problem is that during promotion the primary already runs the new
(failing) spec, while the canary is the only healthy copy of the new revision
still serving traffic. "Rolling back to the primary" therefore sends all
traffic to the broken primary and deletes the only working pods, causing a full
outage (worst in `Recreate` mode, where no old primary pod remains).
`rollback()` is correct for an analysis failure during `Progressing` (there the
primary still holds the old, good spec), but wrong once promotion has started.
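The difference between the two cases can be illustrated with a toy model. This is only a sketch: the `state` type, the `rollback` and `serving` helpers, and the replica and weight values are hypothetical and do not reflect Flagger's real data model or router logic. The canary is assumed healthy whenever it has replicas, since it has already passed analysis.

```go
package main

import "fmt"

// state is a hypothetical, simplified view of a release.
// It is not Flagger's actual data model.
type state struct {
	primaryHealthy bool // whether the primary pods are ready
	canaryReplicas int  // canary pods (assumed healthy if > 0)
	canaryWeight   int  // percent of traffic routed to the canary
}

// rollback mirrors the behaviour described above: all traffic is routed
// to the primary and the canary is scaled to zero.
func rollback(s state) state {
	s.canaryWeight = 0
	s.canaryReplicas = 0
	return s
}

// serving reports whether at least one healthy backend receives traffic.
func serving(s state) bool {
	primaryServes := s.primaryHealthy && s.canaryWeight < 100
	canaryServes := s.canaryReplicas > 0 && s.canaryWeight > 0
	return primaryServes || canaryServes
}

func main() {
	// Analysis failure during Progressing: the primary still holds the old, good spec.
	duringAnalysis := state{primaryHealthy: true, canaryReplicas: 2, canaryWeight: 50}
	// Failure during promotion: the primary already runs the new, failing spec.
	duringPromotion := state{primaryHealthy: false, canaryReplicas: 2, canaryWeight: 50}

	fmt.Println("rollback during analysis, still serving:", serving(rollback(duringAnalysis)))
	fmt.Println("rollback during promotion, still serving:", serving(rollback(duringPromotion)))
	fmt.Println("no rollback during promotion, still serving:", serving(duringPromotion))
}
```

In this model the same `rollback` step is harmless when the primary is healthy and removes the last serving backend when it is not.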
Fix
When `IsPrimaryReady` returns a non-retriable error and the canary is in the
`Promoting` or `Finalising` phase, halt the promotion instead of rolling back:

- mark the rollout as `Failed` and emit a warning event + alert, so it stops advancing and surfaces the failure;
- keep the canary running and leave routing untouched.

The canary keeps serving traffic until the primary recovers or a corrected
revision is applied. Behaviour during `Progressing` (and every other phase) is
unchanged.
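The phase-gated decision can be sketched as a small pure function. This is an illustration, not the code in this PR: the `phase` and `action` types, their constants, and `onPrimaryNotReady` are hypothetical names, and the "wait" outcome for retriable errors is an assumption about how a retriable readiness error is handled.

```go
package main

import "fmt"

// phase and action are hypothetical local types for illustration only.
type phase string

const (
	phaseProgressing phase = "Progressing"
	phasePromoting   phase = "Promoting"
	phaseFinalising  phase = "Finalising"
)

type action string

const (
	actionWait     action = "wait"     // assumed outcome for retriable errors
	actionRollback action = "rollback" // route traffic to primary, scale canary to zero
	actionHalt     action = "halt"     // mark failed and alert, keep canary and routing
)

// onPrimaryNotReady picks what to do when the primary readiness check fails.
func onPrimaryNotReady(p phase, retriable bool) action {
	if retriable {
		return actionWait
	}
	switch p {
	case phasePromoting, phaseFinalising:
		// The primary already runs the new spec; rolling back would
		// remove the only healthy pods, so halt instead.
		return actionHalt
	default:
		// Unchanged behaviour: the primary still holds the old, good spec.
		return actionRollback
	}
}

func main() {
	for _, p := range []phase{phaseProgressing, phasePromoting, phaseFinalising} {
		fmt.Printf("%s: %s\n", p, onPrimaryNotReady(p, false))
	}
}
```

Keeping the decision separate from its side effects makes the phase check straightforward to cover with a table-driven test.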
This is the minimal, non-destructive safety fix. Follow-up #1932 tracks the
model-correct behaviour. Note that a promotion only starts after the canary
passes analysis, so the canary running the new revision is healthy and only the
primary's separately-rendered copy failed; whether Flagger should revert the
primary to its last-known-good spec or keep serving the healthy canary is an
open question to settle there.
Tests
Added `TestScheduler_DeploymentPromotionPrimaryNotReady`, which drives the
canary to `Promoting`, makes the primary stuck (`ProgressDeadlineExceeded`),
and asserts the canary is not scaled to zero and traffic is not shifted onto
the broken primary. The full `pkg/controller` and `pkg/canary` suites pass
(`go test ./pkg/controller/ ./pkg/canary/`), `gofmt` and `go vet` are clean.

Fixes #1898