fix: set FailureAction=rollback for swarm services default UpdateConfig #3810
Open
jaimehgb wants to merge 1 commit into Dokploy:canary
Docker Swarm's default FailureAction is "pause". When a task fails or is terminated early during a rolling update, Swarm pauses the update and stops ALL reconciliation: orphan containers persist indefinitely, even when healthy. This is the root cause of orphan container issues reported in production (services showing Replicas: N/1 with multiple healthy containers that never get cleaned up).

Setting FailureAction to "rollback" makes Swarm automatically revert to the previous working service spec on failure, preventing orphans while preserving service availability.

Also adds a default RollbackConfig with Order: "start-first" to match the update config (Docker defaults rollback to "stop-first" otherwise).

Only affects the default config. Users who have configured their own updateConfigSwarm/rollbackConfigSwarm are not affected.

Relates to Dokploy#1669, Dokploy#2223, Dokploy#2911, Dokploy#2150
Problem
Several issues have been reported about orphan containers piling up in Swarm deployments -- services stuck at Replicas: N/1 with multiple healthy containers that never go away.

See #1669, #2223, #2911, #2150
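The symptom above can be checked mechanically. The following is a minimal sketch, assuming the Replicas value is a "running/desired" string such as "5/1"; the helper names `parseReplicas` and `extraTasks` are hypothetical and not part of Dokploy.

```typescript
// Hypothetical helpers: interpret a "running/desired" replicas string.
interface ReplicaCount {
  running: number;
  desired: number;
}

function parseReplicas(column: string): ReplicaCount {
  // Match the leading "<running>/<desired>" pair; any trailing text is ignored.
  const match = /^(\d+)\/(\d+)/.exec(column.trim());
  if (!match) {
    throw new Error(`unrecognized replicas value: ${column}`);
  }
  return { running: Number(match[1]), desired: Number(match[2]) };
}

// Number of running tasks beyond the desired count (0 when there are none).
function extraTasks(column: string): number {
  const { running, desired } = parseReplicas(column);
  return Math.max(0, running - desired);
}
```

With this sketch, a service reporting "5/1" has four extra tasks, while "1/1" and "0/1" have none.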
What's actually happening
Docker Swarm defaults FailureAction to "pause". If a task fails or gets killed mid-update (app crash, rapid deploys stepping on each other), Swarm pauses the update and stops reconciling. The extra containers sit there forever, healthy or not.

We confirmed this on a production cluster: 5 healthy containers sitting there for 30+ hours. Replicas: 5/1.

Fix
Sets better defaults for UpdateConfig and RollbackConfig in Swarm services.

This isn't really a Dokploy bug -- it's a Docker Swarm default that happens to be a bad fit for a deployment platform. Most users won't know FailureAction exists, let alone that it defaults to "pause". Setting it to "rollback" makes Swarm revert to the previous working spec when a deploy fails, instead of freezing mid-update.

Only affects the default config. Users who have set their own updateConfigSwarm or rollbackConfigSwarm in the UI are not touched.

The RollbackConfig default sets Order: "start-first" to match the update order. Without it, Docker defaults rollbacks to "stop-first", which briefly takes the service down during rollback.

Why rollback?

We tested all three options: pause (current), continue, rollback (this PR).

continue looked promising but it actually pushes the broken deploy through to completion, killing healthy tasks in the process. rollback is the only option that both prevents orphans and keeps the service available.

Reproduction
Script that reproduces the bug and verifies the fix on any Swarm node (no Dokploy needed):
Reproduction script and docs (Gist)
Testing
- FailureAction=rollback prevents orphans (1 task, rollback_completed)
- FailureAction=continue prevents orphans but kills the service (0/1)
- updateConfigSwarm/rollbackConfigSwarm are not overwritten
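The defaulting behaviour described in this PR can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual diff: the field names (Order, FailureAction) and their allowed values follow the Docker Engine API service spec, while `SwarmSettings`, `DEFAULT_UPDATE_CONFIG`, `DEFAULT_ROLLBACK_CONFIG` and `resolveSwarmConfigs` are hypothetical names chosen for the example.

```typescript
// Field names and value sets follow the Docker Engine API service spec.
type Order = "stop-first" | "start-first";

interface UpdateConfig {
  Order?: Order;
  FailureAction?: "pause" | "continue" | "rollback";
}

interface RollbackConfig {
  Order?: Order;
  FailureAction?: "pause" | "continue";
}

// Hypothetical shape of the per-service settings a user may have saved.
interface SwarmSettings {
  updateConfigSwarm?: UpdateConfig | null;
  rollbackConfigSwarm?: RollbackConfig | null;
}

// Assumed defaults: roll back on failure, and start new tasks before stopping old ones.
const DEFAULT_UPDATE_CONFIG: UpdateConfig = {
  Order: "start-first",
  FailureAction: "rollback",
};

const DEFAULT_ROLLBACK_CONFIG: RollbackConfig = {
  Order: "start-first",
};

// User-provided configs win as-is; defaults apply only when nothing is set.
function resolveSwarmConfigs(settings: SwarmSettings): {
  UpdateConfig: UpdateConfig;
  RollbackConfig: RollbackConfig;
} {
  return {
    UpdateConfig: settings.updateConfigSwarm ?? DEFAULT_UPDATE_CONFIG,
    RollbackConfig: settings.rollbackConfigSwarm ?? DEFAULT_ROLLBACK_CONFIG,
  };
}
```

The fallback replaces the whole object rather than merging field by field, so a user who saved an updateConfigSwarm with FailureAction "pause" keeps exactly that behaviour.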