
fix: set FailureAction=rollback for swarm services default UpdateConfig #3810

Open
jaimehgb wants to merge 1 commit into Dokploy:canary from jaimehgb:fix/swarm-convergence


@jaimehgb jaimehgb commented Feb 26, 2026

Problem

Several issues have been reported about orphan containers piling up in Swarm deployments -- services stuck at Replicas: N/1 with multiple healthy containers that never go away.

See #1669, #2223, #2911, #2150

What's actually happening

Docker Swarm defaults FailureAction to "pause". If a task fails or gets killed mid-update (app crash, rapid deploys stepping on each other), Swarm pauses the update and stops reconciling. The extra containers sit there forever, healthy or not.

We confirmed this on a production cluster:

$ docker service inspect <service> --format '{{json .UpdateStatus}}'
{
    "State": "paused",
    "StartedAt": "2026-02-27T23:44:07.480239109Z",
    "Message": "update paused due to failure or early termination of task l38gsrsqg2rl..."
}

5 healthy containers sitting there for 30+ hours. Replicas: 5/1.
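
The paused state shown above can also be checked programmatically. The following is a minimal sketch, assuming the JSON printed by `docker service inspect <service> --format '{{json .UpdateStatus}}'` has the shape shown above. The `UpdateStatus` interface and the `isUpdatePaused` helper are hypothetical names used for illustration; they are not part of Dokploy or Docker.

```typescript
// Only the fields visible in the output above are modelled here.
interface UpdateStatus {
  State?: string;
  StartedAt?: string;
  Message?: string;
}

// Hypothetical helper: returns true when the inspected service reports
// a paused update. Input is the raw JSON string printed by the command.
function isUpdatePaused(rawJson: string): boolean {
  const parsed: UpdateStatus | null = JSON.parse(rawJson);
  // Optional chaining keeps this safe if the JSON value is null.
  return parsed?.State === "paused";
}
```

A monitoring job could feed each service's `UpdateStatus` into a check like this and alert on `true`, rather than waiting for someone to notice the extra containers.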

Fix

Sets FailureAction to "rollback" in the default UpdateConfig and adds a default RollbackConfig with Order: "start-first" for Swarm services.

This isn't really a Dokploy bug -- it's a Docker Swarm default that happens to be a bad fit for a deployment platform. Most users won't know FailureAction exists, let alone that it defaults to "pause". Setting it to "rollback" makes Swarm revert to the previous working spec when a deploy fails, instead of freezing mid-update.

Only affects the default config. Users who have set their own updateConfigSwarm or rollbackConfigSwarm in the UI are not touched.

+// default rollback config to match update config
+RollbackConfig: {
+    Parallelism: 1,
+    Order: "start-first",
+},
+
 // default config if no updateConfigSwarm provided
 UpdateConfig: {
     Parallelism: 1,
     Order: "start-first",
+    FailureAction: "rollback",
 },

The RollbackConfig default sets Order: "start-first" to match the update order. Without it, Docker defaults rollbacks to "stop-first", which briefly takes the service down during rollback.
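
The "defaults only" behavior can be sketched as a simple fallback. This is an illustration under stated assumptions, not the actual Dokploy code: the field names `updateConfigSwarm` and `rollbackConfigSwarm` come from this PR, but the `resolveSwarmConfigs` function, the type names, and the assumption that a user-provided config replaces the default as a whole (rather than being merged field by field) are hypothetical.

```typescript
// Hypothetical type covering only the fields used in this PR.
interface SwarmRolloutConfig {
  Parallelism: number;
  Order?: "start-first" | "stop-first";
  FailureAction?: "pause" | "continue" | "rollback";
}

// Hypothetical container for the user-configurable settings.
interface UserSwarmSettings {
  updateConfigSwarm?: SwarmRolloutConfig | null;
  rollbackConfigSwarm?: SwarmRolloutConfig | null;
}

// Defaults as proposed in the diff above.
const DEFAULT_UPDATE_CONFIG: SwarmRolloutConfig = {
  Parallelism: 1,
  Order: "start-first",
  FailureAction: "rollback",
};

const DEFAULT_ROLLBACK_CONFIG: SwarmRolloutConfig = {
  Parallelism: 1,
  Order: "start-first",
};

// Assumption: a user-provided config wins outright; defaults apply
// only when the user has configured nothing.
function resolveSwarmConfigs(settings: UserSwarmSettings) {
  return {
    UpdateConfig: settings.updateConfigSwarm ?? DEFAULT_UPDATE_CONFIG,
    RollbackConfig: settings.rollbackConfigSwarm ?? DEFAULT_ROLLBACK_CONFIG,
  };
}
```

With this shape, a user who set `FailureAction: "pause"` on purpose keeps it, while a service with no custom settings picks up the new rollback behavior.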

Why rollback?

We tested all three options:

| Value | On failure | Orphans? | Availability |
| --- | --- | --- | --- |
| pause (current) | Freezes everything | Yes, permanent | Old tasks survive by accident |
| continue | Keeps retrying | No | Service goes down (broken deploy completes, kills healthy tasks) |
| rollback (this PR) | Reverts to previous spec | No | Previous version stays up |

continue looked promising but it actually pushes the broken deploy through to completion, killing healthy tasks in the process. rollback is the only option that both prevents orphans and keeps the service available.

Reproduction

Script that reproduces the bug and verifies the fix on any Swarm node (no Dokploy needed):

Reproduction script and docs (Gist)

docker swarm init  # skip if this node is already part of a swarm
curl -sL https://gist.githubusercontent.com/jaimehgb/6ae57f6a079bf389ed57fe18c4fd3877/raw/reproduce-orphan-bug.sh | bash

Testing

  • Reproduced locally (3 healthy orphans, UpdateStatus=paused)
  • Reproduced on production Dokploy cluster (5 healthy orphans, 30+ hours)
  • Verified FailureAction=rollback prevents orphans (1 task, rollback_completed)
  • Verified FailureAction=continue prevents orphans but kills the service (0/1)
  • Built custom Dokploy image, deployed to local Swarm, confirmed service gets correct UpdateConfig and RollbackConfig
  • Confirmed custom updateConfigSwarm/rollbackConfigSwarm are not overwritten
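
The "service gets correct UpdateConfig and RollbackConfig" check can be automated along these lines. This is a sketch, assuming the input is the JSON array printed by `docker service inspect <service>` (without `--format`) and that the relevant settings sit under `Spec.UpdateConfig` and `Spec.RollbackConfig`. The `hasRollbackDefaults` helper and the interface name are hypothetical.

```typescript
// Only the fields this check needs are modelled here.
interface ServiceInspectEntry {
  Spec?: {
    UpdateConfig?: { FailureAction?: string; Order?: string };
    RollbackConfig?: { Order?: string };
  };
}

// Hypothetical helper: returns true when the first inspected service
// carries the defaults proposed in this PR.
function hasRollbackDefaults(inspectJson: string): boolean {
  const entries: ServiceInspectEntry[] = JSON.parse(inspectJson);
  const spec = entries[0]?.Spec;
  return (
    spec?.UpdateConfig?.FailureAction === "rollback" &&
    spec?.UpdateConfig?.Order === "start-first" &&
    spec?.RollbackConfig?.Order === "start-first"
  );
}
```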

@jaimehgb jaimehgb force-pushed the fix/swarm-convergence branch from 0b7ef69 to 0357eff Compare February 28, 2026 22:18
@jaimehgb jaimehgb changed the title from "fix: wait for swarm task convergence after service update" to "fix: set FailureAction=rollback for swarm services default UpdateConfig" Feb 28, 2026
Docker Swarm's default FailureAction is "pause". When a task fails or is
terminated early during a rolling update, Swarm pauses the update and
stops ALL reconciliation — orphan containers persist indefinitely, even
when healthy. This is the root cause of orphan container issues reported
in production (services showing Replicas: N/1 with multiple healthy
containers that never get cleaned up).

Setting FailureAction to "rollback" makes Swarm automatically revert to
the previous working service spec on failure, preventing orphans while
preserving service availability. Also adds a default RollbackConfig with
Order: "start-first" to match the update config (Docker defaults rollback
to "stop-first" otherwise).

Only affects the default config — users who have configured their own
updateConfigSwarm/rollbackConfigSwarm are not affected.

Relates to Dokploy#1669, Dokploy#2223, Dokploy#2911, Dokploy#2150
@jaimehgb jaimehgb force-pushed the fix/swarm-convergence branch from 0357eff to fadc7fe Compare February 28, 2026 23:20
@jaimehgb jaimehgb marked this pull request as ready for review February 28, 2026 23:26
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 28, 2026

@dosubot dosubot bot added the bug Something isn't working label Feb 28, 2026

dosubot bot commented Feb 28, 2026

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.

