Self-hosted v4.4.3: docker-provider never processes runs (PENDING indefinitely) #3279

@juniorcammel

Environment

  • trigger.dev version: v4.4.3 (webapp, coordinator, docker-provider all same version)
  • Deployment: Docker Swarm via Portainer CE
  • OS: Debian 12 Bookworm
  • Docker: 29.3.0
  • PostgreSQL: 16
  • Redis: 7
  • ElectricSQL: latest
  • ClickHouse: 25.8 (external, shared)

Description

Runs triggered via the API are accepted and enqueued in Redis, but the docker-provider never processes them, so they remain in PENDING status indefinitely. Development mode works (via npx trigger.dev dev), but deployed execution in Production does not.

What works

  • Webapp: running, healthy, API responds correctly
  • Coordinator: connects via WebSocket, receives DYNAMIC_CONFIG ✓
  • Docker-provider: connects via WebSocket, receives SERVER_READY + PRE_PULL_DEPLOYMENT ✓
  • ElectricSQL: running, replication active ✓
  • Deployments: DEPLOYED status in Production (ic3etnvx, 20260326.2, 2 tasks) ✓
  • Dev mode: health-check task runs COMPLETED_SUCCESSFULLY in Development ✓
  • Registry: docker login works, credentials configured ✓
  • API trigger: returns run ID successfully ✓

What doesn't work

  • Runs in Production stay PENDING forever
  • Docker-provider shows zero activity after SERVER_READY (no pull, spawn, create, task, or run logs)
  • Zero ephemeral task containers are ever created
  • TaskRunAttempt table: 0 rows (no attempts ever recorded)

Detailed investigation

1. Redis queues have the messages

engine:runqueue:workerQueue:cmn40rrgz0005qu1rihgeecsx-default → 3 messages (list type)
engine:runqueue:{org:...}:message:cmn7oqa7100011rqiy79ocv3w
engine:runqueue:{org:...}:message:cmn7nzcrp00001rqi88hoff19
engine:runqueue:{org:...}:message:cmn7nhm0v00091robjnlq74fg

Messages are correctly enqueued but never dequeued.
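For anyone trying to reproduce the check above, here is a minimal sketch of how the worker queue depths can be counted. It assumes a redis-py style client (only `scan_iter`, `type`, and `llen` are used); the function name `queue_depths` and the default key pattern are illustrative, taken from the key names shown above, and are not part of trigger.dev.

```python
from typing import Dict


def queue_depths(client, pattern: str = "engine:runqueue:workerQueue:*") -> Dict[str, int]:
    """Return {key: length} for every list-type key matching the pattern.

    `client` is assumed to be a redis-py style client, for example
    redis.Redis(host="trigger-redis", port=6379, password="...",
                decode_responses=True).
    """
    depths: Dict[str, int] = {}
    for key in client.scan_iter(match=pattern):
        kind = client.type(key)
        # Without decode_responses=True, redis-py returns bytes here.
        if isinstance(kind, bytes):
            kind = kind.decode()
        if kind == "list":
            depths[key] = client.llen(key)
    return depths
```

A non-zero depth that never shrinks between two calls a few seconds apart matches the symptom described here: messages are enqueued but nothing consumes them.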

2. SharedQueueConsumer reports no messages

The webapp logs show:

{"reasonStats":{"no_message_dequeued":10},"actionStats":{},"outcomeStats":{"noop":10}}

The consumers iterate but find nothing to dequeue, despite messages existing in the workerQueue.
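To spot this pattern across many log lines, the stats line can be parsed mechanically. The sketch below assumes each stats entry is a single JSON object shaped like the one above; `consumer_is_idle` is a hypothetical helper name, and the rule it applies (every iteration ended in `noop`, and `no_message_dequeued` was the only reason) is my reading of the log, not documented behavior.

```python
import json


def consumer_is_idle(log_line: str) -> bool:
    """True if every iteration in the stats line was a noop with nothing dequeued."""
    stats = json.loads(log_line)
    reasons = stats.get("reasonStats", {})
    outcomes = stats.get("outcomeStats", {})
    total = sum(outcomes.values())
    only_noop = total > 0 and outcomes.get("noop", 0) == total
    only_empty_dequeues = set(reasons) <= {"no_message_dequeued"}
    return only_noop and only_empty_dequeues
```

Applied to the line quoted above, this returns True, since all 10 iterations were noops.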

3. WorkerInstanceGroup was manually created

The WorkerInstanceGroup table was empty (0 rows). We manually created:

INSERT INTO "WorkerInstanceGroup" (id, type, name, masterQueue, hidden, tokenId, organizationId, projectId, ...)
VALUES ('...', 'MANAGED', 'default', '<projectId>-default', false, '<tokenId>', '<orgId>', '<projectId>', ...);

UPDATE "Project" SET "defaultWorkerGroupId" = '<groupId>' WHERE id = '<projectId>';

After this workaround, the API stopped returning "No worker group found" and started accepting runs, but the runs still don't execute.

4. Environment vars added to provider/coordinator

Initially, docker-provider and coordinator were missing DATABASE_URL, REDIS_HOST, REDIS_PORT, REDIS_PASSWORD. We added them (matching the webapp's values). No change in behavior.
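A quick way to confirm that a service actually sees these variables is to compare its environment against the list. This is a minimal sketch: the variable names are only the four we added in this step, not an official list of required settings, and `missing_vars` is a hypothetical helper.

```python
import os
from typing import Iterable, List, Mapping

# The four variables added in this step; NOT an official required list.
ADDED_VARS = ("DATABASE_URL", "REDIS_HOST", "REDIS_PORT", "REDIS_PASSWORD")


def missing_vars(env: Mapping[str, str], required: Iterable[str] = ADDED_VARS) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Run inside the container to check the live environment.
    print(missing_vars(os.environ))
```

In our case the list comes back empty for both services, which is consistent with the variables not being the cause.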

5. Docker-provider logs (complete from startup)

new zod socket → ws://trigger-webapp:3030/provider
new zod socket → ws://trigger-webapp:3030/shared-queue
Initializing task operations
server listening on port 8809
connect (socket-provider) ✓
connect (socket-shared-queue) ✓
Incoming event SERVER_READY ✓
No checkpoint support: Please enable docker experimental features.
Simulation mode enabled. Containers will be paused, not checkpointed.

After this: complete silence. No dequeue, no pull, no spawn, no task activity.
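The "silence" check can be made repeatable with a small log filter. The sketch below assumes the log is available as plain text lines; the keyword list simply mirrors the activity we expected to see (dequeue, pull, spawn, create, task, run), and both it and the function name `activity_after_ready` are illustrative choices rather than anything defined by the provider.

```python
import re
from typing import Iterable, List

# Words starting with any of these count as run-related activity (assumed list).
ACTIVITY = re.compile(r"\b(?:dequeue|pull|spawn|create|task|run)\w*", re.IGNORECASE)


def activity_after_ready(lines: Iterable[str], marker: str = "SERVER_READY") -> List[str]:
    """Return log lines after the first `marker` line that mention run activity."""
    seen_marker = False
    hits: List[str] = []
    for line in lines:
        if not seen_marker:
            # Ignore everything up to and including the marker line.
            seen_marker = marker in line
            continue
        if ACTIVITY.search(line):
            hits.append(line)
    return hits
```

For the startup log above, this returns an empty list: the only lines after SERVER_READY are the two checkpoint and simulation-mode notices.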

6. Coordinator logs (complete from startup)

Docker mode
connecting → ws://trigger-webapp:3030/coordinator
server listening on port 9020
connect (socket-coordinator) ✓
Incoming event DYNAMIC_CONFIG ✓
Handling DYNAMIC_CONFIG (version v1, checkpointThresholdInMs 30000)
No checkpoint support: Please enable docker experimental features.
Simulation mode enabled.

After this: only healthcheck /health requests. Zero run-related activity.

Questions

  1. Is the WorkerInstanceGroup supposed to be created automatically? In our self-hosted setup, both WorkerInstanceGroup and WorkerGroupToken tables were empty after initial deployment. The Regions page shows "Default worker instance group not found" with no option to create one.

  2. What triggers the docker-provider to dequeue and process runs? It receives SERVER_READY but never seems to poll or receive run assignments.

  3. Is the SharedQueueConsumer in the webapp supposed to read from engine:runqueue:workerQueue:* and forward to the provider? It reports no_message_dequeued despite messages existing in the queue.

  4. Is there a missing env var or configuration step for self-hosted Production execution that isn't in the template? The official docker-compose.yml and .env.example don't mention anything about worker groups.

Compose structure

Using the official template structure adapted for Docker Swarm:

  • webapp (ghcr.io/triggerdotdev/trigger.dev:v4.4.3)
  • postgres (16)
  • redis (7)
  • electric (latest)
  • docker-provider (ghcr.io/triggerdotdev/provider/docker:v4.4.3)
  • coordinator (ghcr.io/triggerdotdev/coordinator:v4.4.3)

All on the same overlay network. Communication between services verified (HTTP + WebSocket).

Environment variables (provider)

PLATFORM_HOST=trigger-webapp
PLATFORM_WS_PORT=3030
SECURE_CONNECTION=false
PLATFORM_SECRET=<set>
COORDINATOR_HOST=trigger-coordinator
COORDINATOR_PORT=9020
REGISTRY_HOST=registry.junior.pro
REGISTRY_NAMESPACE=dev/utmlab/trigger
REGISTRY_USERNAME=<set>
REGISTRY_PASSWORD=<set>
DATABASE_URL=postgresql://trigger:<pass>@trigger-postgres:5432/trigger
REDIS_HOST=trigger-redis
REDIS_PORT=6379
REDIS_PASSWORD=<set>
NODE_ENV=production
V3_ENABLED=true
RUNTIME_PLATFORM=docker-compose
