Summary

`@workflow/world-postgres` handles `step_started` with a conditional entity `UPDATE` (rejecting terminal steps via `notInArray(status, terminalStepStatuses)`) followed by a separate event `INSERT`, with no transaction or row lock spanning the two statements (`packages/world-postgres/src/storage.ts`, `step_started` branch).

Under concurrent workers, this allows the following interleaving:

1. Worker A: `step_started` UPDATE (status → running) + event INSERT
2. Worker B: `step_started` conditional UPDATE passes (step is still `running`, not terminal)
3. Worker A: `step_completed` UPDATE (status → completed) + event INSERT
4. Worker B: its `step_started` event INSERT lands after the `step_completed` event

The log then contains `step_started, step_completed, step_started` for the same `correlationId`, and replay fails with `CorruptedEventLogError` (duplicate `step_started`, second copy after `step_completed`).

Observed in CI

This was caught by the e2e regression test added in #2295 (`parallelStepsThenWebhookWorkflow`), which stresses same-tick replay races:

- Failing job: E2E Local Postgres Tests (sveltekit). The run ended in `CorruptedEventLogError` because `step_started` for `step_01KTQ64CF9C94A2WTK2JD8KJP7` appeared twice, the second copy after `step_completed`.
- Duplicate `hook_created` inserts were correctly rejected by the `workflow_events_entity_creation_unique` index, so the hook self-conflict fix from #2295 (fix(world-local,world-postgres): make duplicate hook_created idempotent) worked; this is a distinct step-lifecycle ordering bug.

The failure is timing-sensitive, not deterministic (11/11 local stress runs of the same matrix passed).

Relationship to existing protections

- The partial unique index `workflow_events_entity_creation_unique (runId, correlationId, eventType)` only covers entity-creation events (`step_created`, `hook_created`, `wait_created`). It cannot help here: duplicate `step_started` events are legitimate for retries, so a blanket uniqueness constraint is not the fix. The problem is the ordering of a late duplicate relative to `step_completed`.
- The related case of `step_created` from concurrent workers is addressed by the unique index. This issue is the lifecycle-event analogue that the index by design does not cover.

Possible fix shape

Wrap the step entity UPDATE and the event INSERT for step lifecycle events in a single transaction so the step row lock serializes concurrent writers: a late `step_started` would then block on the row lock until `step_completed` commits, observe the terminal state in its conditional UPDATE, and reject with `EntityConflictError` before inserting its event.

Note

The `parallelStepsThenWebhookWorkflow` e2e test is skipped on `world-postgres` (`WORKFLOW_TARGET_WORLD` includes `postgres`) until this is fixed; re-enable it when this lands. This also blocks fully closing #1665 on the postgres world.
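
To make the interleaving and the proposed serialization concrete, here is a minimal in-memory sketch. It is a toy model, not the `world-postgres` code: `simulate`, `Writer`, and `startedAfterCompleted` are hypothetical names, the "row lock" is a plain variable, and a rejected writer is only marked `rejected` where the real code would raise `EntityConflictError`. Each writer runs two statements (conditional UPDATE, then event INSERT), and the `order` array decides whose statement runs next.

```typescript
// Toy, in-memory model of the race. All names here are illustrative
// assumptions; nothing below is the actual world-postgres code.
type StepStatus = "running" | "completed" | "failed";
type EventType = "step_started" | "step_completed";
type WorkerName = "A" | "B";

const terminalStepStatuses: readonly StepStatus[] = ["completed", "failed"];

interface Writer {
  name: WorkerName;
  eventType: EventType;
  nextStatus: StepStatus;
  // "update" = conditional UPDATE pending, "insert" = event INSERT pending.
  phase: "update" | "insert" | "done" | "rejected";
}

// Runs the two writers statement by statement in the given order.
// With useRowLock, the UPDATE takes a lock that is held until the INSERT
// "commits", which approximates how a single transaction would behave.
function simulate(order: WorkerName[], useRowLock: boolean) {
  // Starting point: worker A's first step_started is already written.
  let status: StepStatus = "running";
  const log: EventType[] = ["step_started"];
  let lockHolder: WorkerName | null = null;

  const writers: Record<WorkerName, Writer> = {
    A: { name: "A", eventType: "step_completed", nextStatus: "completed", phase: "update" },
    B: { name: "B", eventType: "step_started", nextStatus: "running", phase: "update" },
  };

  const step = (w: Writer): void => {
    if (w.phase === "update") {
      // Blocked on the row lock: do nothing this turn and retry later.
      if (useRowLock && lockHolder !== null && lockHolder !== w.name) return;
      // Conditional UPDATE: a terminal step rejects the writer.
      if (terminalStepStatuses.indexOf(status) !== -1) {
        w.phase = "rejected";
        return;
      }
      status = w.nextStatus;
      if (useRowLock) lockHolder = w.name;
      w.phase = "insert";
    } else if (w.phase === "insert") {
      log.push(w.eventType);
      w.phase = "done";
      if (lockHolder === w.name) lockHolder = null; // commit releases the lock
    }
  };

  for (const name of order) step(writers[name]);
  // Drain: let any writer that was blocked finish its two statements.
  for (const w of [writers.A, writers.B]) {
    step(w);
    step(w);
  }

  const rejected = [writers.A, writers.B]
    .filter((w) => w.phase === "rejected")
    .map((w) => w.name);
  return { log, rejected };
}

// True when the log has a step_started after a step_completed,
// the condition the issue reports as corrupting replay.
function startedAfterCompleted(log: EventType[]): boolean {
  const done = log.indexOf("step_completed");
  return done !== -1 && log.indexOf("step_started", done + 1) !== -1;
}

const racy = simulate(["B", "A", "A", "B"], false);
const serialized = simulate(["B", "A", "A", "B"], true);
const lateStart = simulate(["A", "B", "A", "B"], true);
console.log(racy.log, startedAfterCompleted(racy.log));
console.log(serialized.log, startedAfterCompleted(serialized.log));
console.log(lateStart.log, lateStart.rejected);
```

In this model, the unlocked run reproduces the `step_started, step_completed, step_started` log. With the lock and the same arrival order, worker B's duplicate lands before `step_completed`, which is the legitimate retry ordering. When worker A takes the lock first, worker B waits, sees the terminal status, and is rejected without inserting an event.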