@workflow/world-postgres: late concurrent step_started can land after step_completed in the event log (non-atomic entity UPDATE + event INSERT) #2331

@TooTallNate

Summary

@workflow/world-postgres handles step_started with a conditional entity UPDATE (rejecting terminal steps via notInArray(status, terminalStepStatuses)) followed by a separate event INSERT, with no transaction or row lock spanning the two statements (packages/world-postgres/src/storage.ts, step_started branch).

Under concurrent workers, this allows the following interleaving:

  1. Worker A: step_started UPDATE (status → running) + event INSERT
  2. Worker B: step_started conditional UPDATE passes (step is still running, not terminal)
  3. Worker A: step_completed UPDATE (status → completed) + event INSERT
  4. Worker B: its step_started event INSERT lands after the step_completed event

The log then contains step_started, step_completed, step_started for the same correlationId, and replay fails with CorruptedEventLogError (duplicate step_started, second copy after step_completed).
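The interleaving above can be reproduced deterministically with a small in-memory model. This is only a sketch: `InMemoryWorld`, `simulateRace`, `findLateStepStarted`, the status names, and the contents of `terminalStepStatuses` are illustrative assumptions, not the actual code in packages/world-postgres/src/storage.ts.

```typescript
// Hypothetical in-memory model of the two separate statements (not the real storage.ts).
type StepStatus = "pending" | "running" | "completed" | "failed";
type StepEventType = "step_started" | "step_completed";

interface StepEvent {
  correlationId: string;
  eventType: StepEventType;
}

// Assumption: which statuses count as terminal is not spelled out in the issue.
const terminalStepStatuses: StepStatus[] = ["completed", "failed"];

class InMemoryWorld {
  status = new Map<string, StepStatus>();
  events: StepEvent[] = [];

  // Statement 1: conditional entity UPDATE. Returns false if the step is terminal.
  updateStatus(id: string, next: StepStatus): boolean {
    const current = this.status.get(id) ?? "pending";
    if (terminalStepStatuses.includes(current)) return false;
    this.status.set(id, next);
    return true;
  }

  // Statement 2: event INSERT, issued separately from the UPDATE.
  insertEvent(id: string, eventType: StepEventType): void {
    this.events.push({ correlationId: id, eventType });
  }
}

// Returns the correlationId of the first step_started that follows a
// step_completed for the same step, or null if the log is well ordered.
function findLateStepStarted(events: StepEvent[]): string | null {
  const completed = new Set<string>();
  for (const e of events) {
    if (e.eventType === "step_completed") completed.add(e.correlationId);
    else if (completed.has(e.correlationId)) return e.correlationId;
  }
  return null;
}

function simulateRace(): StepEvent[] {
  const world = new InMemoryWorld();
  const id = "step-1";
  // 1. Worker A: step_started UPDATE + event INSERT
  if (world.updateStatus(id, "running")) world.insertEvent(id, "step_started");
  // 2. Worker B: conditional UPDATE passes (step is running, not terminal)
  const workerBPassed = world.updateStatus(id, "running");
  // 3. Worker A: step_completed UPDATE + event INSERT
  if (world.updateStatus(id, "completed")) world.insertEvent(id, "step_completed");
  // 4. Worker B: its event INSERT lands after step_completed
  if (workerBPassed) world.insertEvent(id, "step_started");
  return world.events;
}

const raceLog = simulateRace();
console.log(raceLog.map((e) => e.eventType).join(", "));
// prints "step_started, step_completed, step_started"
console.log(findLateStepStarted(raceLog)); // prints "step-1"
```

The model makes the gap visible: Worker B's check in step 2 is still valid when it runs, but nothing ties that check to the INSERT in step 4.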

Observed in CI

This was caught by the e2e regression test added in #2295 (parallelStepsThenWebhookWorkflow), which stresses same-tick replay races.

The failure is timing-sensitive, not deterministic (11/11 local stress runs of the same matrix passed).

Relationship to existing protections

  • The partial unique index workflow_events_entity_creation_unique (runId, correlationId, eventType) only covers entity-creation events (step_created, hook_created, wait_created). It cannot help here: duplicate step_started events are legitimate for retries, so a blanket uniqueness constraint is not the fix — the problem is the ordering of a late duplicate relative to step_completed.
  • Related: #2039 ("@workflow/world-postgres can insert duplicate step_created events under concurrent workers") covered duplicate step_created events from concurrent workers, which the unique index addressed. This issue is the lifecycle-event analogue, which the index by design does not cover.

Possible fix shape

Wrap the step entity UPDATE and the event INSERT for step lifecycle events in a single transaction so the step row lock serializes concurrent writers: a late step_started would then block on the row lock until step_completed commits, observe the terminal state in its conditional UPDATE, and reject with EntityConflictError before inserting its event.
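A minimal sketch of why this shape would help, again using an in-memory model rather than real Postgres code. Here `applyAtomically` stands in for one transaction (BEGIN; conditional UPDATE, which takes the step row lock; event INSERT; COMMIT), so concurrent writers can only be observed in some serial order. `World`, `applyAtomically`, `runSerially`, and this local `EntityConflictError` class are hypothetical names for illustration.

```typescript
// Hypothetical model: each call to applyAtomically is one serialized transaction.
type StepStatus = "running" | "completed";
type StepEventType = "step_started" | "step_completed";

class EntityConflictError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "EntityConflictError";
  }
}

interface World {
  status: Map<string, StepStatus>;
  events: StepEventType[];
}

// UPDATE and INSERT happen together or not at all. In Postgres the row lock
// taken by the UPDATE would keep other writers out until COMMIT.
function applyAtomically(world: World, id: string, eventType: StepEventType): void {
  if (world.status.get(id) === "completed") {
    // Conditional UPDATE matched no rows: reject before inserting the event.
    throw new EntityConflictError(`step ${id} is already terminal`);
  }
  world.status.set(id, eventType === "step_started" ? "running" : "completed");
  world.events.push(eventType);
}

interface Write {
  worker: string;
  eventType: StepEventType;
}

// Applies the writes in the given serial order and records which workers were rejected.
function runSerially(writes: Write[]): { log: StepEventType[]; rejected: string[] } {
  const world: World = { status: new Map(), events: [] };
  const rejected: string[] = [];
  for (const w of writes) {
    try {
      applyAtomically(world, "step-1", w.eventType);
    } catch (err) {
      if (err instanceof Error && err.name === "EntityConflictError") rejected.push(w.worker);
      else throw err;
    }
  }
  return { log: world.events, rejected };
}

// Order 1: Worker B's step_started commits before Worker A's step_completed.
const bFirst = runSerially([
  { worker: "A", eventType: "step_started" },
  { worker: "B", eventType: "step_started" },
  { worker: "A", eventType: "step_completed" },
]);
console.log(bFirst.log.join(", ")); // prints "step_started, step_started, step_completed"

// Order 2: Worker B waits on the row lock until step_completed commits, then is rejected.
const bLast = runSerially([
  { worker: "A", eventType: "step_started" },
  { worker: "A", eventType: "step_completed" },
  { worker: "B", eventType: "step_started" },
]);
console.log(bLast.log.join(", ")); // prints "step_started, step_completed"
console.log(bLast.rejected.join(", ")); // prints "B"
```

In both serial orders the log never contains a step_started after step_completed, which is the property replay needs. Whether the real fix uses a plain transaction, an explicit row lock, or something else is left to the implementation.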

Note

The parallelStepsThenWebhookWorkflow e2e test is skipped on world-postgres (WORKFLOW_TARGET_WORLD includes postgres) until this is fixed — re-enable it when this lands. This also blocks fully closing #1665 on the postgres world.
