Idempotency follow-ups: atomic keyed start(), post-completion dedupe, and attribute-based run lookup #2376

@pranaygp

Follow-up to the hook-based run idempotency work in #2015, #2373, and #2011. Those PRs ship the in-flight story: deterministic hook tokens as run idempotency keys, hook.getConflict() resolving with the conflicting Run, and code-driven conflict-handling strategies (reject, adopt result, inspect, signal via resumeHook(), supersede via cancel()). This issue tracks the structural gaps that remain before the idempotency story is rock-solid.

Where we stand vs. comparable frameworks

Idempotency is a lifecycle; each framework answers four questions:

| Phase | Temporal | Inngest | DBOS | Workflow (today) |
| --- | --- | --- | --- | --- |
| Admission (dedupe atomic with start?) | Yes: workflow ID enforced server-side at start | Yes: idempotency key dedupes scheduling | Yes: workflow ID is effectively a DB primary key | No: start() always creates a run; the claim happens inside the run body |
| In-flight conflict policy | Enum: Fail / UseExisting / TerminateExisting | None (first wins) | Implicit UseExisting | Code: getConflict() hands the duplicate the owner's Run |
| Post-completion memory | RejectDuplicate, bounded by namespace retention | Fixed 24h TTL per key | Durable record | None: hook released at terminal state |
| Result reuse | No (rejects; caller queries the closed run) | Yes, within window | Yes: same ID returns stored result | Only while the owner is running (conflict.returnValue) |

Where we're ahead: in-flight expressiveness. Policy-as-code beats a static enum ("inspect the owner's status, then decide" or "forward this request's payload to the owner" are inexpressible as configuration). And because the duplicate run is itself durable, the conflict-handling logic gets retries, replay, and observability — in admission-time systems that logic lives in a crashable client.

Where we're behind: admission atomicity and post-completion memory. Notably, no framework offers unbounded memory — Temporal's reject-duplicate is retention-bounded, Inngest is 24h — so the target is a retention window, not forever.

Gaps

  1. Admission isn't atomic (root cause of the rest). The duplicate run is created, billed, queued, and executed before discovering it's a duplicate, and routes need the resume-with-retry dance to bridge the start() → hook-registration window. The docs already note a native atomic start-and-hook-registration API is planned.
  2. No post-completion dedupe. The hook is a lease and the lease dies with the run. A retried request arriving seconds after the owner completes starts fresh duplicate-sensitive work. (Temporal: retention-bounded reject; DBOS: returns the stored result.)
  3. Attributes are writable but not queryable. Runs can set (experimental_setAttributes) and even seed (CreateWorkflowRunParams.attributes) attributes, but ListWorkflowRunsParams only filters by workflowName/status — nothing can find a run by attribute, so attributes can't yet serve as post-completion memory.
  4. Optimistic-strategy races (acceptable, but inherent). Supersede (cancel-and-reclaim) can lose the reclaim to a third arrival (ABA-shaped; the documented retry loop handles it); signal-the-owner can hit HookNotFoundError if the owner completes mid-forward.
  5. Flat token namespace. order:123 collides across unrelated workflows sharing a key scheme. Same property as Temporal; a documented prefix convention is probably sufficient.
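
For gap 5, the prefix convention could be as small as the sketch below. scopedToken is a hypothetical helper, not part of the Workflow API, and the choice of ":" as separator with percent-encoding is an assumption made for illustration.

```typescript
// Hypothetical helper (not part of the Workflow API): builds a hook token
// scoped to one workflow, so identical domain keys used by unrelated
// workflows do not collide in the flat token namespace.
function scopedToken(workflowName: string, key: string): string {
  if (workflowName.length === 0 || key.length === 0) {
    throw new Error("workflowName and key must be non-empty");
  }
  // encodeURIComponent escapes ":" inside either part, so the single
  // unescaped ":" is always the boundary between scope and key.
  return `${encodeURIComponent(workflowName)}:${encodeURIComponent(key)}`;
}

// Same domain key, two workflows, two distinct tokens.
const fulfillToken = scopedToken("fulfillOrder", "order:123");
const refundToken = scopedToken("refundOrder", "order:123");
```

Escaping both parts keeps the mapping unambiguous: a workflow named "a:b" with key "c" cannot produce the same token as workflow "a" with key "b:c".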

Recommendations (in order)

  1. P0: atomic keyed start() with { run, created } return semantics and a retention-bounded uniqueness window. This closes gaps 1 and 2 at once and yields DBOS-style result reuse (created === false + completed → run.returnValue) while keeping policies in code. There is no policy enum: the caller inspects the existing run and decides, which subsumes both of Temporal's knobs (conflict policy + reuse policy).
  2. P1: runs.list attribute filtering in the World contract, then document the attribute pattern as the post-completion bridge. The hook claim is the in-flight mutex; the attribute (idempotency: <key>) is the retention-bounded memory. A duplicate that wins the token after the owner finished queries completed runs by attribute in a step and adopts the prior result via getRun(prior.runId).returnValue. This must be documented as advisory (the query and the subsequent work aren't atomic, so a residual race remains in the just-completed window) and as retention-bounded. It stays useful after keyed start lands, for richer queries.
  3. Document the two patterns that work today so users aren't stranded: the entity pattern (a long-lived run per key looping for await on its hook — strict serialization at the cost of one perpetual run per key and deployment pinning) and the app-record pattern (store runId under the domain key in your own DB inside a step; replays resolve via getRun).
  4. Non-goal: a standalone lease/TTL primitive or a dispose: false hook option. A claim that outlives its run is a leak generator; keyed start with a retention window does the same job with better semantics.
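
To make the P0 semantics concrete, here is a minimal in-memory model of a keyed start with a retention window. It is a sketch of the proposed behavior, not the Workflow API: KeyedRunStore, StartResult, and the explicit `now` parameter are all hypothetical, and the synchronous check-and-insert stands in for a uniqueness constraint that a real World implementation would have to enforce in its backing store.

```typescript
// In-memory model of the proposed semantics; every name here is hypothetical.
type RunStatus = "running" | "completed";

interface Run<T> {
  runId: string;
  status: RunStatus;
  returnValue?: T;
  completedAt?: number; // ms timestamp, set when the run reaches a terminal state
}

interface StartResult<T> {
  run: Run<T>;
  created: boolean;
}

class KeyedRunStore<T> {
  private byKey = new Map<string, Run<T>>();
  private nextId = 1;
  private retentionMs: number;

  constructor(retentionMs: number) {
    this.retentionMs = retentionMs;
  }

  // Admission: the lookup and the insert happen in one synchronous call,
  // so no duplicate run is ever created for a live or retained key.
  start(key: string, now: number): StartResult<T> {
    const existing = this.byKey.get(key);
    if (existing && !this.expired(existing, now)) {
      return { run: existing, created: false };
    }
    const run: Run<T> = { runId: `run_${this.nextId++}`, status: "running" };
    this.byKey.set(key, run);
    return { run, created: true };
  }

  // Marks the run for `key` as completed and records when retention starts.
  complete(key: string, value: T, now: number): void {
    const run = this.byKey.get(key);
    if (!run) throw new Error(`no run for key ${key}`);
    run.status = "completed";
    run.returnValue = value;
    run.completedAt = now;
  }

  // A key is reusable only once its run completed and the window elapsed.
  private expired(run: Run<T>, now: number): boolean {
    return run.completedAt !== undefined && now - run.completedAt >= this.retentionMs;
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const store = new KeyedRunStore<string>(DAY_MS);
const first = store.start("order:123", 0);              // new run
const inFlightDup = store.start("order:123", 10);       // owner still running
store.complete("order:123", "shipped", 20);
const lateDup = store.start("order:123", 1_000);        // inside the window
const afterWindow = store.start("order:123", 20 + DAY_MS); // window elapsed
```

In this model a caller branches on the result instead of on a policy enum: created === false with status "running" is the in-flight conflict case, and created === false with status "completed" is the result-reuse case, where run.returnValue can be adopted directly.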

🤖 Generated with Claude Code
