feat: durable execution (suspend/resume + ctx.durable journal) #786

Open
andreahlert wants to merge 57 commits into apache:main from andreahlert:worktree-durable-execution

@andreahlert
Collaborator

Summary

Adds durable execution to Burr, unifying two approaches in one design:

  • A: action-boundary suspend/resume via __context.suspend(channel, schema=, metadata=). The signal propagates through _Suspended(BaseException), is persisted as a SuspensionRecord, and replayed by resume(...) / aresume(...).
  • B: __context.durable(key, fn, ...) / __context.adurable(...) sub-step journal. Memoizes side effects across re-runs, with DeterminismError raised fail-loud on mismatch.

Closes the gap where Burr could not natively pause for human input, external events, or crash recovery without manual state-fork workarounds.
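To make approach A concrete, here is a minimal, self-contained sketch of an action-boundary suspend signal. It does not use Burr. `Suspended`, `ToyContext`, `run`, and `resume` are hypothetical stand-ins for the PR's `_Suspended`, `__context.suspend(...)`, the run loop, and `resume(...)`, and a plain dict stands in for the persisted `SuspensionRecord`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


class Suspended(BaseException):
    """Toy stand-in for the PR's _Suspended signal (hypothetical name)."""

    def __init__(self, channel: str) -> None:
        super().__init__(channel)
        self.channel = channel


@dataclass
class ToyContext:
    # Payloads delivered on resume, keyed by channel.
    payloads: Dict[str, Any] = field(default_factory=dict)

    def suspend(self, channel: str) -> Any:
        # On replay the payload is already present, so hand it back.
        if channel in self.payloads:
            return self.payloads[channel]
        # First run: unwind to the action boundary.
        raise Suspended(channel)


def review(ctx: ToyContext, state: dict) -> dict:
    decision = ctx.suspend("human_review")
    return {**state, "approved": decision == "yes"}


def run(action, state, store, payloads=None):
    ctx = ToyContext(payloads or {})
    try:
        return "completed", action(ctx, state)
    except Suspended as signal:
        # Persist just enough to re-run the action later.
        store["pending"] = {"channel": signal.channel, "state": state}
        return "suspended", state


def resume(action, store, payload):
    record = store.pop("pending")
    return run(action, record["state"], store, {record["channel"]: payload})


store: dict = {}
status, state = run(review, {"draft": "v1"}, store)
status2, state2 = resume(review, store, "yes")
```

The action is re-run from its start on resume. That is why side effects inside it need the journal from approach B.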

What's new

  • burr/core/durable.py: _Suspended, SuspensionRecord, JournalEntry, DeterminismError, supports_durable_storage().
  • burr/core/resume.py: resume() and aresume() rebuild the app from graph + persister, deliver the payload, run to the next halt.
  • ApplicationContext.suspend / durable / adurable + _handle_suspension / _ahandle_suspension in the sync and async run loops.
  • 5 optional persister methods (save_suspension, load_suspension, save_journal_entry, load_journal, mark_suspension_resolved) with a bool contract on mark_*.
  • First-party persister overrides with dedicated tables/collections: in-memory, SQLite, psycopg2 (PostgreSQL), asyncpg, aiosqlite, redis (sync + async), pymongo. Deprecated shims inherit transparently.
  • Lifecycle hooks: PostActionSuspendHook + async, PreActionResumeHook + async. LocalTrackingClient emits SuspendEntryModel so the Burr UI can render status="suspended".
  • Persister status Literal["completed", "failed", "suspended"] propagated across the persistence layer.
  • Example at examples/durable-execution/ (runnable HITL workflow with notebook + statemachine.png) and concept docs at docs/concepts/durable-execution.rst.
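The optional persister methods and the capability helper can be pictured with the sketch below. It is an assumption about how such a check could work, not the code in `durable.py`. `ToyBasePersister` and both subclasses are hypothetical. The helper returns `True` only when all five methods are overridden, which is one way to read the all-or-nothing contract.

```python
DURABLE_METHODS = (
    "save_suspension",
    "load_suspension",
    "save_journal_entry",
    "load_journal",
    "mark_suspension_resolved",
)


class ToyBasePersister:
    """Hypothetical base class: every durable method defaults to NotImplementedError."""

    def save_suspension(self, record):
        raise NotImplementedError

    def load_suspension(self, app_id):
        raise NotImplementedError

    def save_journal_entry(self, entry):
        raise NotImplementedError

    def load_journal(self, app_id):
        raise NotImplementedError

    def mark_suspension_resolved(self, suspension_id):
        raise NotImplementedError


def supports_durable_storage(persister) -> bool:
    # True only if the persister's class overrides all five methods.
    cls = type(persister)
    return all(
        getattr(cls, name) is not getattr(ToyBasePersister, name)
        for name in DURABLE_METHODS
    )


class FullPersister(ToyBasePersister):
    def save_suspension(self, record): ...
    def load_suspension(self, app_id): ...
    def save_journal_entry(self, entry): ...
    def load_journal(self, app_id): return []
    def mark_suspension_resolved(self, suspension_id): return True


class PartialPersister(ToyBasePersister):
    # Overrides only one method, so it falls back to in-state storage.
    def load_journal(self, app_id): return []


full_ok = supports_durable_storage(FullPersister())
partial_ok = supports_durable_storage(PartialPersister())
base_ok = supports_durable_storage(ToyBasePersister())
```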

Test plan

  • BURR_CI_INTEGRATION_TESTS=true POSTGRES_PORT=… REDIS_DB=… pytest tests/core/test_durable.py tests/core/test_durable_integration.py tests/core/test_durable_persisters.py: 90 passed with real Postgres, asyncpg, redis sync+async, pymongo, mongo shim backends.
  • pytest tests/core/ tests/lifecycle/ tests/tracking/ tests/integrations/persisters/ --ignore=tests/core/test_graphviz_display.py: 425 passed, no regressions. The 25 skipped are env-gated persister tests.
  • End-to-end real-LLM test with ollama (qwen2.5:1.5b) drives the HITL example, simulates a crash by deleting the in-memory app + persister, resumes from SQLite, and asserts the LLM is invoked exactly twice end-to-end — proving the journal prevents side-effect re-fire on replay.
  • Resume idempotency verified: second resume() call on the same suspension is a no-op (no LLM call, final state unchanged).
  • python examples/validate_examples.py accepts the new example directory.
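The resume-idempotency item can be illustrated with a toy store. It assumes the `bool` returned by `mark_suspension_resolved` reports whether this call flipped the flag. The PR does not spell out that meaning, and the real `resume()` may order its load, run, and mark steps differently. `ToyStore` and `resume_once` are hypothetical names.

```python
class ToyStore:
    """Hypothetical in-memory store tracking resolved flags per suspension id."""

    def __init__(self) -> None:
        self._resolved = {}

    def save_suspension(self, suspension_id: str) -> None:
        self._resolved[suspension_id] = False

    def mark_suspension_resolved(self, suspension_id: str) -> bool:
        # Unknown ids and already-resolved ids are both no-ops.
        if self._resolved.get(suspension_id) is not False:
            return False
        self._resolved[suspension_id] = True
        return True


delivered = []


def resume_once(store: ToyStore, suspension_id: str, payload: str) -> str:
    # Only the caller that flips the flag delivers the payload.
    if not store.mark_suspension_resolved(suspension_id):
        return "noop"
    delivered.append(payload)
    return "resumed"


store = ToyStore()
store.save_suspension("s-1")
first = resume_once(store, "s-1", "approve")
second = resume_once(store, "s-1", "approve")
unknown = resume_once(store, "missing", "approve")
```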

Determinism contract (from the design)

  • key stable across re-runs per call site.
  • Same order, same set of durable calls across re-runs.
  • No durable behind a non-deterministic branch.
  • No suspend inside a durable fn.
  • Mismatch raises DeterminismError (fail-loud).
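The contract above can be sketched with a toy ordered journal. This is an illustration under assumptions, not the PR's `__context.durable(...)` implementation. `ToyJournal` and `fake_llm` are hypothetical, and entries are matched purely by position and key.

```python
class DeterminismError(RuntimeError):
    """Raised when a replayed durable call does not match the journal."""


class ToyJournal:
    def __init__(self, entries=None):
        # Each entry is (key, result), in call order.
        self.entries = list(entries or [])
        self._cursor = 0

    def durable(self, key, fn, *args):
        if self._cursor < len(self.entries):
            recorded_key, recorded_result = self.entries[self._cursor]
            if recorded_key != key:
                # Fail loud: the call sequence diverged from the first run.
                raise DeterminismError(
                    f"expected durable key {recorded_key!r}, got {key!r}"
                )
            self._cursor += 1
            return recorded_result  # replay: fn is not called again
        result = fn(*args)
        self.entries.append((key, result))
        self._cursor += 1
        return result


llm_calls = []


def fake_llm(prompt):
    llm_calls.append(prompt)
    return f"draft for {prompt}"


first_run = ToyJournal()
first_result = first_run.durable("draft", fake_llm, "release notes")

# Re-run with the persisted entries: the side effect does not fire again.
replay = ToyJournal(first_run.entries)
replay_result = replay.durable("draft", fake_llm, "release notes")

try:
    ToyJournal(first_run.entries).durable("summary", fake_llm, "release notes")
    mismatch_raised = False
except DeterminismError:
    mismatch_raised = True
```

A durable call behind a non-deterministic branch would shift the cursor on some re-runs and trip the same key check, which is why the contract forbids it.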

Notes

  • _Suspended inherits from BaseException — do not wrap __context.suspend() inside asyncio.shield(...) or task-cancellation guards that catch BaseException.
  • Custom persisters work through the in-state fallback (correct, not optimized for resume-once at concurrent scale).
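The first note follows from Python's exception hierarchy: a `BaseException` subclass passes through `except Exception` but is caught by `except BaseException`. A small sketch, with `Suspended` as a hypothetical stand-in for `_Suspended`:

```python
class Suspended(BaseException):
    """Toy stand-in for the PR's _Suspended signal."""


def suspend():
    raise Suspended("human_review")


def guarded_with_exception():
    try:
        suspend()
    except Exception:
        # Not reached: Suspended is not an Exception subclass.
        return "swallowed"


def guarded_with_base_exception():
    try:
        suspend()
    except BaseException:
        # Reached: this guard hides the suspend signal from the run loop.
        return "swallowed"


try:
    guarded_with_exception()
    outcome = "swallowed"
except Suspended:
    outcome = "propagated"

broad_guard_outcome = guarded_with_base_exception()
```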

…llback

Adds five durable-storage methods (save_suspension, load_suspension,
save_journal_entry, load_journal, mark_suspension_resolved) to
BaseStatePersister and AsyncBaseStatePersister with NotImplementedError
defaults, a real override on InMemoryPersister for tests, the
supports_durable_storage() capability helper, and an in-state fallback
codec in durable.py for persisters that do not override the methods.

Add docstring warnings about dataclasses.asdict not round-tripping nested
types in the in-state codec, document the all-or-nothing override contract
on supports_durable_storage, replace string annotations with real imports
in persistence.py (no circular import), and strengthen the
NotImplementedError test to call all 5 durable methods with real arguments.

Add four unit tests covering the in-state fallback codec functions
(suspension and journal round-trips, channel mismatch, JSON result
preservation). Correct the misleading load_suspension docstring on
BaseStatePersister and AsyncBaseStatePersister to reflect that the
method returns resolved and unresolved records alike. Add type
annotations to the five durable methods on InMemoryPersister to
match the base-class signatures.
…_journal return type

Add tests for mark_suspension_resolved (flag flip and unknown-id no-op), journal
insertion-order sorting, and tighten load_journal return annotation to list[JournalEntry]
on all three sites (requires-python >=3.9).

Docstring now accurately describes the dict-only coercion behavior. The
_context_factory method uses direct attribute access for all three durable
fields instead of mixing getattr with direct access. A comment marks the
intentional omission of _journal_call_index. New test covers schema_json
population on the first suspend call.

Wrap _handle_suspension calls in _step and _astep so that persister or
hook failures clear suspended_signal and fire post_run_step with the
real exception instead of falsely reporting a clean suspension. Also use
self._state in _astep's non-suspended finally branch to pick up state
mutations from delegated sync actions. Strengthen async suspension test
to assert persistence round-trip parity with the sync counterpart.

Remove the direct persister.save call inside _handle_suspension for the
in-state fallback branch. The post_run_step lifecycle hook (PersisterHook)
already saves the step row for every suspended step, so the inline save
was writing the same (partition_key, app_id, sequence_id, position) row
twice, causing an IntegrityError in SQLitePersister due to its UNIQUE
constraint. Remove the _UpsertSQLitePersister workaround subclass from
the test and use bare SQLitePersister directly to confirm the fix.
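The double-write failure described above is easy to reproduce with the standard-library `sqlite3` module. The table below is a hypothetical simplification, not the actual `SQLitePersister` schema. It only mirrors the UNIQUE constraint over the four columns named in the commit message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE steps (
        partition_key TEXT,
        app_id        TEXT,
        sequence_id   INTEGER,
        position      TEXT,
        state         TEXT,
        UNIQUE (partition_key, app_id, sequence_id, position)
    )
    """
)

row = ("p1", "app-1", 3, "review", "{}")
insert = "INSERT INTO steps VALUES (?, ?, ?, ?, ?)"

# First write: what the post_run_step hook would do.
conn.execute(insert, row)

# Second write of the same step row: what the removed inline save did.
try:
    conn.execute(insert, row)
    second_write = "ok"
except sqlite3.IntegrityError:
    second_write = "integrity_error"

row_count = conn.execute("SELECT COUNT(*) FROM steps").fetchone()[0]
```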
…jsonschema

- Remove dead `record.resolved = True` mutation in in-state fallback path of
  resume() and aresume(); replace with comment naming the no-durability rule.
- Expand docstrings on resume() and aresume() to distinguish durable-storage
  idempotency (no-op) from in-state fallback behavior (second call raises).
- Tighten no-record ValueError message to name the in-state fallback cause,
  distinguishing it from a wrong app_id.
- _validate_payload now emits a warnings.warn instead of silently skipping
  when jsonschema is absent; import warnings moved to module level.
- Add M5 deferral comment in application._handle_suspension.
- Add test_resume_in_state_fallback_second_call_raises to integration suite.
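The jsonschema change follows a common optional-dependency pattern: warn instead of skipping silently. The sketch below assumes a hypothetical `validate_payload` helper and is not the PR's `_validate_payload`. The `stacklevel` value shown is only illustrative.

```python
import warnings

try:
    import jsonschema  # optional dependency
except ImportError:
    jsonschema = None


def validate_payload(payload, schema) -> bool:
    """Return True if the payload was validated, False if validation was skipped."""
    if jsonschema is None:
        # Warn rather than skip silently; stacklevel points at the caller.
        warnings.warn(
            "jsonschema is not installed; skipping payload validation",
            stacklevel=2,
        )
        return False
    # Raises jsonschema.ValidationError if the payload does not match.
    jsonschema.validate(instance=payload, schema=schema)
    return True


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validated = validate_payload({"approved": True}, {"type": "object"})
```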
…aresume

Guard aresume load_suspension call with supports_durable_storage check,
mirroring the existing guard used for journal loading in the same function.
Without the guard, async persisters that do not override load_suspension
raised NotImplementedError instead of falling through to _load_suspension.
Also raise warnings.warn stacklevel from 2 to 3 in _validate_payload so
the warning points at the caller of resume/aresume, not the internal helper.

Replaces the silent broken path (TypeError: coroutine object is not
subscriptable) with an explicit NotImplementedError when aresume() is
called with an async persister that does not implement durable storage.
Updates the aresume() docstring to accurately describe the async/sync
paths and their idempotency guarantees. Adds a test to assert the guard.
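The "coroutine object is not subscriptable" failure is what happens when an async method is called without `await` and the result is indexed. A minimal reproduction, using a hypothetical `ToyAsyncPersister` rather than any Burr class:

```python
import asyncio


class ToyAsyncPersister:
    """Hypothetical async persister with a single load method."""

    async def load(self, app_id: str) -> dict:
        return {"state": {"draft": "v1"}}


async def broken(persister: ToyAsyncPersister) -> str:
    coro = persister.load("app-1")  # missing await: this is a coroutine object
    try:
        return coro["state"]
    except TypeError as exc:
        return f"TypeError: {exc}"
    finally:
        coro.close()  # avoid a "never awaited" RuntimeWarning


async def fixed(persister: ToyAsyncPersister) -> dict:
    loaded = await persister.load("app-1")
    return loaded["state"]


broken_result = asyncio.run(broken(ToyAsyncPersister()))
fixed_result = asyncio.run(fixed(ToyAsyncPersister()))
```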
…State

aresume() now raises NotImplementedError for any async persister upfront,
removing unreachable dead branches. Both resume() and aresume() return the
loaded State object directly instead of wrapping it in State() again.

Add test_journal_no_double_count_via_stream_result to verify that
step_a's journal entry is not duplicated when stream_result() fast-
forwards through a non-halt_after action then executes the target
non-streaming action directly.  Reverting the self._journal_sink = []
reset at line 1744 of application.py causes this test to observe 3
journal entries (a_calc, a_calc, b_calc) instead of the correct 2.

Add create_durable_tables_if_not_exist and the 5 durable methods
(save_suspension, load_suspension, mark_suspension_resolved,
save_journal_entry, load_journal) to PostgreSQLPersister in
b_psycopg2.py, mirroring the SQLitePersister implementation with
Postgres dialect adjustments (%s placeholders, ON CONFLICT upserts,
IS NOT DISTINCT FROM for NULL-safe partition_key equality).
Extend test_durable_persisters.py with a Postgres block that skips
unless BURR_CI_INTEGRATION_TESTS=true.
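The two SQL techniques named above, upserts and NULL-safe equality on `partition_key`, can be tried without a Postgres server by using `sqlite3` as a stand-in. SQLite's `IS` operator plays the role of Postgres's `IS NOT DISTINCT FROM`, and `?` placeholders replace `%s`. The `journal` table here is hypothetical and much simpler than the PR's durable tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE journal (
        partition_key TEXT,
        app_id        TEXT NOT NULL,
        key           TEXT NOT NULL,
        result        TEXT,
        UNIQUE (app_id, key)
    )
    """
)

# Upsert: a second write for the same (app_id, key) updates in place.
upsert = """
    INSERT INTO journal (partition_key, app_id, key, result)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (app_id, key) DO UPDATE SET result = excluded.result
"""
conn.execute(upsert, (None, "app-1", "draft", "v1"))
conn.execute(upsert, (None, "app-1", "draft", "v2"))

# Plain equality never matches NULL, so this finds nothing.
plain_matches = conn.execute(
    "SELECT COUNT(*) FROM journal WHERE partition_key = ?", (None,)
).fetchone()[0]

# NULL-safe comparison (Postgres: partition_key IS NOT DISTINCT FROM %s).
null_safe_rows = conn.execute(
    "SELECT result FROM journal WHERE partition_key IS ?", (None,)
).fetchall()
```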
The pg_persister fixture was hardcoded to localhost:5432, which made
it impossible to run against a Postgres on a non-default port without
editing the file. Honor POSTGRES_HOST/PORT/USER/PASSWORD/DB env vars
(with the previous values as defaults), so CI and local Docker setups
both work.
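A fixture that honors these variables might read its settings as sketched below. Only the `localhost` and `5432` defaults come from the commit message. The default user, password, and database name are hypothetical placeholders, as is the `postgres_settings` helper.

```python
import os
from typing import Mapping, Optional


def postgres_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build connection settings from the environment, with local defaults."""
    env = os.environ if env is None else env
    return {
        "host": env.get("POSTGRES_HOST", "localhost"),
        "port": int(env.get("POSTGRES_PORT", "5432")),
        # The three defaults below are placeholders, not the fixture's real values.
        "user": env.get("POSTGRES_USER", "postgres"),
        "password": env.get("POSTGRES_PASSWORD", "postgres"),
        "db_name": env.get("POSTGRES_DB", "postgres"),
    }


defaults = postgres_settings({})
docker = postgres_settings({"POSTGRES_HOST": "db", "POSTGRES_PORT": "5433"})
```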

Add a tiny test that confirms the deprecated postgresql.py shim
inherits durable-storage support from the canonical b_psycopg2
persister without re-declaring methods.

Spec-compliance pass left a few quality gaps in the Postgres durable
methods: parameter type hints were stripped, return types were loose
('list' vs 'list[JournalEntry]'), and 'serde', 'json',
'SuspensionRecord' and 'JournalEntry' were re-imported inside every
method body even though no circular import constraint requires it.

Lift the imports to module top, tighten signatures to match the SQLite
reference, and drop a misleading F401 type-reference comment that
never matched a real annotation. Also drop the persister's state table
in the test fixture teardown so future state-table writes can't leak
between runs.

Remove the NotImplementedError guard for async persisters and add
_aload_suspension, _aload_journal, _arebuild async helpers that handle
all four combos (durable/non-durable x async/sync). aresume() now
awaits async persister calls and branches to sync calls for sync
persisters throughout the load/journal/rebuild/mark-resolved path.
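One way to serve both sync and async persisters from a single async entry point is to await the result only when it is awaitable. This sketch uses `inspect.isawaitable` for the check. The PR may branch on a different signal, so treat this as an assumption. `ToySyncPersister`, `ToyAsyncPersister`, and `aload_suspension` are hypothetical names.

```python
import asyncio
import inspect


class ToySyncPersister:
    def load_suspension(self, app_id: str) -> dict:
        return {"app_id": app_id, "kind": "sync"}


class ToyAsyncPersister:
    async def load_suspension(self, app_id: str) -> dict:
        return {"app_id": app_id, "kind": "async"}


async def aload_suspension(persister, app_id: str) -> dict:
    result = persister.load_suspension(app_id)
    if inspect.isawaitable(result):
        # Async persister: the call returned a coroutine, so await it.
        return await result
    # Sync persister: the call already returned the record.
    return result


sync_record = asyncio.run(aload_suspension(ToySyncPersister(), "app-1"))
async_record = asyncio.run(aload_suspension(ToyAsyncPersister(), "app-1"))
```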
Remove the temporary try/except ValueError guards around post_action_suspend
in _handle_suspension and _ahandle_suspension now that the hooks are
registered. Extend resume()/aresume() with an optional hooks parameter,
thread it through _rebuild/_arebuild, and fire pre_action_resume before
re-running the action. Covers sync post_suspend, sync pre_resume and async
pre_resume with three new tests.

Adds SuspendEntryModel to the tracking models and implements
PostActionSuspendHook on SyncTrackingClient so that a suspend_entry
line is written to the JSONL log whenever an action suspends the run,
enabling the Burr UI to render the suspension status.

Adds examples/durable-execution/ with a draft-review-finalize workflow
demonstrating suspend/resume and durable() journaling. Includes
application.py, notebook.ipynb, README.md, requirements.txt, __init__.py,
and a real statemachine.png generated by graphviz. Extends
test_durable_integration.py with test_example_application_suspends_and_resumes
which loads the example module and exercises the full suspend/resume path
against a tmp_path SQLite DB.
@github-actions bot added the area/core, area/storage, area/hooks, area/tracking, area/integrations, area/website, area/examples, and pr/needs-rebase labels on May 23, 2026
@andreahlert self-assigned this on May 23, 2026