feat: durable execution (suspend/resume + ctx.durable journal) #786
Open
andreahlert wants to merge 57 commits into
…llback Adds five durable-storage methods (save_suspension, load_suspension, save_journal_entry, load_journal, mark_suspension_resolved) to BaseStatePersister and AsyncBaseStatePersister with NotImplementedError defaults, a real override on InMemoryPersister for tests, the supports_durable_storage() capability helper, and an in-state fallback codec in durable.py for persisters that do not override the methods.
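One way to picture the capability check described in this commit is the sketch below. It is a toy under stated assumptions, not Burr's code: `ToyBasePersister`, `toy_supports_durable_storage`, and the method signatures are hypothetical stand-ins, and the real `supports_durable_storage()` may detect overrides differently.

```python
# Toy stand-ins; class names and signatures are illustrative, not Burr's API.
DURABLE_METHODS = (
    "save_suspension",
    "load_suspension",
    "save_journal_entry",
    "load_journal",
    "mark_suspension_resolved",
)


class ToyBasePersister:
    """Base class whose durable methods all default to NotImplementedError."""

    def save_suspension(self, record):
        raise NotImplementedError

    def load_suspension(self, suspension_id):
        raise NotImplementedError

    def save_journal_entry(self, entry):
        raise NotImplementedError

    def load_journal(self):
        raise NotImplementedError

    def mark_suspension_resolved(self, suspension_id):
        raise NotImplementedError


def toy_supports_durable_storage(persister) -> bool:
    """True only if the persister's class overrides all five methods."""
    cls = type(persister)
    return all(
        getattr(cls, name) is not getattr(ToyBasePersister, name)
        for name in DURABLE_METHODS
    )


class PartialPersister(ToyBasePersister):
    # Overrides a single method, so the all-or-nothing check fails.
    def save_suspension(self, record):
        return None


class FullPersister(ToyBasePersister):
    def __init__(self):
        self.suspensions = {}
        self.journal = []

    def save_suspension(self, record):
        self.suspensions[record["id"]] = dict(record, resolved=False)

    def load_suspension(self, suspension_id):
        return self.suspensions.get(suspension_id)

    def save_journal_entry(self, entry):
        self.journal.append(entry)

    def load_journal(self):
        return list(self.journal)

    def mark_suspension_resolved(self, suspension_id):
        # Returns False for an unknown id instead of raising.
        if suspension_id not in self.suspensions:
            return False
        self.suspensions[suspension_id]["resolved"] = True
        return True
```

A caller could branch on the helper and use an in-state fallback when it returns `False`.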
Add docstring warnings about dataclasses.asdict not round-tripping nested types in the in-state codec, document the all-or-nothing override contract on supports_durable_storage, replace string annotations with real imports in persistence.py (no circular import), and strengthen the NotImplementedError test to call all 5 durable methods with real arguments.
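The `dataclasses.asdict` caveat can be reproduced in isolation. The sketch below uses two hypothetical dataclasses (`Inner`, `Outer`) that are not part of the PR; it only shows why a nested type does not survive a naive `asdict` round trip.

```python
from dataclasses import asdict, dataclass


@dataclass
class Inner:  # hypothetical nested type
    value: int


@dataclass
class Outer:  # hypothetical record holding a nested dataclass
    name: str
    inner: Inner


original = Outer(name="demo", inner=Inner(value=1))

# asdict() recurses into nested dataclasses and turns them into plain dicts.
encoded = asdict(original)

# Rebuilding from the dict does not restore the nested type.
rebuilt = Outer(**encoded)

inner_is_dict = isinstance(rebuilt.inner, dict)
round_trip_ok = rebuilt == original
```

Here `rebuilt.inner` is a `dict`, so a codec built this way needs an explicit decoding step for nested fields.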
Add four unit tests covering the in-state fallback codec functions (suspension and journal round-trips, channel mismatch, JSON result preservation). Correct the misleading load_suspension docstring on BaseStatePersister and AsyncBaseStatePersister to reflect that the method returns resolved and unresolved records alike. Add type annotations to the five durable methods on InMemoryPersister to match the base-class signatures.
…_journal return type Add tests for mark_suspension_resolved (flag flip and unknown-id no-op), journal insertion-order sorting, and tighten load_journal return annotation to list[JournalEntry] on all three sites (requires-python >=3.9).
Docstring now accurately describes the dict-only coercion behavior. The _context_factory method uses direct attribute access for all three durable fields instead of mixing getattr with direct access. A comment marks the intentional omission of _journal_call_index. New test covers schema_json population on the first suspend call.
Wrap _handle_suspension calls in _step and _astep so that persister or hook failures clear suspended_signal and fire post_run_step with the real exception instead of falsely reporting a clean suspension. Also use self._state in _astep's non-suspended finally branch to pick up state mutations from delegated sync actions. Strengthen async suspension test to assert persistence round-trip parity with the sync counterpart.
Remove the direct persister.save call inside _handle_suspension for the in-state fallback branch. The post_run_step lifecycle hook (PersisterHook) already saves the step row for every suspended step, so the inline save was writing the same (partition_key, app_id, sequence_id, position) row twice, causing an IntegrityError in SQLitePersister due to its UNIQUE constraint. Remove the _UpsertSQLitePersister workaround subclass from the test and use bare SQLitePersister directly to confirm the fix.
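The failure mode described in this commit can be reproduced with the standard `sqlite3` module alone. The table below is a hypothetical stand-in (`toy_state`), not the schema `SQLitePersister` actually creates; only the four-column UNIQUE constraint is taken from the commit message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE toy_state (
        partition_key TEXT,
        app_id TEXT,
        sequence_id INTEGER,
        position TEXT,
        state TEXT,
        UNIQUE (partition_key, app_id, sequence_id, position)
    )
    """
)

insert = "INSERT INTO toy_state VALUES (?, ?, ?, ?, ?)"
row = ("pk", "app-1", 0, "review", "{}")

conn.execute(insert, row)  # first write, e.g. from a lifecycle hook
try:
    conn.execute(insert, row)  # second write of the same step row
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM toy_state").fetchone()[0]
conn.close()
```

The second insert is rejected, which is why saving the step row from a single place avoids the error.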
…jsonschema
- Remove dead `record.resolved = True` mutation in in-state fallback path of resume() and aresume(); replace with comment naming the no-durability rule.
- Expand docstrings on resume() and aresume() to distinguish durable-storage idempotency (no-op) from in-state fallback behavior (second call raises).
- Tighten no-record ValueError message to name the in-state fallback cause, distinguishing it from a wrong app_id.
- _validate_payload now emits a warnings.warn instead of silently skipping when jsonschema is absent; import warnings moved to module level.
- Add M5 deferral comment in application._handle_suspension.
- Add test_resume_in_state_fallback_second_call_raises to integration suite.
…aresume Guard aresume load_suspension call with supports_durable_storage check, mirroring the existing guard used for journal loading in the same function. Without the guard, async persisters that do not override load_suspension raised NotImplementedError instead of falling through to _load_suspension. Also raise warnings.warn stacklevel from 2 to 3 in _validate_payload so the warning points at the caller of resume/aresume, not the internal helper.
Replaces the silent broken path (TypeError: coroutine object is not subscriptable) with an explicit NotImplementedError when aresume() is called with an async persister that does not implement durable storage. Updates the aresume() docstring to accurately describe the async/sync paths and their idempotency guarantees. Adds a test to assert the guard.
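The `TypeError` named in this commit is what Python raises when the result of an async function is subscripted without being awaited. A minimal reproduction, using a hypothetical `load_suspension_async` coroutine rather than any real persister method:

```python
import asyncio


async def load_suspension_async():
    # Hypothetical async loader; stands in for an async persister call.
    return {"channel": "review"}


# Calling without await yields a coroutine object, not the record.
pending = load_suspension_async()
try:
    pending["channel"]
    message = ""
except TypeError as exc:
    message = str(exc)
finally:
    pending.close()  # avoid a "never awaited" warning

# Running the coroutine to completion returns the actual record.
record = asyncio.run(load_suspension_async())
```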
…State aresume() now raises NotImplementedError for any async persister upfront, removing unreachable dead branches. Both resume() and aresume() return the loaded State object directly instead of wrapping it in State() again.
Add test_journal_no_double_count_via_stream_result to verify that step_a's journal entry is not duplicated when stream_result() fast-forwards through a non-halt_after action then executes the target non-streaming action directly. Reverting the self._journal_sink = [] reset at line 1744 of application.py causes this test to observe 3 journal entries (a_calc, a_calc, b_calc) instead of the correct 2.
Add create_durable_tables_if_not_exist and the 5 durable methods (save_suspension, load_suspension, mark_suspension_resolved, save_journal_entry, load_journal) to PostgreSQLPersister in b_psycopg2.py, mirroring the SQLitePersister implementation with Postgres dialect adjustments (%s placeholders, ON CONFLICT upserts, IS NOT DISTINCT FROM for NULL-safe partition_key equality). Extend test_durable_persisters.py with a Postgres block that skips unless BURR_CI_INTEGRATION_TESTS=true.
The pg_persister fixture was hardcoded to localhost:5432, which made it impossible to run against a Postgres on a non-default port without editing the file. Honor POSTGRES_HOST/PORT/USER/PASSWORD/DB env vars (with the previous values as defaults), so CI and local Docker setups both work. Add a tiny test that confirms the deprecated postgresql.py shim inherits durable-storage support from the canonical b_psycopg2 persister without re-declaring methods.
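A fixture helper along these lines could read the connection settings. This is a sketch, not the fixture from the PR: `postgres_settings` is a hypothetical name, and apart from `localhost` and `5432` the default values are placeholders, since the commit message does not state the previous user, password, or database.

```python
import os


def postgres_settings(env=None):
    """Read Postgres connection settings, falling back to defaults."""
    env = os.environ if env is None else env
    return {
        "host": env.get("POSTGRES_HOST", "localhost"),
        "port": int(env.get("POSTGRES_PORT", "5432")),
        # The three defaults below are placeholders, not the PR's values.
        "user": env.get("POSTGRES_USER", "postgres"),
        "password": env.get("POSTGRES_PASSWORD", "postgres"),
        "db_name": env.get("POSTGRES_DB", "postgres"),
    }


default_settings = postgres_settings({})
docker_settings = postgres_settings({"POSTGRES_PORT": "55432"})
```

Passing the mapping explicitly keeps the helper easy to test without mutating `os.environ`.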
Spec-compliance pass left a few quality gaps in the Postgres durable
methods: parameter type hints were stripped, return types were loose
('list' vs 'list[JournalEntry]'), and 'serde', 'json',
'SuspensionRecord' and 'JournalEntry' were re-imported inside every
method body even though no circular import constraint requires it.
Lift the imports to module top, tighten signatures to match the SQLite
reference, and drop a misleading F401 type-reference comment that
never matched a real annotation. Also drop the persister's state table
in the test fixture teardown so future state-table writes can't leak
between runs.
Remove the NotImplementedError guard for async persisters and add _aload_suspension, _aload_journal, _arebuild async helpers that handle all four combos (durable/non-durable x async/sync). aresume() now awaits async persister calls and branches to sync calls for sync persisters throughout the load/journal/rebuild/mark-resolved path.
Remove the temporary try/except ValueError guards around post_action_suspend in _handle_suspension and _ahandle_suspension now that the hooks are registered. Extend resume()/aresume() with an optional hooks parameter, thread it through _rebuild/_arebuild, and fire pre_action_resume before re-running the action. Covers sync post_suspend, sync pre_resume and async pre_resume with three new tests.
Adds SuspendEntryModel to the tracking models and implements PostActionSuspendHook on SyncTrackingClient so that a suspend_entry line is written to the JSONL log whenever an action suspends the run, enabling the Burr UI to render the suspension status.
Adds examples/durable-execution/ with a draft-review-finalize workflow demonstrating suspend/resume and durable() journaling. Includes application.py, notebook.ipynb, README.md, requirements.txt, __init__.py, and a real statemachine.png generated by graphviz. Extends test_durable_integration.py with test_example_application_suspends_and_resumes which loads the example module and exercises the full suspend/resume path against a tmp_path SQLite DB.
Summary
Adds durable execution to Burr, unifying two approaches in one design:
- `__context.suspend(channel, schema=, metadata=)`. The signal propagates through `_Suspended(BaseException)`, is persisted as a `SuspensionRecord`, and replayed by `resume(...)` / `aresume(...)`.
- `__context.durable(key, fn, ...)` / `__context.adurable(...)` sub-step journal. Memoizes side effects across re-runs, with `DeterminismError` raised fail-loud on mismatch.

Closes the gap where Burr could not natively pause for human input, external events, or crash recovery without manual state-fork workarounds.
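The interplay between the two mechanisms can be illustrated with a small standalone model. Everything in this sketch is hypothetical (`ToySuspended`, `ToyDeterminismError`, `ToyJournal`, `review_step`); it does not use Burr's API and only mirrors the behavior described above: a suspension signal based on `BaseException`, and a journal that replays recorded results instead of re-running side effects.

```python
class ToySuspended(BaseException):
    """Toy suspension signal; carries the channel being waited on."""

    def __init__(self, channel):
        super().__init__(channel)
        self.channel = channel


class ToyDeterminismError(RuntimeError):
    """Raised when a replayed call does not match the journal."""


class ToyJournal:
    def __init__(self, entries=None):
        self.entries = list(entries or [])  # recorded (key, result) pairs
        self._index = 0  # position of the next durable call

    def durable(self, key, fn, *args):
        if self._index < len(self.entries):
            recorded_key, result = self.entries[self._index]
            if recorded_key != key:
                raise ToyDeterminismError(
                    f"expected {recorded_key!r}, got {key!r}"
                )
        else:
            result = fn(*args)  # first run: execute and record
            self.entries.append((key, result))
        self._index += 1
        return result


calls = []


def expensive(x):
    calls.append(x)  # stands in for a side effect such as an LLM call
    return x * 2


def review_step(journal, payload=None):
    draft = journal.durable("draft", expensive, 21)
    if payload is None:
        raise ToySuspended("review")  # pause until a payload arrives
    return {"draft": draft, "approved": payload["approved"]}


first_run = ToyJournal()
try:
    review_step(first_run)
except ToySuspended as signal:
    channel = signal.channel

# Resume: rebuild the journal from the recorded entries and re-run the step.
replay = ToyJournal(first_run.entries)
result = review_step(replay, {"approved": True})
```

On the second run `expensive` is not called again; its result comes from the journal, which is the property the PR relies on to avoid re-firing side effects.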
What's new
- `burr/core/durable.py`: `_Suspended`, `SuspensionRecord`, `JournalEntry`, `DeterminismError`, `supports_durable_storage()`.
- `burr/core/resume.py`: `resume()` and `aresume()` rebuild the app from `graph` + `persister`, deliver the payload, run to the next halt.
- `ApplicationContext.suspend / durable / adurable` + `_handle_suspension` / `_ahandle_suspension` in the sync and async run loops.
- Durable-storage persister methods (`save_suspension`, `load_suspension`, `save_journal_entry`, `load_journal`, `mark_suspension_resolved`) with a `bool` contract on `mark_*`.
- `PostActionSuspendHook` + async, `PreActionResumeHook` + async.
- `LocalTrackingClient` emits `SuspendEntryModel` so the Burr UI can render `status="suspended"`.
- `Literal["completed", "failed", "suspended"]` propagated across the persistence layer.
- `examples/durable-execution/` (runnable HITL workflow with notebook + statemachine.png) and concept docs at `docs/concepts/durable-execution.rst`.
Test plan
- `BURR_CI_INTEGRATION_TESTS=true POSTGRES_PORT=… REDIS_DB=… pytest tests/core/test_durable.py tests/core/test_durable_integration.py tests/core/test_durable_persisters.py`: 90 passed with real Postgres, asyncpg, redis sync+async, pymongo, mongo shim backends.
- `pytest tests/core/ tests/lifecycle/ tests/tracking/ tests/integrations/persisters/ --ignore=tests/core/test_graphviz_display.py`: 425 passed, no regressions. The 25 skipped are env-gated persister tests.
- An LLM (`qwen2.5:1.5b`) drives the HITL example, simulates a crash by deleting the in-memory app + persister, resumes from SQLite, and asserts the LLM is invoked exactly twice end-to-end, proving the journal prevents side-effect re-fire on replay.
- A second `resume()` call on the same suspension is a no-op (no LLM call, final state unchanged).
- `python examples/validate_examples.py` accepts the new example directory.
Determinism contract (from the design)
- `key` stable across re-runs per call site.
- Same `durable` calls across re-runs.
- No `durable` behind a non-deterministic branch.
- No `suspend` inside a `durable` fn.
- A mismatch raises `DeterminismError` (fail-loud).
Notes
`_Suspended` inherits from `BaseException`, so do not wrap `__context.suspend()` inside `asyncio.shield(...)` or task-cancellation guards that catch `BaseException`.
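The reason for this note follows from Python's exception hierarchy: an `except Exception` handler lets a `BaseException` subclass pass through, while an `except BaseException` handler swallows it. The sketch below uses a hypothetical `ToySuspended` class and helper functions, not Burr's `_Suspended`.

```python
class ToySuspended(BaseException):
    """Toy stand-in for a suspension signal derived from BaseException."""


def suspend():
    raise ToySuspended("review")


def guarded_with_exception():
    # Ordinary error handling: the suspension signal is not caught here.
    try:
        suspend()
    except Exception:
        return "swallowed"


def guarded_with_base_exception():
    # A broad guard: the signal is caught and never reaches the run loop.
    try:
        suspend()
    except BaseException:
        return "swallowed"


try:
    guarded_with_exception()
    propagated = False
except ToySuspended:
    propagated = True

broad_guard_result = guarded_with_base_exception()
```

In the first case the signal reaches the caller; in the second it is lost, so the step would not be recorded as suspended.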