Fix #1776: Startup blocked for ~100s by synchronous orphan episode recovery + LLM timeout c #2000
Memtensor-AI wants to merge 1 commit
…1776)

memos-local-plugin's `core.init()` synchronously awaited the entire reflect / reward / L2 chain for stale orphan episodes left over from the previous process. With the configured `timeoutMs=45000` and three typical orphans, the chain accumulated ~100 s of LLM round-trips before `startHttpServer(...)` was allowed to run, so the OpenClaw gateway was unreachable for the whole window.

Split the work in `createMemoryCore`:

- `init()` still does the cheap synchronous classification (lightweight close + `topicState=interrupted` meta update for recent orphans + list building) so the next user turn is correctly routed.
- The slow `recoverOpenEpisodesAsSessionEnd` + `recoverDirtyClosedEpisodes` calls are scheduled on a module-scoped `startupRecoveryPromise` and swallowed via `.catch(log.warn)` so they can never wedge shutdown.
- A new optional `MemoryCore.waitForStartupRecovery()` lets tests and shutdown await that background work explicitly. Production adapters (`adapters/openclaw/index.ts`, `bridge.cts`) intentionally skip it so the viewer comes up immediately.
- `shutdown()` awaits the background promise before tearing down so the in-flight reflect listeners don't hit a closed SQLite handle.

Adds 4 new unit tests covering the new contract (fast init, observable wait, no-op when empty, shutdown drain) and threads `waitForStartupRecovery?.()` into the 3 existing orphan-recovery tests that depend on the slow path completing.

Test results:

- tests/unit/pipeline/memory-core.test.ts: 32/32 passed
- related adapter/server/bridge suites: 174/174 passed
- `tsc --noEmit`: clean

The two pre-existing failures in tests/unit/storage/{migrator, traces-count}.test.ts reproduce on the unchanged base branch and are unrelated to this fix.
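The split described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `createMiniCore`, `classifyOrphans`, `recoverSlow`, and `closeStorage` are hypothetical stand-ins, and the promise lives in a closure here rather than at module scope to keep the example self-contained.

```typescript
// Hypothetical, simplified stand-in for the memory core (assumption, not the real API).
type Logger = { warn: (msg: string, err?: unknown) => void };

interface MiniCore {
  init(): Promise<void>;
  waitForStartupRecovery(): Promise<void>;
  shutdown(): Promise<void>;
}

function createMiniCore(
  classifyOrphans: () => string[],               // cheap, synchronous classification
  recoverSlow: (ids: string[]) => Promise<void>, // slow reflect / reward / L2 chain
  closeStorage: () => void,                      // e.g. closing the SQLite handle
  log: Logger,
): MiniCore {
  // Tracks the background recovery so tests and shutdown can drain it.
  let startupRecoveryPromise: Promise<void> | null = null;

  return {
    async init() {
      const orphans = classifyOrphans();
      if (orphans.length === 0) return; // nothing to recover: stay a no-op

      // Schedule the slow work without awaiting it. Wrapping in
      // Promise.resolve().then(...) also routes synchronous throws into .catch.
      startupRecoveryPromise = Promise.resolve()
        .then(() => recoverSlow(orphans))
        .catch((err) => {
          // Swallow failures so a broken recovery cannot wedge shutdown.
          log.warn("init.background_recovery_failed", err);
        });
    },

    async waitForStartupRecovery() {
      await startupRecoveryPromise; // resolves immediately when null
    },

    async shutdown() {
      // Drain in-flight recovery before closing storage.
      await startupRecoveryPromise;
      closeStorage();
    },
  };
}
```

Because the stored promise already has its rejection handled, both `waitForStartupRecovery()` and `shutdown()` can await it without a `try`/`catch` of their own.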
✅ Automated Test Results: PASSED

No applicable test scope for the changed files; automated tests skipped. Changed paths do not map to any configured scope (env.yaml source_mapping). Manual review recommended.
Closure recommendation: DO NOT MERGE as-is; superseded by #2002.

This PR fixes the non-blocking startup recovery path for #1776, but #2002 now includes the same startup-recovery mechanism plus the additional dirty-rescore failure backoff for #1808. I resolved and cloud-tested #2002 against dev-v2.0.22 instead:

- #2002 merge conflict resolved
- Cloud test-engine run tr-9960eb48-574 PASSED
- Scope: memos_local_plugin
- Result: 33/33 tests passed

Keeping both PRs open/mergeable would duplicate the same memory-core startup recovery changes and increase conflict risk. Recommendation: close #2000 after confirming #2002 is the intended replacement.
Description
Fixed issue #1776: memos-local-plugin's `core.init()` no longer blocks for ~100s on synchronous orphan-episode reflect/reward/L2 work. Root cause was the two `await recoverOpenEpisodesAsSessionEnd` / `recoverDirtyClosedEpisodes` calls inside `createMemoryCore.init()`: each stale orphan fanned out into reflect (LLM 45s × 3 retries) → reward → L2 induction, all gated before `startHttpServer` could bind.

Solution: split `init()` so the cheap synchronous classification (lightweight close + `topicState=interrupted` meta updates for recent open episodes + list building) stays inline, while the slow recovery is scheduled on a module-scoped `startupRecoveryPromise` whose errors are swallowed via `.catch(log.warn)`. Added a new optional `MemoryCore.waitForStartupRecovery()` so tests and graceful shutdown can opt into draining the background work; production adapters (`adapters/openclaw/index.ts`, `bridge.cts`) intentionally skip it so the HTTP viewer starts immediately. `shutdown()` now awaits the background promise before tearing down to avoid SQLite misuse from mid-flight reflect listeners. Added two new log keys: `init.background_recovery_started` and `init.background_recovery_failed`.

Test evidence: 4 new unit tests under `describe("issue #1776 — non-blocking startup recovery")` cover fast `init()` (returns within 500ms even with 3 stale orphans seeded), observable wait, no-op-on-empty, and shutdown-drain. The 3 existing orphan-recovery tests gained a single `await core.waitForStartupRecovery?.()` line.

Test results:

- tests/unit/pipeline/memory-core.test.ts: 32/32 passed
- related adapter/server/bridge suites: 174/174 passed
- full unit suite: 1044/1047 passed
- `tsc --noEmit`: clean

The 2 remaining failures (tests/unit/storage/{migrator, traces-count}.test.ts) reproduce on the unchanged base branch and are unrelated to this fix.

Files changed:

- apps/memos-local-plugin/agent-contract/memory-core.ts (interface)
- apps/memos-local-plugin/core/pipeline/memory-core.ts (init/shutdown refactor + new `waitForStartupRecovery`)
- apps/memos-local-plugin/tests/unit/pipeline/memory-core.test.ts (4 new tests + 3 updated)

Branch pushed to origin/bugfix/autodev-1776; opsp artifacts (proposal/spec/design/verification-report/task) archived to memos-autodev-specs main.
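For reviewers who want a feel for the "fast `init()`, observable wait" contract without the full test suite, a framework-free sketch might look like the following. Everything here is an assumption for illustration: `makeFakeCore` and `measureInit` are hypothetical helpers, the fake recovery is just a timer, and the real tests in memory-core.test.ts may be structured quite differently.

```typescript
// Minimal shape of the contract under test; waitForStartupRecovery is optional,
// mirroring the `?.()` call mentioned in the description.
interface CoreLike {
  init(): Promise<void>;
  waitForStartupRecovery?(): Promise<void>;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical fake: each "orphan" costs perOrphanMs of background work.
function makeFakeCore(orphanCount: number, perOrphanMs: number) {
  const state = { recovered: 0 };
  let background: Promise<void> = Promise.resolve();

  const core: CoreLike = {
    async init() {
      // Start the slow loop but do not await it, so init() returns promptly.
      background = (async () => {
        for (let i = 0; i < orphanCount; i++) {
          await sleep(perOrphanMs);
          state.recovered++;
        }
      })();
    },
    async waitForStartupRecovery() {
      await background;
    },
  };
  return { core, state };
}

// Measures how long init() takes, then how long the full drain takes.
async function measureInit(core: CoreLike): Promise<{ initMs: number; totalMs: number }> {
  const start = Date.now();
  await core.init();
  const initMs = Date.now() - start;
  await core.waitForStartupRecovery?.(); // no-op for cores that lack the method
  const totalMs = Date.now() - start;
  return { initMs, totalMs };
}
```

A check in the spirit of the new tests would assert that `initMs` stays under the 500ms budget while `state.recovered` only reaches the seeded orphan count after the wait completes.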
Related Issue (Required): Fixes #1776
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Automated tests are pending.
Checklist
@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.
Reviewer Checklist