Fix #1808: fix: [memos-local-plugin] Dreaming background processing starves Gateway event l… #2002
…op + backoff failed dirty rescores (#1808)

`core.init()` in memos-local-plugin used to synchronously await the entire reflect → reward → L2-induction chain for every orphan and dirty closed episode discovered in SQLite. On the 30K-trace / 324MB database described in issue #1808 this blocked the OpenClaw Gateway's main Node.js event loop for 3-5s+, well past the 3s WebSocket read-probe budget, so TUI and Control UI clients could no longer connect.

The fix:

* Split `init()` into a fast synchronous classification (lightweight orphan close + `topicState=interrupted` meta updates + stale / dirty bucket assembly) and a background `startupRecoveryPromise` that runs the slow reflect/reward chain off the critical path. Adapter `await core.init()` now returns in milliseconds regardless of the database size.
* Expose `MemoryCore.waitForStartupRecovery?()` so tests and one-shot batch tools can opt back into the historic "await everything" semantics. The promise is contracted to never reject; failures are logged on the `init.background_recovery_failed` channel.
* `shutdown()` now awaits the promise before tearing down storage so the SQLite handle is not closed mid-flush.
* Add per-episode failure tracking (`meta.rewardDirty.failedAttempts` / `lastFailureAt`) to dirty closed episodes. After `MAX_DIRTY_REWARD_ATTEMPTS=3` consecutive failed automatic rescores the episode enters an exponential backoff (1h → cap 24h) before the init scan / 10-min periodic timer touches it again. Manual feedback / `runManually` still rescore unconditionally. This closes the "retried indefinitely with no backoff" sub-symptom reported on the same issue.

Tests: pipeline+bridge+adapters unit suite 127/127 green, full unit suite 1044/1047 (the 2 unrelated `storage/traces-count` + `storage/migrator` failures reproduce on the base branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
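For reviewers, the init / background-recovery split described in the commit message can be pictured roughly as follows. This is a minimal sketch, not the code in `memory-core.ts`: the class `MemoryCoreSketch`, the `Episode` type, the `recover` callback, and the in-memory `log` array are hypothetical stand-ins, and only the names `init`, `shutdown`, `waitForStartupRecovery`, `startupRecoveryPromise`, and `init.background_recovery_failed` come from the PR.

```typescript
// Hypothetical sketch of the pattern; not the plugin's actual implementation.
type Episode = { id: string };

class MemoryCoreSketch {
  // Resolved by default so shutdown() is safe even if init() never ran.
  private startupRecoveryPromise: Promise<void> = Promise.resolve();
  // Stand-in for the real logger (assumption).
  public readonly log: string[] = [];

  constructor(
    private readonly pending: Episode[],
    // Stand-in for the slow reflect/reward chain (assumption).
    private readonly recover: (ep: Episode) => Promise<void>,
  ) {}

  // Fast path: classify cheaply, schedule the slow work, return immediately.
  async init(): Promise<void> {
    const dirty = [...this.pending]; // placeholder for the cheap classification
    // The .catch() makes the stored promise one that never rejects.
    this.startupRecoveryPromise = this.runRecovery(dirty).catch((err) => {
      this.log.push(`init.background_recovery_failed: ${String(err)}`);
    });
  }

  private async runRecovery(episodes: Episode[]): Promise<void> {
    for (const ep of episodes) {
      // Yield to the event loop before each episode so other I/O can run.
      await new Promise<void>((resolve) => setTimeout(resolve, 0));
      await this.recover(ep);
    }
  }

  // Opt back into "await everything" semantics (tests, batch tools).
  waitForStartupRecovery(): Promise<void> {
    return this.startupRecoveryPromise;
  }

  // Drain background work before storage would be torn down.
  async shutdown(): Promise<void> {
    await this.startupRecoveryPromise;
    this.log.push("storage closed");
  }
}
```

In this sketch a failing episode aborts the rest of the background pass and is only logged; whether the real implementation continues with the remaining episodes is not stated in the PR.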
✅ Automated Test Results: PASSED
No applicable test scope for the changed files; automated tests skipped. Changed paths do not map to any configured scope (env.yaml source_mapping). Manual review recommended.
❌ Automated Test Results: FAILED
Automated Test Results: PASSED

Cloud test-engine rerun after resolving the dev-v2.0.22 merge conflict.

Run: tr-9960eb48-574
Scope: memos_local_plugin
Result: 33/33 tests passed
Command group: memos_local_plugin/unit
Duration: 11s

Local pre-push verification also passed: npm run build, plus focused vitest for memory-core startup recovery/backoff tests including the #1966 ghost-trace guard.

Status: merge conflict resolved; automated scope test passed. Manual code review is still required before merge.
Description
Fix issue #1808 — memos-local-plugin no longer starves the OpenClaw Gateway event loop during plugin bootstrap.
Root cause: `core.init()` in `apps/memos-local-plugin/core/pipeline/memory-core.ts` synchronously awaited the full reflect → reward → L2-induction chain for every orphan / dirty closed episode in SQLite. On the reporter's 30K-trace / 324MB database that pinned the Node.js event loop for 3-5s+, causing OpenClaw's 3s WebSocket read probe to time out so TUI and Control UI clients could not connect. A secondary symptom, failed LLM rescores being retried indefinitely with no backoff, kept the loop alive across restarts and the 10-minute periodic rescan.

Fix delivered on `bugfix/autodev-1808` (commit 26c96b0):

1. Split `init()` into a fast synchronous orphan / dirty classification and a background `startupRecoveryPromise` that runs the slow reflect/reward chain off the Gateway's critical path.
2. Added an optional `MemoryCore.waitForStartupRecovery?()` method so tests and one-shot batch tools can opt back into the historic "await everything" semantics. The promise is contracted never to reject.
3. `shutdown()` now awaits the background promise before tearing down storage so SQLite is not closed mid-flush.
4. Introduced per-episode failure tracking under `meta.rewardDirty`. After `MAX_DIRTY_REWARD_ATTEMPTS=3` consecutive failed automatic rescores the episode enters an exponential backoff (1h → 24h cap) and the init + periodic scans skip it until the cooldown elapses; manual feedback / `runManually` still rescore unconditionally.

Tests: `tests/unit/pipeline/memory-core.test.ts` now 32/32 green (3 existing orphan/dirty tests updated to await the new promise; 4 net new tests covering init latency contract, backoff filter, failure-counter logic, and shutdown synchronization). The pipeline + bridge + adapters slice passes 127/127. Full unit sweep is 1044/1047; the 2 unrelated failures in `tests/unit/storage/traces-count.test.ts` and `migrator.test.ts` reproduce identically on the unmodified base branch (verified via stash roundtrip) and are out of scope for this bug. TypeScript `tsc --noEmit` is clean for both `tsconfig.json` and `tsconfig.build.json`. The Python backend, LLM client timeouts, and Makefile/CI are unchanged.

Related Issue (Required): Fixes #1808
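To make the backoff rule above easier to review, here is one way the skip decision could look. It is a sketch under stated assumptions: the PR only says "exponential backoff (1h → 24h cap)" after `MAX_DIRTY_REWARD_ATTEMPTS=3` failures, so the doubling schedule, the millisecond-epoch representation of `lastFailureAt`, and the helper names `backoffMs`, `shouldAutoRescore`, and `recordFailure` are all hypothetical.

```typescript
// Hypothetical sketch; the doubling schedule and helper names are assumptions.
const MAX_DIRTY_REWARD_ATTEMPTS = 3;          // value taken from the PR
const BASE_BACKOFF_MS = 60 * 60 * 1000;       // 1h, first cooldown
const MAX_BACKOFF_MS = 24 * 60 * 60 * 1000;   // 24h cap

// Shape of meta.rewardDirty; lastFailureAt as epoch milliseconds is assumed.
interface RewardDirtyMeta {
  failedAttempts?: number;
  lastFailureAt?: number;
}

// Cooldown for a given failure count: 0 below the threshold, then 1h, 2h, 4h, ... capped at 24h.
function backoffMs(failedAttempts: number): number {
  if (failedAttempts < MAX_DIRTY_REWARD_ATTEMPTS) return 0;
  const exponent = failedAttempts - MAX_DIRTY_REWARD_ATTEMPTS;
  return Math.min(BASE_BACKOFF_MS * 2 ** exponent, MAX_BACKOFF_MS);
}

// Used by automatic scans only; manual rescoring would bypass this check.
function shouldAutoRescore(meta: RewardDirtyMeta, now: number): boolean {
  const wait = backoffMs(meta.failedAttempts ?? 0);
  if (wait === 0 || meta.lastFailureAt === undefined) return true;
  return now - meta.lastFailureAt >= wait;
}

// Returns updated metadata after a failed automatic rescore.
function recordFailure(meta: RewardDirtyMeta, now: number): RewardDirtyMeta {
  return { failedAttempts: (meta.failedAttempts ?? 0) + 1, lastFailureAt: now };
}
```

Under these assumptions an episode that has failed three times is skipped for one hour after its last failure, and each further failure doubles the wait until the 24h cap is reached.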
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Automated tests are pending.
Checklist
@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.
Reviewer Checklist