Fix #1808: fix: [memos-local-plugin] Dreaming background processing starves Gateway event l by Memtensor-AI · Pull Request #2002 · MemTensor/MemOS

Memtensor-AI · 2026-06-30T17:46:48Z

Description

Fix issue #1808 — memos-local-plugin no longer starves the OpenClaw Gateway event loop during plugin bootstrap.

Root cause: core.init() in apps/memos-local-plugin/core/pipeline/memory-core.ts synchronously awaited the full reflect → reward → L2-induction chain for every orphan / dirty closed episode in SQLite. On the reporter's 30K-trace / 324MB database that pinned the Node.js event loop for 3-5s+, causing OpenClaw's 3s WebSocket read probe to time out so TUI and Control UI clients could not connect. A secondary symptom — failed LLM rescores being retried indefinitely with no backoff — kept the loop alive across restarts and the 10-minute periodic rescan.

Fix delivered on bugfix/autodev-1808 (commit 26c96b0):

(1) Split init() into a fast synchronous orphan / dirty classification and a background startupRecoveryPromise that runs the slow reflect/reward chain off the Gateway's critical path.
(2) Added an optional MemoryCore.waitForStartupRecovery?() method so tests and one-shot batch tools can opt back into the historic "await everything" semantics. The promise is contracted never to reject.
(3) shutdown() now awaits the background promise before tearing down storage so SQLite is not closed mid-flush.
(4) Introduced per-episode failure tracking under meta.rewardDirty. After MAX_DIRTY_REWARD_ATTEMPTS=3 consecutive failed automatic rescores the episode enters an exponential backoff (1h → 24h cap), and the init + periodic scans skip it until the cooldown elapses. Manual feedback / runManually still rescore unconditionally.
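The split described in items (1) to (3) can be sketched roughly as follows. This is a minimal illustration, not the code in memory-core.ts: the class name `MemoryCoreSketch`, the `Episode` type, the `recoverOne` callback, and the `recovered` / `storageOpen` fields are all hypothetical stand-ins. Only `init`, `shutdown`, `waitForStartupRecovery`, `startupRecoveryPromise`, and the `init.background_recovery_failed` log channel are names taken from this PR.

```typescript
// Hypothetical sketch of the init() split; not the actual memory-core.ts code.
type Episode = { id: string; dirty: boolean };

class MemoryCoreSketch {
  private startupRecoveryPromise: Promise<void> = Promise.resolve();
  private readonly episodes: Episode[];
  // Stand-in for the slow reflect -> reward -> L2-induction chain.
  private readonly recoverOne: (ep: Episode) => Promise<void>;
  readonly recovered: string[] = [];
  storageOpen = true;

  constructor(episodes: Episode[], recoverOne: (ep: Episode) => Promise<void>) {
    this.episodes = episodes;
    this.recoverOne = recoverOne;
  }

  // Fast path: classify only, then start the slow work without awaiting it.
  async init(): Promise<void> {
    const pending = this.episodes.filter((ep) => ep.dirty);
    this.startupRecoveryPromise = this.runRecovery(pending).catch((err) => {
      // Contract: the background promise never rejects; failures are logged.
      console.error("init.background_recovery_failed", err);
    });
  }

  private async runRecovery(pending: Episode[]): Promise<void> {
    for (const ep of pending) {
      await this.recoverOne(ep);
      this.recovered.push(ep.id);
    }
  }

  // Opt back into "await everything" semantics (tests, one-shot batch tools).
  waitForStartupRecovery(): Promise<void> {
    return this.startupRecoveryPromise;
  }

  // Drain background recovery before closing storage.
  async shutdown(): Promise<void> {
    await this.startupRecoveryPromise;
    this.storageOpen = false;
  }
}
```

In this sketch a caller that only does `await core.init()` returns as soon as classification finishes, while a test that needs the recovered state calls `await core.waitForStartupRecovery()` afterwards. The sketch assumes each recovery step yields to the event loop (for example while awaiting an LLM call or I/O); a step that is CPU-bound and synchronous would still block.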

Tests: tests/unit/pipeline/memory-core.test.ts now 32/32 green (3 existing orphan/dirty tests updated to await the new promise; 4 net new tests covering init latency contract, backoff filter, failure-counter logic, and shutdown synchronization). The pipeline + bridge + adapters slice passes 127/127. Full unit sweep is 1044/1047 — the 2 unrelated failures in tests/unit/storage/traces-count.test.ts and migrator.test.ts reproduce identically on the unmodified base branch (verified via stash roundtrip) and are out of scope for this bug. TypeScript tsc --noEmit is clean for both tsconfig.json and tsconfig.build.json. The Python backend, LLM client timeouts, and Makefile/CI are unchanged.

Related Issue (Required): Fixes #1808

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Refactor (does not change functionality, e.g. code style improvements, linting)
Documentation update

How Has This Been Tested?

Automated tests are pending.

Unit Test
Test Script Or Test Steps (please provide)
Pipeline Automated API Test (please provide)

Checklist

I have performed a self-review of my own code
I have commented my code in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works
I have created related documentation issue/PR in MemOS-Docs (if applicable)
I have linked the issue to this PR (if applicable)
I have mentioned the person who will review this PR

@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.

Reviewer Checklist

closes #1808 (fix: [memos-local-plugin] Dreaming background processing starves Gateway event loop, causing WebSocket timeouts)
Made sure Checks passed
Tests have been provided

…op + backoff failed dirty rescores (#1808)

`core.init()` in memos-local-plugin used to synchronously await the entire reflect → reward → L2-induction chain for every orphan and dirty closed episode discovered in SQLite. On the 30K-trace / 324MB database described in issue #1808 this blocked the OpenClaw Gateway's main Node.js event loop for 3-5s+, well past the 3s WebSocket read-probe budget, so TUI and Control UI clients could no longer connect.

The fix:

* Split `init()` into a fast synchronous classification (lightweight orphan close + `topicState=interrupted` meta updates + stale / dirty bucket assembly) and a background `startupRecoveryPromise` that runs the slow reflect/reward chain off the critical path. Adapter `await core.init()` now returns in milliseconds regardless of the database size.
* Expose `MemoryCore.waitForStartupRecovery?()` so tests and one-shot batch tools can opt back into the historic "await everything" semantics. The promise is contracted to never reject; failures are logged on the `init.background_recovery_failed` channel.
* `shutdown()` now awaits the promise before tearing down storage so the SQLite handle is not closed mid-flush.
* Add per-episode failure tracking (`meta.rewardDirty.failedAttempts` / `lastFailureAt`) to dirty closed episodes. After `MAX_DIRTY_REWARD_ATTEMPTS=3` consecutive failed automatic rescores the episode enters an exponential backoff (1h → cap 24h) before the init scan / 10-min periodic timer touches it again. Manual feedback / `runManually` still rescore unconditionally. This closes the "retried indefinitely with no backoff" sub-symptom reported on the same issue.

Tests: pipeline+bridge+adapters unit suite 127/127 green, full unit suite 1044/1047 (the 2 unrelated `storage/traces-count` + `storage/migrator` failures reproduce on the base branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
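The backoff filter from the commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged code: the PR gives only the threshold (`MAX_DIRTY_REWARD_ATTEMPTS=3`), the 1h starting cooldown, the 24h cap, and the field names `failedAttempts` / `lastFailureAt`. The doubling factor, the epoch-millisecond timestamp, and the helper names `cooldownMs`, `shouldSkipAutomaticRescore`, and `recordFailure` are assumptions made for the example.

```typescript
// Hypothetical sketch; only the constant name and the two field names come from the PR.
const MAX_DIRTY_REWARD_ATTEMPTS = 3;
const HOUR_MS = 60 * 60 * 1000;
const BASE_COOLDOWN_MS = HOUR_MS; // 1h starting cooldown
const MAX_COOLDOWN_MS = 24 * HOUR_MS; // 24h cap

interface RewardDirtyMeta {
  failedAttempts: number;
  lastFailureAt: number; // assumed to be epoch milliseconds
}

// No cooldown below the threshold; afterwards the cooldown doubles per
// additional failure (assumed factor of 2) until it reaches the cap.
function cooldownMs(failedAttempts: number): number {
  if (failedAttempts < MAX_DIRTY_REWARD_ATTEMPTS) return 0;
  const extra = failedAttempts - MAX_DIRTY_REWARD_ATTEMPTS;
  return Math.min(BASE_COOLDOWN_MS * 2 ** extra, MAX_COOLDOWN_MS);
}

// Used by automatic scans only; a manual rescore would bypass this check.
function shouldSkipAutomaticRescore(
  meta: RewardDirtyMeta | undefined,
  now: number,
): boolean {
  if (!meta) return false;
  return now - meta.lastFailureAt < cooldownMs(meta.failedAttempts);
}

// Returns the updated tracking record after a failed automatic rescore.
function recordFailure(
  meta: RewardDirtyMeta | undefined,
  now: number,
): RewardDirtyMeta {
  return { failedAttempts: (meta?.failedAttempts ?? 0) + 1, lastFailureAt: now };
}
```

Under these assumptions the third consecutive failure starts a 1h cooldown, the fourth a 2h cooldown, and so on up to 24h. Resetting the counter after a successful rescore is left out of the sketch.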

Memtensor-AI · 2026-06-30T17:47:26Z

✅ Automated Test Results: PASSED

No applicable test scope for the changed files — automated tests skipped. Changed paths do not map to any configured scope (env.yaml source_mapping). Manual review recommended.

Branch: bugfix/autodev-1808

Memtensor-AI · 2026-07-01T09:20:57Z

❌ Automated Test Results: FAILED

Auto-fix retry 1/2 triggered.

Failed tests:

test_out_of_range_rejected_or_clamped_to_valid[negative_1]
test_out_of_range_rejected_or_clamped_to_valid[negative_60s]
test_out_of_range_rejected_or_clamped_to_valid[negative_one_day]
test_out_of_range_rejected_or_clamped_to_valid[max_plus_1]
test_out_of_range_rejected_or_clamped_to_valid[max_plus_one_day]
test_out_of_range_rejected_or_clamped_to_valid[hundred_x_max]
test_invalid_type_does_not_crash_or_corrupt[string_number]
test_invalid_type_does_not_crash_or_corrupt[string_text]
test_invalid_type_does_not_crash_or_corrupt[none_value]
test_invalid_type_does_not_crash_or_corrupt[dict_value]

Error details

The vector_scan max_age setter accepts invalid values (negative numbers, values exceeding the documented max, and non-numeric types like strings/None/dicts) and persists them verbatim instead of rejecting or clamping to the [0, 31536000000] ms range. AI-generated tests: 30/30 passed.
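A validation step of the kind these failing tests expect might be sketched as below. The function name `parseMaxAgeMs` and the choice to reject rather than clamp are assumptions for illustration; the only details taken from the report are the [0, 31536000000] ms range and the categories of invalid input (negative, above max, string, None, dict).

```typescript
// Hypothetical validator; the real setter and its error types are not shown in this PR.
const MAX_AGE_MIN_MS = 0;
const MAX_AGE_MAX_MS = 31536000000; // upper bound quoted in the test report

// Returns the value unchanged when valid, otherwise throws so that the
// caller never persists an invalid setting.
function parseMaxAgeMs(value: unknown): number {
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new TypeError("max_age must be a finite number of milliseconds");
  }
  if (value < MAX_AGE_MIN_MS || value > MAX_AGE_MAX_MS) {
    throw new RangeError(
      `max_age must be within [${MAX_AGE_MIN_MS}, ${MAX_AGE_MAX_MS}] ms`,
    );
  }
  return value;
}
```

Calling the validator before writing means a rejected input leaves the previously stored value intact. Clamping to the nearest bound would also satisfy the "rejected or clamped" wording of the test names.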

Branch: bugfix/autodev-1808

Memtensor-AI · 2026-07-02T09:24:31Z

Automated Test Results: PASSED

Cloud test-engine rerun after resolving the dev-v2.0.22 merge conflict.

Run: tr-9960eb48-574
Scope: memos_local_plugin
Result: 33/33 tests passed
Command group: memos_local_plugin/unit
Duration: 11s

Local pre-push verification also passed: npm run build, plus focused vitest for memory-core startup recovery/backoff tests including the #1966 ghost-trace guard.

Status: merge conflict resolved; automated scope test passed. Manual code review is still required before merge.

Memtensor-AI assigned CarltonXiang, MatthewZhuang, syzsunshine219 and World-controller Jun 30, 2026

Memtensor-AI requested review from CarltonXiang, MatthewZhuang, World-controller and syzsunshine219 June 30, 2026 17:46

Memtensor-AI added the ai-generated and bug (Something isn't working) labels Jun 30, 2026

Memtensor-AI mentioned this pull request Jun 30, 2026

fix: [memos-local-plugin] Dreaming background processing starves Gateway event loop, causing WebSocket timeouts #1808

Closed

5 tasks

syzsunshine219 changed the base branch from dev-20260624-v2.0.22 to dev-v2.0.22 July 1, 2026 07:14

CarltonXiang added the plugin (Plugin/adapter/bridge layer, apps/ directory) label Jul 2, 2026

Merge dev-v2.0.22 into autodev-1808

424c6dd

Memtensor-AI mentioned this pull request Jul 2, 2026

Fix #1776: Startup blocked for ~100s by synchronous orphan episode recovery + LLM timeout c #2000

Closed

17 tasks

syzsunshine219 merged commit 4e16c68 into dev-v2.0.22 Jul 2, 2026
16 checks passed

syzsunshine219 deleted the bugfix/autodev-1808 branch July 2, 2026 11:27

syzsunshine219 merged 2 commits into dev-v2.0.22 from bugfix/autodev-1808
