Fix #1897 recovery replay request storm by TianchunGu · Pull Request #1939 · MemTensor/MemOS

TianchunGu · 2026-06-17T06:57:06Z

Summary

Follow-up for #1897, stacked on top of #1938.

#1899 / #1938 stop the terminal-provider-error retry storm at the LLM facade. This PR adds a second guardrail for the capture pipeline: startup recovery of dirty closed episodes can replay a large historical episode, misclassify recovered steps as orphan traces, and then trigger many reflect/summarize LLM calls.

This patch:

fetches all episode traces for capture reconciliation instead of using the paginated viewer list() path;
matches reflect replay steps by stable content signature, with a timing-insensitive fallback for recovered tool-call traces that lack startedAt/endedAt;
skips recovered replay orphan inserts by default (maxRecoveryOrphanInserts: 0) to avoid duplicating historical traces;
avoids LLM summarization for any recovered replay orphan insert that an operator explicitly allows;
adds a hard per-episode reflect LLM budget (maxReflectLlmCalls, default 128);
stops remaining reflect LLM attempts after terminal/circuit-open provider errors.

The goal is to keep normal short-episode reflection behavior intact while preventing a single dirty recovered episode from producing unbounded paid LLM calls.

Validation

From apps/memos-local-plugin:

npx vitest run tests/unit/capture/capture.test.ts tests/unit/capture/capture-batch.test.ts tests/unit/capture/normalizer.test.ts tests/unit/llm/client.test.ts
# Test Files 4 passed (4)
# Tests 57 passed (57)

npm run lint
# tsc -p tsconfig.json --noEmit
# passed

New regression coverage:

recovered replay matches tool traces by payload when timestamps drift;
startup-recovered replay orphans are not inserted by default;
reflect-phase LLM calls are capped per episode.

Related: #1897, #1899, #1938

…for terminal provider errors Issue MemTensor#1897 reported ~12,900 paid LLM requests in 24 h on Hermes against a DeepSeek key with insufficient balance. The local `system_model_status` row count (12,900) closely tracked the provider-side `request_count` (11,344) for the same billing window. The naming is misleading: `system_model_status` is not a health probe; it is the audit row written once per LLM call (ok / fallback / error) inside `core/llm/client.ts`. With no circuit breaker, every pipeline subscriber (capture / session-relation / reward / L2 / L3 / skill / retrieval LLM filter / world-model) kept firing on every turn / closed episode / induction, generating one paid request each. Add a per-`LlmClient` circuit breaker: - Trips on terminal errors: HTTP 401/402/403 or messages containing `insufficient balance` / `invalid api key` / `unauthorized` / `account suspended` / `billing`. - Open: short-circuits subsequent calls inside the facade without contacting the provider. Throws `MemosError(LLM_UNAVAILABLE)` with `details.circuitOpen=true` so existing catch blocks still work. - Half-open after cool-down (default 5 min, configurable, min 30 s): next call probes the provider; success closes the breaker, terminal failure re-opens it for another cool-down. - Host fallback rescues a call without tripping the breaker — fallback exists precisely to keep going when the primary is down. - Coalesces `system_model_status="circuit_open"` audit rows to at most one per ~25 s while the breaker stays open, so we don't replace paid spam with audit-row spam. - Exposes `circuitOpen` / `circuitOpenUntil` / `circuitOpenedReason` via `LlmClientStats` for the Overview viewer card. - Enabled by default; legacy behaviour available via `circuitBreaker.enabled = false`. Tests: 9 new vitest cases under `tests/unit/llm/client.test.ts` covering trip on 402, trip on "insufficient balance" message, no trip on generic transient, coalescing, half-open close on success, host-fallback rescues without trip, disabled mode, stats fields, and re-open on terminal probe failure. All 59 LLM and 28 pipeline tests pass; `tsc --noEmit` clean. Out of scope (tracked separately): 429 `Retry-After` handling (issue MemTensor#1620), per-tool rate limits, daily budget caps.

Memtensor-AI · 2026-07-02T11:10:52Z

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully.

Run: tr-f54600e4-cbf on cloud test-engine 10011
memos_local_plugin/unit: 46 passed, 0 failed, 0 skipped

Manual code review is still required before merge.

autodev-bot and others added 4 commits June 8, 2026 20:34

Merge branch 'dev-20260604-v2.0.19' into bugfix/autodev-1897

f3583fb

Fix LLM breaker with host fallback

b9b2457

fix(plugin): bound recovery reflect replay LLM calls

5316950

TianchunGu mentioned this pull request Jun 17, 2026

bug(memos-local-plugin): Hermes background model status checks can trigger paid LLM request storm #1897

Closed

Memtensor-AI changed the base branch from bugfix/autodev-1897 to dev-v2.0.22 July 1, 2026 13:15

CarltonXiang deleted the branch MemTensor:main July 3, 2026 07:25

CarltonXiang closed this Jul 3, 2026

syzsunshine219 reopened this Jul 3, 2026

syzsunshine219 added the needs-audit Requires manual audit before merge label Jul 3, 2026

syzsunshine219 changed the base branch from dev-v2.0.22 to main July 3, 2026 08:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix #1897 recovery replay request storm#1939

Fix #1897 recovery replay request storm#1939
TianchunGu wants to merge 4 commits into
MemTensor:mainfrom
TianchunGu:codex/fix-1897-recovery-storm

TianchunGu commented Jun 17, 2026

Uh oh!

Memtensor-AI commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

TianchunGu commented Jun 17, 2026

Summary

Validation

Uh oh!

Memtensor-AI commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants