Skip to content

Fix #1897 recovery replay request storm#1939

Open
TianchunGu wants to merge 4 commits into
MemTensor:mainfrom
TianchunGu:codex/fix-1897-recovery-storm
Open

Fix #1897 recovery replay request storm#1939
TianchunGu wants to merge 4 commits into
MemTensor:mainfrom
TianchunGu:codex/fix-1897-recovery-storm

Conversation

@TianchunGu

Copy link
Copy Markdown
Contributor

Summary

Follow-up for #1897, stacked on top of #1938.

#1899 / #1938 stop the terminal-provider-error retry storm at the LLM facade. This PR adds a second guardrail for the capture pipeline: startup recovery of dirty closed episodes can replay a large historical episode, misclassify recovered steps as orphan traces, and then trigger many reflect/summarize LLM calls.

This patch:

  • fetches all episode traces for capture reconciliation instead of using the paginated viewer list() path;
  • matches reflect replay steps by stable content signature, with a timing-insensitive fallback for recovered tool-call traces that lack startedAt/endedAt;
  • skips recovered replay orphan inserts by default (maxRecoveryOrphanInserts: 0) to avoid duplicating historical traces;
  • avoids LLM summarization for any recovered replay orphan insert that an operator explicitly allows;
  • adds a hard per-episode reflect LLM budget (maxReflectLlmCalls, default 128);
  • stops remaining reflect LLM attempts after terminal/circuit-open provider errors.

The goal is to keep normal short-episode reflection behavior intact while preventing a single dirty recovered episode from producing unbounded paid LLM calls.

Validation

From apps/memos-local-plugin:

npx vitest run tests/unit/capture/capture.test.ts tests/unit/capture/capture-batch.test.ts tests/unit/capture/normalizer.test.ts tests/unit/llm/client.test.ts
# Test Files 4 passed (4)
# Tests 57 passed (57)

npm run lint
# tsc -p tsconfig.json --noEmit
# passed

New regression coverage:

  • recovered replay matches tool traces by payload when timestamps drift;
  • startup-recovered replay orphans are not inserted by default;
  • reflect-phase LLM calls are capped per episode.

Related: #1897, #1899, #1938

autodev-bot and others added 4 commits June 8, 2026 20:34
…for terminal provider errors

Issue MemTensor#1897 reported ~12,900 paid LLM requests in 24 h on Hermes
against a DeepSeek key with insufficient balance. The local
`system_model_status` row count (12,900) closely tracked the
provider-side `request_count` (11,344) for the same billing window.
The naming is misleading: `system_model_status` is not a health probe;
it is the audit row written once per LLM call (ok / fallback / error)
inside `core/llm/client.ts`. With no circuit breaker, every pipeline
subscriber (capture / session-relation / reward / L2 / L3 / skill /
retrieval LLM filter / world-model) kept firing on every turn / closed
episode / induction, generating one paid request each.

Add a per-`LlmClient` circuit breaker:

- Trips on terminal errors: HTTP 401/402/403 or messages containing
  `insufficient balance` / `invalid api key` / `unauthorized` /
  `account suspended` / `billing`.
- Open: short-circuits subsequent calls inside the facade without
  contacting the provider. Throws `MemosError(LLM_UNAVAILABLE)` with
  `details.circuitOpen=true` so existing catch blocks still work.
- Half-open after cool-down (default 5 min, configurable, min 30 s):
  next call probes the provider; success closes the breaker, terminal
  failure re-opens it for another cool-down.
- Host fallback rescues a call without tripping the breaker —
  fallback exists precisely to keep going when the primary is down.
- Coalesces `system_model_status="circuit_open"` audit rows to at
  most one per ~25 s while the breaker stays open, so we don't
  replace paid spam with audit-row spam.
- Exposes `circuitOpen` / `circuitOpenUntil` / `circuitOpenedReason`
  via `LlmClientStats` for the Overview viewer card.
- Enabled by default; legacy behaviour available via
  `circuitBreaker.enabled = false`.

Tests: 9 new vitest cases under `tests/unit/llm/client.test.ts`
covering trip on 402, trip on "insufficient balance" message, no
trip on generic transient, coalescing, half-open close on success,
host-fallback rescues without trip, disabled mode, stats fields,
and re-open on terminal probe failure. All 59 LLM and 28 pipeline
tests pass; `tsc --noEmit` clean.

Out of scope (tracked separately): 429 `Retry-After` handling
(issue MemTensor#1620), per-tool rate limits, daily budget caps.
@Memtensor-AI

Copy link
Copy Markdown
Collaborator

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully.

  • Run: tr-f54600e4-cbf on cloud test-engine 10011
  • memos_local_plugin/unit: 46 passed, 0 failed, 0 skipped

Manual code review is still required before merge.

@CarltonXiang CarltonXiang deleted the branch MemTensor:main July 3, 2026 07:25
@syzsunshine219 syzsunshine219 reopened this Jul 3, 2026
@syzsunshine219 syzsunshine219 added the needs-audit Requires manual audit before merge label Jul 3, 2026
@syzsunshine219 syzsunshine219 changed the base branch from dev-v2.0.22 to main July 3, 2026 08:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-audit Requires manual audit before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants