Fix #2028: fix: memory error #2036
…bridge

Ports the archived MemTensor#1722 fix onto dev-20260624-v2.0.22, which never received it. Three cooperating defects in the Hermes MemOS Python adapter turn "Node bridge subprocess dies quietly" into "every memory tool times out for 30 s and then fails forever":

1. `MemosBridgeClient._read_loop` iterated `for line in stdout:` and returned silently on EOF, leaving pending JSON-RPC waiters parked on their per-request timeout. Now wrapped in try/finally with a new `_abort_pending()` helper that wakes every waiter with `transport_closed` in under a second.
2. `MemosBridgeClient.request` now checks `Popen.poll()` before writing, so a request against an already-exited subprocess fast-fails with `transport_closed` instead of buffering into a dead pipe.
3. `MemTensorProvider` keepalive previously reconnected only on `transport_closed`; a hung bridge (`BridgeError("timeout", …)`) was dropped at DEBUG. New helper `_should_reconnect_after_keepalive_failure()` reconnects on timeout, transport_closed, or any error observed while the subprocess is already dead.
4. Read-path tools (`memos_search`, `memos_get`, `memos_timeline`, `memos_environment`, `memos_skill_list`, `memos_skill_get`) now route through `_bridge_request_with_retry()` that reconnects and retries once on `transport_closed`, mirroring the `sync_turn` pattern.

Verified: 33 bridge tests + 16 pipeline tests pass; combined run 49 tests OK in 0.28 s (was 32 s pre-fix; direct evidence the reader-thread abort actually wakes pending waiters). Ruff check + format clean. No wire-protocol change, no new dependencies.

Fixes MemTensor#2028
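For reviewers who want the shape of items 1 and 2 without opening the diff, here is a minimal sketch of the pattern. It is not the code from this PR: `SketchBridgeClient`, `_Waiter`, `register`, and the `poll` callable (a stand-in for `Popen.poll()`) are hypothetical names chosen for illustration, and the JSON-RPC parsing is omitted.

```python
import io
import threading


class BridgeError(Exception):
    """Illustrative error type carrying a short error code (assumed shape)."""

    def __init__(self, code, message=""):
        super().__init__(f"[{code}] {message}")
        self.code = code


class _Waiter:
    """One parked request: an event plus a slot for the eventual error."""

    def __init__(self):
        self.done = threading.Event()
        self.error = None


class SketchBridgeClient:
    def __init__(self, stdout, poll=lambda: None):
        self._stdout = stdout  # line-iterable stream from the subprocess
        self._poll = poll      # stand-in for Popen.poll(); None means "still running"
        self._lock = threading.Lock()
        self._pending = {}     # request id -> _Waiter

    def _abort_pending(self, reason):
        # Wake every parked waiter so none of them sits out its full timeout.
        with self._lock:
            waiters = list(self._pending.values())
            self._pending.clear()
        for waiter in waiters:
            waiter.error = BridgeError("transport_closed", reason)
            waiter.done.set()

    def _read_loop(self):
        try:
            for _line in self._stdout:
                pass  # a real client would parse the response and resolve its waiter
        finally:
            # Runs on EOF and on unexpected exceptions alike.
            self._abort_pending("bridge stdout closed")

    def register(self, request_id):
        if self._poll() is not None:
            # Subprocess already exited: fail fast instead of writing to a dead pipe.
            raise BridgeError("transport_closed", "bridge process has exited")
        waiter = _Waiter()
        with self._lock:
            self._pending[request_id] = waiter
        return waiter
```

With an empty `io.StringIO("")` as the stream, the reader loop hits EOF immediately and the `finally` clause marks a registered waiter as `transport_closed`, which is the behavior the commit message credits for the drop in test runtime.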
🤖 Open Code Review
Target: PR #2036
🔍 OpenCodeReview found 7 issue(s) in this PR.
1.
…-2028

# Conflicts:
# apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py
Automated Test Results: PASSED
Cloud test-engine rerun against
Local verification during conflict resolution also passed:
Manual code review is still required before merge.
Automated Test Results: PASSED
Cloud test-engine rerun after resolving the
The PR is now mergeable on GitHub. Manual review is still required before merge.
Description
Fixes #2028 by recovering the Hermes MemOS plugin from a hung Node bridge subprocess. Issue #2028 has a byte-for-byte identical error signature to already-solved #1722 (`[timeout] memory.search did not respond within 30.0s` + `[timeout] turn.end did not respond within 30.0s`), but the #1722 fix was never merged onto base branch `dev-20260624-v2.0.22`; the current base still carries all three cooperating defects. This change ports that fix forward.

The fix addresses three Python-adapter defects that turn "Node bridge subprocess dies quietly" into "every memory tool times out for 30 s and then fails forever". First, `MemosBridgeClient._read_loop` was wrapped in try/finally that calls the new `_abort_pending()` helper on any exit, waking every parked JSON-RPC waiter with `transport_closed` in under a second (previously they parked on the full per-request timeout). Second, `MemosBridgeClient.request` now checks `Popen.poll()` before writing, so a request against an already-exited subprocess fast-fails instead of buffering into a dead pipe. Third, the keepalive loop was routed through a new `_should_reconnect_after_keepalive_failure()` predicate that reconnects on `BridgeError("timeout", …)` and dead-subprocess in addition to the pre-existing `transport_closed` trigger. Finally, all six read-path memory tools (`memos_search`, `memos_get`, `memos_timeline`, `memos_environment`, `memos_skill_list`, `memos_skill_get`) were routed through the new `_bridge_request_with_retry()` helper that reconnects + retries once on `transport_closed`, mirroring the existing `sync_turn` pattern.

Verified with 7 new unit tests (2 at bridge level: reader-exit abort + poll-based fast-fail; 3 at keepalive predicate level: timeout / dead-subprocess / no-storm-on-transient; 2 at read-path retry level: reconnect-and-recover + retry-still-fails-surfaces-error). Combined `test_bridge_client` + `test_hermes_provider_pipeline` run: 49 tests pass in 0.28 s, down from 32.19 s pre-fix, which is direct evidence that the reader-thread abort actually wakes pending waiters. `ruff check` and `ruff format --check` clean across the plugin and its tests.

Scope is Python-only: no wire-protocol change, no dependency changes, no Node bridge changes, no config changes. The per-request 30 s timeout budget is preserved for genuinely slow calls. Shutdown races are unchanged (the existing `_bridge_keepalive_stop` gate still guards reconnects).

Reviewers: @MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller.

Related Issue (Required): Fixes #2028
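As a rough illustration of the keepalive predicate and the read-path retry described above, the two rules can be sketched as plain functions. This is an assumption-laden sketch, not the PR's implementation: the function signatures, the `send` and `reconnect` callables, and the `code` attribute on `BridgeError` are hypothetical stand-ins for whatever the adapter actually uses.

```python
class BridgeError(Exception):
    """Illustrative error type carrying a short error code (assumed shape)."""

    def __init__(self, code, message=""):
        super().__init__(f"[{code}] {message}")
        self.code = code


def should_reconnect_after_keepalive_failure(error, process_exited):
    """Reconnect on a hung bridge, a closed transport, or a dead subprocess."""
    if process_exited:
        return True  # any error seen while the subprocess is already dead
    # Other transient errors do not trigger a reconnect, avoiding a reconnect storm.
    return getattr(error, "code", None) in {"timeout", "transport_closed"}


def bridge_request_with_retry(send, reconnect, method, params):
    """Call send(); on transport_closed, reconnect and retry exactly once."""
    try:
        return send(method, params)
    except BridgeError as exc:
        if exc.code != "transport_closed":
            raise  # timeouts and other errors surface unchanged
    reconnect()
    return send(method, params)  # a second failure propagates to the caller
```

Keeping the retry limited to `transport_closed` leaves the 30 s budget intact for genuinely slow calls, since a timeout is raised straight back to the caller and is never retried here.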
Type of change
How Has This Been Tested?
Automated tests are pending.
Checklist
@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.
Reviewer Checklist