Fix #2028: fix: memory error by Memtensor-AI · Pull Request #2036 · MemTensor/MemOS

Memtensor-AI · 2026-07-02T06:35:33Z

Description

Fixes #2028 by recovering the Hermes MemOS plugin from a hung Node bridge subprocess. Issue #2028 has a byte-for-byte identical error signature to already-solved #1722 ([timeout] memory.search did not respond within 30.0s + [timeout] turn.end did not respond within 30.0s), but the #1722 fix was never merged onto base branch dev-20260624-v2.0.22; the current base still carries all three cooperating defects. This change ports that fix forward.

The fix addresses three Python-adapter defects that turn "Node bridge subprocess dies quietly" into "every memory tool times out for 30 s and then fails forever". First, MemosBridgeClient._read_loop was wrapped in try/finally that calls the new _abort_pending() helper on any exit, waking every parked JSON-RPC waiter with transport_closed in under a second (previously they parked on the full per-request timeout). Second, MemosBridgeClient.request now checks Popen.poll() before writing, so a request against an already-exited subprocess fast-fails instead of buffering into a dead pipe. Third, the keepalive loop was routed through a new _should_reconnect_after_keepalive_failure() predicate that reconnects on BridgeError("timeout", …) and dead-subprocess in addition to the pre-existing transport_closed trigger. Finally, all six read-path memory tools (memos_search, memos_get, memos_timeline, memos_environment, memos_skill_list, memos_skill_get) were routed through the new _bridge_request_with_retry() helper that reconnects + retries once on transport_closed, mirroring the existing sync_turn pattern.

Verified with 7 new unit tests (2 at bridge level: reader-exit abort + poll-based fast-fail; 3 at keepalive predicate level: timeout / dead-subprocess / no-storm-on-transient; 2 at read-path retry level: reconnect-and-recover + retry-still-fails-surfaces-error). Combined test_bridge_client + test_hermes_provider_pipeline run: 49 tests pass in 0.28 s — down from 32.19 s pre-fix, direct evidence that the reader-thread abort actually wakes pending waiters. ruff check and ruff format --check clean across the plugin and its tests.

Scope is Python-only: no wire-protocol change, no dependency changes, no Node bridge changes, no config changes. The per-request 30 s timeout budget is preserved for genuinely slow calls. Shutdown races are unchanged (the existing _bridge_keepalive_stop gate still guards reconnects). Reviewers: @MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller.

Related Issue (Required): Fixes #2028

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Refactor (does not change functionality, e.g. code style improvements, linting)
Documentation update

How Has This Been Tested?

Automated tests are pending.

Unit Test
Test Script Or Test Steps (please provide)
Pipeline Automated API Test (please provide)

Checklist

I have performed a self-review of my own code
I have commented my code in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works
I have created related documentation issue/PR in MemOS-Docs (if applicable)
I have linked the issue to this PR (if applicable)
I have mentioned the person who will review this PR

@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.

Reviewer Checklist

closes fix: memory error #2028
Made sure Checks passed
Tests have been provided

…bridge Ports the archived MemTensor#1722 fix onto dev-20260624-v2.0.22, which never received it. Three cooperating defects in the Hermes MemOS Python adapter turn "Node bridge subprocess dies quietly" into "every memory tool times out for 30 s and then fails forever": 1. `MemosBridgeClient._read_loop` iterated `for line in stdout:` and returned silently on EOF, leaving pending JSON-RPC waiters parked on their per-request timeout. Now wrapped in try/finally with a new `_abort_pending()` helper that wakes every waiter with `transport_closed` in under a second. 2. `MemosBridgeClient.request` now checks `Popen.poll()` before writing, so a request against an already-exited subprocess fast- fails with `transport_closed` instead of buffering into a dead pipe. 3. `MemTensorProvider` keepalive previously reconnected only on `transport_closed`; a hung bridge (`BridgeError("timeout", …)`) was dropped at DEBUG. New helper `_should_reconnect_after_keepalive_failure()` reconnects on timeout, transport_closed, or any error observed while the subprocess is already dead. 4. Read-path tools (`memos_search`, `memos_get`, `memos_timeline`, `memos_environment`, `memos_skill_list`, `memos_skill_get`) now route through `_bridge_request_with_retry()` that reconnects and retries once on `transport_closed`, mirroring the `sync_turn` pattern. Verified: 33 bridge tests + 16 pipeline tests pass; combined run 49 tests OK in 0.28 s (was 32 s pre-fix — direct evidence the reader- thread abort actually wakes pending waiters). Ruff check + format clean. No wire-protocol change, no new dependencies. Fixes MemTensor#2028

Memtensor-AI · 2026-07-02T06:43:44Z

🤖 Open Code Review

Target: PR #2036
Task: f7f2ab26022eef23
Base: dev-20260624-v2.0.22
Head: bugfix/autodev-2028

🔍 OpenCodeReview found 7 issue(s) in this PR.

1. `apps/memos-local-plugin/adapters/hermes/memos_provider/init.py` (L1753-L1757)

Fragile assertion after _reconnect_bridge during shutdown.

_reconnect_bridge has two early-return paths (lines ~1798, ~1804) that exit without setting self._bridge when _bridge_keepalive_stop is set (i.e., during shutdown). If a tool call races with shutdown — the transport closes, the retry enters here, and _reconnect_bridge returns early — self._bridge remains None and the assert raises AssertionError with no diagnostic detail.

Although handle_tool_call wraps the whole dispatch in except Exception, the error surfaced to the model would be the opaque string "AssertionError". Replace the assert with an explicit, descriptive raise:

self._reconnect_bridge(self._session_id, timeout=30.0)
if self._bridge is None:
    raise BridgeError("transport_closed", "bridge unavailable after reconnect attempt")
kwargs = {} if timeout is None else {"timeout": timeout}
return self._bridge.request(method, params, **kwargs)

2. `apps/memos-local-plugin/adapters/hermes/memos_provider/init.py` (L1789)

Tight coupling to a private BridgeClient internal (_proc).

Accessing bridge._proc directly ties MemTensorProvider to the internal implementation detail of MemosBridgeClient. The type: ignore[attr-defined] comment acknowledges this already. If MemosBridgeClient is refactored (e.g., _proc renamed or wrapped), this silently returns exit_code = None (due to the broad except Exception) and the dead-subprocess detection disappears without any warning — the very scenario this logic is meant to handle.

bridge_client.py already calls self._proc.poll() internally in the newly added Layer 2 check. Consider exposing a public is_alive() -> bool method on MemosBridgeClient:

# In MemosBridgeClient:
def is_alive(self) -> bool:
    """Return True if the bridge subprocess is still running."""
    with contextlib.suppress(Exception):
        return self._proc.poll() is None
    return False

Then here:

if bridge is not None and hasattr(bridge, "is_alive") and not bridge.is_alive():
    return True

3. `apps/memos-local-plugin/adapters/hermes/memos_provider/init.py` (L1740-L1757)

Four-way conditional duplication of the timeout branch.

The if timeout is None / else guard is repeated identically in both the initial call and the retry, yielding four nearly identical branches. This makes the method harder to read and maintain — any future change to the request signature must be applied in four places.

Simplify with a kwargs dict:

kwargs: dict[str, Any] = {} if timeout is None else {"timeout": timeout}

assert self._bridge is not None
try:
    return self._bridge.request(method, params, **kwargs)
except BridgeError as err:
    if not self._is_transport_closed(err):
        raise
    logger.info(
        "MemOS: bridge transport closed on %s; reconnecting and retrying once — %s",
        method,
        err,
    )
    self._reconnect_bridge(self._session_id, timeout=30.0)
    if self._bridge is None:
        raise BridgeError("transport_closed", "bridge unavailable after reconnect attempt")
    return self._bridge.request(method, params, **kwargs)

4. `apps/memos-local-plugin/tests/python/test_bridge_client.py` (L361-L367)

Flaky race condition: time.sleep(0.05) is not a reliable synchronisation barrier. The goal is to ensure client.request() has registered its waiter into _pending before _done = True kills the reader loop. On a loaded CI runner, the 50 ms window can be insufficient — the worker thread may not yet have acquired self._lock and inserted its entry into _pending, causing the reader's _abort_pending() finally: block to fire on an empty dict. The waiter is then enqueued after the reader exits, so no one ever sets its event; the worker parks for 10 s until its own per-request timeout fires. assertFalse(worker.is_alive()) then fails after 2 s.

Replace the sleep with an explicit synchronisation primitive. For example, add a short-circuit threading.Event that is set by a thin monkeypatch after _pending registration:

_registered = threading.Event()
_orig_request = client.request
def _instrumented(*args, **kwargs):
    result = _orig_request(*args, **kwargs)
    return result
# OR: monkeypatch _pending insertion via a subclass hook

Alternatively, expose a test-only _pending_event in MemosBridgeClient, or convert the test to a synchronous unit test on _abort_pending directly.

5. `apps/memos-local-plugin/tests/python/test_bridge_client.py` (L67-L69)

Class-level mutable attribute is fragile: poll_return is declared at class level (outside __init__). While self._fake.poll_return = 137 does create an instance-level shadow and won't affect other existing instances, the default None lives in the class dict rather than on each instance. If a future test creates multiple FakePopen instances, mutates poll_return on the class (e.g. FakePopen.poll_return = 1), or forgets to assign at the instance level, all instances silently share the mutation.

Initialise it in __init__ alongside the other instance attributes:

self.poll_return: int | None = None

6. `apps/memos-local-plugin/tests/python/test_bridge_client.py` (L361-L367)

Race condition: time.sleep(0.05) is not a reliable synchronisation barrier. The goal is to ensure client.request() has registered its waiter in _pending before _done = True terminates the reader loop. On a loaded CI runner the 50 ms window can be too short — the worker thread may not yet have acquired self._lock and inserted its entry, so the reader's _abort_pending() finally: block fires on an empty dict. The waiter is then enqueued after the reader has already exited; no one ever sets its event, and the worker parks for the full 10 s per-request timeout. assertFalse(worker.is_alive()) then fails after 2 s.

Replace the sleep with an explicit synchronisation primitive. For example, expose a threading.Event that fires inside MemosBridgeClient.request() just after the _pending insertion (as a test-only hook), or drive the test synchronously by testing _abort_pending() directly.

7. `apps/memos-local-plugin/tests/python/test_bridge_client.py` (L67-L69)

Class-level mutable attribute: poll_return is declared at class scope, outside __init__. While self._fake.poll_return = 137 does create an instance-level shadow (so it won't affect other existing instances), the default None lives in the class dict rather than on each instance. A future test that assigns directly on the class (FakePopen.poll_return = 1) or forgets the instance-level assignment would silently affect all instances.

Initialise it in __init__ alongside the other instance attributes:

self.poll_return: int | None = None

Generated by cloud-assistant via Open Code Review.

…-2028 # Conflicts: # apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py

Memtensor-AI · 2026-07-02T10:10:08Z

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully after resolving the merge conflict.

Run: tr-5cdd93d5-273 on cloud test-engine 10012
Scope: memos_local_plugin/changed-repo-python
Result: 41 passed, 0 failed, 0 skipped

Local verification during conflict resolution also passed: apps/memos-local-plugin/tests/python/test_bridge_client.py 41/41.

Manual code review is still required before merge.

Memtensor-AI · 2026-07-02T12:56:07Z

Automated Test Results: PASSED

Cloud test-engine rerun after resolving the dev-v2.0.22 merge conflict completed successfully.

Run: tr-084b1a4b-9a3
Head: 9c085ad3455a10a00a08836e1a57380847b8009e
Scopes tested: memos_local_plugin
Result: 47/47 tests passed
Duration: 2s

The PR is now mergeable on GitHub. Manual review is still required before merge.

Memtensor-AI assigned CarltonXiang, MatthewZhuang, syzsunshine219 and World-controller Jul 2, 2026

Memtensor-AI requested review from CarltonXiang, MatthewZhuang, World-controller and syzsunshine219 July 2, 2026 06:35

Memtensor-AI added ai-generated bug Something isn't working | 功能异常 labels Jul 2, 2026

Memtensor-AI mentioned this pull request Jul 2, 2026

fix: memory error #2028

Closed

5 tasks

Memtensor-AI changed the base branch from dev-20260624-v2.0.22 to dev-v2.0.22 July 2, 2026 07:28

Merge remote-tracking branch 'origin/dev-v2.0.22' into bugfix/autodev…

6f783f5

…-2028 # Conflicts: # apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py

Resolve dev-v2.0.22 merge conflict for PR 2036

9c085ad

syzsunshine219 merged commit 1771568 into MemTensor:dev-v2.0.22 Jul 2, 2026
16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix #2028: fix: memory error#2036

Fix #2028: fix: memory error#2036
syzsunshine219 merged 3 commits into
MemTensor:dev-v2.0.22from
Memtensor-AI:bugfix/autodev-2028

Memtensor-AI commented Jul 2, 2026

Uh oh!

Memtensor-AI commented Jul 2, 2026

Uh oh!

Memtensor-AI commented Jul 2, 2026

Uh oh!

Memtensor-AI commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

Memtensor-AI commented Jul 2, 2026

Description

Type of change

How Has This Been Tested?

Checklist

Reviewer Checklist

Uh oh!

Memtensor-AI commented Jul 2, 2026

🤖 Open Code Review

1. apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py (L1753-L1757)

2. apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py (L1789)

3. apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py (L1740-L1757)

4. apps/memos-local-plugin/tests/python/test_bridge_client.py (L361-L367)

5. apps/memos-local-plugin/tests/python/test_bridge_client.py (L67-L69)

6. apps/memos-local-plugin/tests/python/test_bridge_client.py (L361-L367)

7. apps/memos-local-plugin/tests/python/test_bridge_client.py (L67-L69)

Uh oh!

Memtensor-AI commented Jul 2, 2026

Uh oh!

Memtensor-AI commented Jul 2, 2026

Automated Test Results: PASSED

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

1. `apps/memos-local-plugin/adapters/hermes/memos_provider/init.py` (L1753-L1757)

2. `apps/memos-local-plugin/adapters/hermes/memos_provider/init.py` (L1789)

3. `apps/memos-local-plugin/adapters/hermes/memos_provider/init.py` (L1740-L1757)

4. `apps/memos-local-plugin/tests/python/test_bridge_client.py` (L361-L367)

5. `apps/memos-local-plugin/tests/python/test_bridge_client.py` (L67-L69)

6. `apps/memos-local-plugin/tests/python/test_bridge_client.py` (L361-L367)

7. `apps/memos-local-plugin/tests/python/test_bridge_client.py` (L67-L69)