Skip to content

Fix #2028: fix: memory error#2036

Merged
syzsunshine219 merged 3 commits into
MemTensor:dev-v2.0.22from
Memtensor-AI:bugfix/autodev-2028
Jul 2, 2026
Merged

Fix #2028: fix: memory error#2036
syzsunshine219 merged 3 commits into
MemTensor:dev-v2.0.22from
Memtensor-AI:bugfix/autodev-2028

Conversation

@Memtensor-AI

Copy link
Copy Markdown
Collaborator

Description

Fixes #2028 by recovering the Hermes MemOS plugin from a hung Node bridge subprocess. Issue #2028 has a byte-for-byte identical error signature to already-solved #1722 ([timeout] memory.search did not respond within 30.0s + [timeout] turn.end did not respond within 30.0s), but the #1722 fix was never merged onto base branch dev-20260624-v2.0.22; the current base still carries all three cooperating defects. This change ports that fix forward.

The fix addresses three Python-adapter defects that turn "Node bridge subprocess dies quietly" into "every memory tool times out for 30 s and then fails forever". First, MemosBridgeClient._read_loop was wrapped in try/finally that calls the new _abort_pending() helper on any exit, waking every parked JSON-RPC waiter with transport_closed in under a second (previously they parked on the full per-request timeout). Second, MemosBridgeClient.request now checks Popen.poll() before writing, so a request against an already-exited subprocess fast-fails instead of buffering into a dead pipe. Third, the keepalive loop was routed through a new _should_reconnect_after_keepalive_failure() predicate that reconnects on BridgeError("timeout", …) and dead-subprocess in addition to the pre-existing transport_closed trigger. Finally, all six read-path memory tools (memos_search, memos_get, memos_timeline, memos_environment, memos_skill_list, memos_skill_get) were routed through the new _bridge_request_with_retry() helper that reconnects + retries once on transport_closed, mirroring the existing sync_turn pattern.

Verified with 7 new unit tests (2 at bridge level: reader-exit abort + poll-based fast-fail; 3 at keepalive predicate level: timeout / dead-subprocess / no-storm-on-transient; 2 at read-path retry level: reconnect-and-recover + retry-still-fails-surfaces-error). Combined test_bridge_client + test_hermes_provider_pipeline run: 49 tests pass in 0.28 s — down from 32.19 s pre-fix, direct evidence that the reader-thread abort actually wakes pending waiters. ruff check and ruff format --check clean across the plugin and its tests.

Scope is Python-only: no wire-protocol change, no dependency changes, no Node bridge changes, no config changes. The per-request 30 s timeout budget is preserved for genuinely slow calls. Shutdown races are unchanged (the existing _bridge_keepalive_stop gate still guards reconnects). Reviewers: @MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller.

Related Issue (Required): Fixes #2028

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g. code style improvements, linting)
  • Documentation update

How Has This Been Tested?

Automated tests are pending.

  • Unit Test
  • Test Script Or Test Steps (please provide)
  • Pipeline Automated API Test (please provide)

Checklist

  • I have performed a self-review of my own code
  • I have commented my code in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have created related documentation issue/PR in MemOS-Docs (if applicable)
  • I have linked the issue to this PR (if applicable)
  • I have mentioned the person who will review this PR

@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.

Reviewer Checklist

…bridge

Ports the archived MemTensor#1722 fix onto dev-20260624-v2.0.22, which never
received it. Three cooperating defects in the Hermes MemOS Python
adapter turn "Node bridge subprocess dies quietly" into "every memory
tool times out for 30 s and then fails forever":

1. `MemosBridgeClient._read_loop` iterated `for line in stdout:` and
   returned silently on EOF, leaving pending JSON-RPC waiters parked
   on their per-request timeout. Now wrapped in try/finally with a
   new `_abort_pending()` helper that wakes every waiter with
   `transport_closed` in under a second.
2. `MemosBridgeClient.request` now checks `Popen.poll()` before
   writing, so a request against an already-exited subprocess fast-
   fails with `transport_closed` instead of buffering into a dead
   pipe.
3. `MemTensorProvider` keepalive previously reconnected only on
   `transport_closed`; a hung bridge (`BridgeError("timeout", …)`)
   was dropped at DEBUG. New helper
   `_should_reconnect_after_keepalive_failure()` reconnects on
   timeout, transport_closed, or any error observed while the
   subprocess is already dead.
4. Read-path tools (`memos_search`, `memos_get`, `memos_timeline`,
   `memos_environment`, `memos_skill_list`, `memos_skill_get`) now
   route through `_bridge_request_with_retry()` that reconnects and
   retries once on `transport_closed`, mirroring the `sync_turn`
   pattern.

Verified: 33 bridge tests + 16 pipeline tests pass; combined run 49
tests OK in 0.28 s (was 32 s pre-fix — direct evidence the reader-
thread abort actually wakes pending waiters). Ruff check + format
clean. No wire-protocol change, no new dependencies.

Fixes MemTensor#2028
@Memtensor-AI

Copy link
Copy Markdown
Collaborator Author

🤖 Open Code Review

Target: PR #2036
Task: f7f2ab26022eef23
Base: dev-20260624-v2.0.22
Head: bugfix/autodev-2028

🔍 OpenCodeReview found 7 issue(s) in this PR.


1. apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py (L1753-L1757)

Fragile assertion after _reconnect_bridge during shutdown.

_reconnect_bridge has two early-return paths (lines ~1798, ~1804) that exit without setting self._bridge when _bridge_keepalive_stop is set (i.e., during shutdown). If a tool call races with shutdown — the transport closes, the retry enters here, and _reconnect_bridge returns early — self._bridge remains None and the assert raises AssertionError with no diagnostic detail.

Although handle_tool_call wraps the whole dispatch in except Exception, the error surfaced to the model would be the opaque string "AssertionError". Replace the assert with an explicit, descriptive raise:

self._reconnect_bridge(self._session_id, timeout=30.0)
if self._bridge is None:
    raise BridgeError("transport_closed", "bridge unavailable after reconnect attempt")
kwargs = {} if timeout is None else {"timeout": timeout}
return self._bridge.request(method, params, **kwargs)

2. apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py (L1789)

Tight coupling to a private BridgeClient internal (_proc).

Accessing bridge._proc directly ties MemTensorProvider to the internal implementation detail of MemosBridgeClient. The type: ignore[attr-defined] comment acknowledges this already. If MemosBridgeClient is refactored (e.g., _proc renamed or wrapped), this silently returns exit_code = None (due to the broad except Exception) and the dead-subprocess detection disappears without any warning — the very scenario this logic is meant to handle.

bridge_client.py already calls self._proc.poll() internally in the newly added Layer 2 check. Consider exposing a public is_alive() -> bool method on MemosBridgeClient:

# In MemosBridgeClient:
def is_alive(self) -> bool:
    """Return True if the bridge subprocess is still running."""
    with contextlib.suppress(Exception):
        return self._proc.poll() is None
    return False

Then here:

if bridge is not None and hasattr(bridge, "is_alive") and not bridge.is_alive():
    return True

3. apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py (L1740-L1757)

Four-way conditional duplication of the timeout branch.

The if timeout is None / else guard is repeated identically in both the initial call and the retry, yielding four nearly identical branches. This makes the method harder to read and maintain — any future change to the request signature must be applied in four places.

Simplify with a kwargs dict:

kwargs: dict[str, Any] = {} if timeout is None else {"timeout": timeout}

assert self._bridge is not None
try:
    return self._bridge.request(method, params, **kwargs)
except BridgeError as err:
    if not self._is_transport_closed(err):
        raise
    logger.info(
        "MemOS: bridge transport closed on %s; reconnecting and retrying once — %s",
        method,
        err,
    )
    self._reconnect_bridge(self._session_id, timeout=30.0)
    if self._bridge is None:
        raise BridgeError("transport_closed", "bridge unavailable after reconnect attempt")
    return self._bridge.request(method, params, **kwargs)

4. apps/memos-local-plugin/tests/python/test_bridge_client.py (L361-L367)

Flaky race condition: time.sleep(0.05) is not a reliable synchronisation barrier. The goal is to ensure client.request() has registered its waiter into _pending before _done = True kills the reader loop. On a loaded CI runner, the 50 ms window can be insufficient — the worker thread may not yet have acquired self._lock and inserted its entry into _pending, causing the reader's _abort_pending() finally: block to fire on an empty dict. The waiter is then enqueued after the reader exits, so no one ever sets its event; the worker parks for 10 s until its own per-request timeout fires. assertFalse(worker.is_alive()) then fails after 2 s.

Replace the sleep with an explicit synchronisation primitive. For example, add a short-circuit threading.Event that is set by a thin monkeypatch after _pending registration:

_registered = threading.Event()
_orig_request = client.request
def _instrumented(*args, **kwargs):
    result = _orig_request(*args, **kwargs)
    return result
# OR: monkeypatch _pending insertion via a subclass hook

Alternatively, expose a test-only _pending_event in MemosBridgeClient, or convert the test to a synchronous unit test on _abort_pending directly.


5. apps/memos-local-plugin/tests/python/test_bridge_client.py (L67-L69)

Class-level mutable attribute is fragile: poll_return is declared at class level (outside __init__). While self._fake.poll_return = 137 does create an instance-level shadow and won't affect other existing instances, the default None lives in the class dict rather than on each instance. If a future test creates multiple FakePopen instances, mutates poll_return on the class (e.g. FakePopen.poll_return = 1), or forgets to assign at the instance level, all instances silently share the mutation.

Initialise it in __init__ alongside the other instance attributes:

self.poll_return: int | None = None

6. apps/memos-local-plugin/tests/python/test_bridge_client.py (L361-L367)

Race condition: time.sleep(0.05) is not a reliable synchronisation barrier. The goal is to ensure client.request() has registered its waiter in _pending before _done = True terminates the reader loop. On a loaded CI runner the 50 ms window can be too short — the worker thread may not yet have acquired self._lock and inserted its entry, so the reader's _abort_pending() finally: block fires on an empty dict. The waiter is then enqueued after the reader has already exited; no one ever sets its event, and the worker parks for the full 10 s per-request timeout. assertFalse(worker.is_alive()) then fails after 2 s.

Replace the sleep with an explicit synchronisation primitive. For example, expose a threading.Event that fires inside MemosBridgeClient.request() just after the _pending insertion (as a test-only hook), or drive the test synchronously by testing _abort_pending() directly.


7. apps/memos-local-plugin/tests/python/test_bridge_client.py (L67-L69)

Class-level mutable attribute: poll_return is declared at class scope, outside __init__. While self._fake.poll_return = 137 does create an instance-level shadow (so it won't affect other existing instances), the default None lives in the class dict rather than on each instance. A future test that assigns directly on the class (FakePopen.poll_return = 1) or forgets the instance-level assignment would silently affect all instances.

Initialise it in __init__ alongside the other instance attributes:

self.poll_return: int | None = None

Generated by cloud-assistant via Open Code Review.

@Memtensor-AI Memtensor-AI changed the base branch from dev-20260624-v2.0.22 to dev-v2.0.22 July 2, 2026 07:28
…-2028

# Conflicts:
#	apps/memos-local-plugin/adapters/hermes/memos_provider/__init__.py
@Memtensor-AI

Copy link
Copy Markdown
Collaborator Author

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully after resolving the merge conflict.

  • Run: tr-5cdd93d5-273 on cloud test-engine 10012
  • Scope: memos_local_plugin/changed-repo-python
  • Result: 41 passed, 0 failed, 0 skipped

Local verification during conflict resolution also passed: apps/memos-local-plugin/tests/python/test_bridge_client.py 41/41.

Manual code review is still required before merge.

@Memtensor-AI

Copy link
Copy Markdown
Collaborator Author

Automated Test Results: PASSED

Cloud test-engine rerun after resolving the dev-v2.0.22 merge conflict completed successfully.

  • Run: tr-084b1a4b-9a3
  • Head: 9c085ad3455a10a00a08836e1a57380847b8009e
  • Scopes tested: memos_local_plugin
  • Result: 47/47 tests passed
  • Duration: 2s

The PR is now mergeable on GitHub. Manual review is still required before merge.

@syzsunshine219 syzsunshine219 merged commit 1771568 into MemTensor:dev-v2.0.22 Jul 2, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-generated bug Something isn't working | 功能异常

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants