fix(bridge): add 20s timeout guard to core.shutdown() to prevent orphaned processes by chiefmojo · Pull Request #1799 · MemTensor/MemOS

chiefmojo · 2026-05-24T20:28:46Z

Problem

When the Hermes gateway dies abnormally (SIGKILL, OOM, crash), the --no-viewer bridge process calls core.shutdown(), which chains through flush() → L2/L3 LLM calls that can block indefinitely. With the parent gone, nothing remains to send SIGKILL to the bridge. Over 36 hours, 19 orphaned bridges accumulated, consuming 299% CPU and writing duplicate traces (6,572 copies of a single turn).

Fixes #1798.

Changes

Adds a withShutdownTimeout() helper that races core.shutdown() against a 20-second deadline. Wraps all six core.shutdown() call sites:

Daemon SIGTERM handler (bridge.cts)
Non-daemon SIGTERM handler (bridge/stdio.ts)
Headless stdin-EOF exit (bridge/stdio.ts)
EADDRINUSE exit (×2) — daemon cannot bind viewer port
Viewer-running keepalive path — stdin closes while viewer is still serving

Also adds bridge-shutdown-audit.md documenting the full call chain through flush() → L2/L3/skill, confirming all async operations yield the event loop and the 20s timeout is effective on every path.

const SHUTDOWN_TIMEOUT_MS = 20_000;
function withShutdownTimeout(p: Promise<void>): Promise<void> {
  return Promise.race([p, new Promise<void>((r) => setTimeout(r, SHUTDOWN_TIMEOUT_MS))]);
}
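
For readers who want to see how the race behaves, here is a minimal sketch of a variant that also reports which side won and clears its timer once the race settles. It is not the code in this PR: raceShutdown, ShutdownOutcome, and the timeoutMs parameter are hypothetical names introduced for illustration. As with the helper above, a rejection from the shutdown promise still propagates to the caller.

```typescript
// Hypothetical variant of the helper above. Only SHUTDOWN_TIMEOUT_MS and its
// value come from the PR; every other name here is illustrative.
const SHUTDOWN_TIMEOUT_MS = 20_000;

type ShutdownOutcome = "completed" | "timed-out";

function raceShutdown(
  p: Promise<void>,
  timeoutMs: number = SHUTDOWN_TIMEOUT_MS,
): Promise<ShutdownOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves with "timed-out" once the deadline passes.
  const deadline = new Promise<ShutdownOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });

  // Resolves with "completed" if shutdown finishes first.
  const done = p.then((): ShutdownOutcome => "completed");

  // Clear the timer once the race settles so it does not keep the
  // event loop alive after an early, successful shutdown.
  return Promise.race([done, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Returning the outcome would let a call site log when the deadline was hit, which may help when diagnosing hanging LLM calls.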

Verification

Code audit — traced every shutdown path, confirmed no sync blocking ops prevent the timeout from firing
Stress test — kill gateway with SIGKILL, verify bridges exit within 20s, zero orphans
Watchdog — cron job monitors bridge count across companions, alerts on accumulation
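
The code audit matters because a timer-based deadline can only fire when the event loop gets a chance to run. The sketch below illustrates that general property; it is not taken from the PR or the audit document, and busyWait and measureTimerDelay are hypothetical helpers.

```typescript
// Illustrative only: shows why the audit looks for synchronous blocking.

// Spins for roughly `ms` milliseconds without yielding to the event loop.
function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Intentionally empty: simulates a blocking synchronous operation.
  }
}

// Schedules a timer for `timerMs`, then blocks synchronously for `blockMs`.
// Resolves with the number of milliseconds that elapsed before the timer
// callback actually ran.
function measureTimerDelay(timerMs: number, blockMs: number): Promise<number> {
  return new Promise<number>((resolve) => {
    const start = Date.now();
    setTimeout(() => resolve(Date.now() - start), timerMs);
    busyWait(blockMs); // the timer callback cannot run until this returns
  });
}
```

With a 10 ms timer and 100 ms of synchronous work, the measured delay is at least 100 ms. A blocking call inside the shutdown chain would push the 20s deadline back in the same way, which is the condition the audit rules out.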

Related

This PR addresses the bridge-side half of the shutdown problem. The adapter-side complement, which moves the synchronous session.close HTTP call off the asyncio event loop thread to prevent Discord heartbeat stalls when the bridge is unresponsive, is in #1953 (fix(adapter): fire session.close in daemon thread to unblock event loop).

core.shutdown() drains the L2/L3/skill flush pipeline, which can block on hanging LLM calls. Without a deadline the bridge never exits after stdin EOF when the Python parent is already gone, re-creating the process-leak condition.

Race all three shutdown sites (daemon SIGTERM, non-daemon SIGTERM, headless stdin-EOF) against a 20s timeout so the process always terminates within a bounded time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three sites missed by the original patch (56ebe7a3) were calling core.shutdown() without withShutdownTimeout, leaving the bridge process able to hang indefinitely if L2/L3/skill LLM calls stalled at shutdown:

• bridge.cts: EADDRINUSE exit (×2), where the daemon can't bind the viewer port
• bridge.cts: viewer-running keepalive path, where stdin closes but the viewer is still serving; core.shutdown fires from the interval callback

All six core.shutdown() call sites now go through withShutdownTimeout, guaranteeing the bridge exits within 20s regardless of which path is taken.

Adds bridge-shutdown-audit.md documenting the full call chain and confirming no blocking sync calls prevent the timeout from firing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
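
One way to picture a guarded call site is a single exit handler that every path shares. The following is a rough sketch under stated assumptions, not the structure used in bridge.cts or bridge/stdio.ts: CoreLike, withTimeout, and makeExitHandler are hypothetical names, and the exit function is injected so the sketch can run without terminating the process.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR.
interface CoreLike {
  shutdown(): Promise<void>;
}

// Same race as the PR's helper, with the deadline passed in as a parameter.
function withTimeout(p: Promise<void>, ms: number): Promise<void> {
  return Promise.race([p, new Promise<void>((r) => setTimeout(r, ms))]);
}

// Builds a handler that waits for shutdown (bounded by timeoutMs) and then
// calls exit, even when shutdown hangs or rejects.
function makeExitHandler(
  core: CoreLike,
  exit: (code: number) => void,
  timeoutMs: number = 20_000,
): (code: number) => Promise<void> {
  return async (code: number): Promise<void> => {
    try {
      await withTimeout(core.shutdown(), timeoutMs);
    } catch {
      // A failed shutdown should not prevent the process from exiting.
    }
    exit(code);
  };
}
```

In a real bridge the injected exit would presumably be process.exit, and the handler would be registered for SIGTERM, stdin EOF, and the EADDRINUSE paths; that wiring is an assumption rather than something shown in this PR.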

docs(memos-local-plugin): clarify install path and stale dir names (MemTensor#1540)

The README's 'Quick start' section told users to use install.sh instead of npm install, but the warning was buried and users still tried 'npm install -g @memtensor/memos-local-plugin' first. The reporter in MemTensor#1540 encountered this on a Hermes deployment.

This change:
- Promotes the 'do not run npm install -g' notice to a prominent IMPORTANT callout explaining why global install is wrong (no agent-home deploy, no config.yaml, no bridge/viewer) and that the tarball intentionally ships built artifacts only.
- Adds a Troubleshooting subsection covering the two specific symptoms in the bug report: the 'package not found' misread, and the stale web/ and site/ directory names (web/ is now viewer/, site/ was removed by commit 26e7e3d).
- Mentions install.ps1 for Windows alongside install.sh.
- CHANGELOG: record the docs fix and reference MemTensor#1540.

Documentation-only change; no code or runtime behavior touched.

Co-authored-by: MemOS AutoDev <autodev@memtensor.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…_() got an unexpected keyword a (MemTensor#1889)

fix: remove invalid chunker parameter from SystemParser test instantiation

- SystemParser.__init__() signature changed to (embedder, llm=None)
- Test was still passing chunker=None, causing TypeError
- Fixes all 5 failing tests in test_system_parser.py

Fixes MemTensor#1888

Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…tributeError when given None (MemTensor#1884)

* test: add comprehensive tests for clean_json_response (issue MemTensor#1525)
  - Add test suite in tests/mem_os/test_format_utils.py
  - Cover None input ValueError with diagnostic message
  - Cover markdown removal, whitespace stripping, edge cases
  - Verify fix for AttributeError when LLM returns None
* style: format clean_json_response tests

Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…date_cube_access — fails for ev (MemTensor#1903)

fix: validate current user not target in share_cube_with_user (MemTensor#1901)

share_cube_with_user(cube_id, target_user_id) called _validate_cube_access(cube_id, target_user_id), but the validator signature is (user_id, cube_id). The cube_id therefore landed in the user_id slot and _validate_user_exists raised "User '<cube_id>' does not exist or is inactive" for every well-formed call, making the API unusable.

The in-code comment "Validate current user has access to this cube" already documented the correct intent: the sharing user (self.user_id) must have access to the cube being shared, not the target. Switch the call to self._validate_cube_access(self.user_id, cube_id). The target user's existence is independently checked on the next line via validate_user(target_user_id), so that path is unchanged.

Add regression tests in tests/mem_os/test_memos_core.py that pin down:
- validate_user_cube_access is consulted with (self.user_id, cube_id),
- add_user_to_cube is called with (target_user_id, cube_id) on success,
- a missing target raises "Target user '<id>' does not exist".

Closes MemTensor#1901

Co-authored-by: MemOS AutoDev Bot <autodev@memtensor.local>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

Memtensor-AI · 2026-07-02T12:29:52Z

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully.

Run: tr-0c78cd87-439 on cloud test-engine 10011
memos_local_plugin/unit: 33 passed, 0 failed, 0 skipped

Manual code review is still required before merge.

fayenix force-pushed the fix/bridge-shutdown-timeout branch from 73cce94 to 71b669d Compare May 24, 2026 20:37

Memtensor-AI changed the base branch from main to dev-20260604-v2.0.19 June 10, 2026 15:40

Memtensor-AI and others added 5 commits June 14, 2026 17:24

Merge branch 'dev-20260604-v2.0.19' into fix/bridge-shutdown-timeout

c8beaaa

chiefmojo mentioned this pull request Jun 21, 2026

fix(adapter): fire session.close in daemon thread to unblock event loop #1953

Open

Memtensor-AI changed the base branch from dev-20260604-v2.0.19 to dev-v2.0.22 July 1, 2026 13:16

CarltonXiang deleted the branch MemTensor:main July 3, 2026 07:25

CarltonXiang closed this Jul 3, 2026

syzsunshine219 reopened this Jul 3, 2026

syzsunshine219 added the needs-audit Requires manual audit before merge label Jul 3, 2026

syzsunshine219 changed the base branch from dev-v2.0.22 to main July 3, 2026 08:22

chiefmojo wants to merge 7 commits into MemTensor:main from chiefmojo:fix/bridge-shutdown-timeout


5 participants
