fix(bridge): add 20s timeout guard to core.shutdown() to prevent orphaned processes #1799

Open
chiefmojo wants to merge 7 commits into MemTensor:main from chiefmojo:fix/bridge-shutdown-timeout

chiefmojo commented May 24, 2026

Problem

When the Hermes gateway dies abnormally (SIGKILL, OOM, crash), the --no-viewer bridge process calls core.shutdown(), which chains through flush() → L2/L3 LLM calls that can block indefinitely. With the parent gone, nothing remains to send SIGKILL to the bridge. Over 36 hours, 19 orphaned bridges accumulated, consuming 299% CPU and writing duplicate traces (6,572 copies of a single turn).

Fixes #1798.

Changes

Adds a withShutdownTimeout() helper that races core.shutdown() against a 20-second deadline. Wraps all six core.shutdown() call sites (the EADDRINUSE exit accounts for two):

  1. Daemon SIGTERM handler (bridge.cts)
  2. Non-daemon SIGTERM handler (bridge/stdio.ts)
  3. Headless stdin-EOF exit (bridge/stdio.ts)
  4. EADDRINUSE exit (×2, bridge.cts): daemon cannot bind viewer port
  5. Viewer-running keepalive path (bridge.cts): stdin closes while viewer is still serving

Also adds bridge-shutdown-audit.md documenting the full call chain through flush() → L2/L3/skill, confirming all async operations yield the event loop and the 20s timeout is effective on every path.

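const SHUTDOWN_TIMEOUT_MS = 20_000;
// Settles with p, or resolves after SHUTDOWN_TIMEOUT_MS, whichever comes first.
// The timeout branch resolves rather than rejects, so an awaiting caller proceeds normally.
function withShutdownTimeout(p: Promise<void>): Promise<void> {
  return Promise.race([p, new Promise<void>((r) => setTimeout(r, SHUTDOWN_TIMEOUT_MS))]);
}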
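
To illustrate how a call site might use such a helper, here is a minimal sketch, not the PR's actual code. It assumes a hypothetical `Core` interface and introduces two hypothetical names, `raceShutdown` and `makeShutdownHandler`. Unlike the helper above, this variant clears its timer and reports which side of the race won, so a timeout can be logged. The exit code of 0 and the injected `exit`/`log` callbacks are assumptions made so the sketch can run without a real process.

```typescript
// Hypothetical shape of the object whose shutdown() may hang; not the real core API.
interface Core {
  shutdown(): Promise<void>;
}

type ShutdownOutcome = "completed" | "timed_out";

// Variant of the race that clears its timer and reports which side won.
function raceShutdown(p: Promise<void>, timeoutMs: number): Promise<ShutdownOutcome> {
  return new Promise<ShutdownOutcome>((resolve, reject) => {
    const timer = setTimeout(() => resolve("timed_out"), timeoutMs);
    p.then(
      () => {
        clearTimeout(timer);
        resolve("completed");
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Builds a handler that runs shutdown at most once and always calls exit().
// exit and log are injected so the sketch runs without a real process.
function makeShutdownHandler(
  core: Core,
  exit: (code: number) => void,
  log: (msg: string) => void,
  timeoutMs = 20_000,
): () => Promise<void> {
  let started = false;
  return async () => {
    if (started) return; // ignore repeated signals
    started = true;
    try {
      const outcome = await raceShutdown(core.shutdown(), timeoutMs);
      if (outcome === "timed_out") log(`shutdown exceeded ${timeoutMs}ms, exiting anyway`);
    } catch (err) {
      log(`shutdown failed: ${String(err)}`);
    } finally {
      exit(0); // exit code is an assumption, not taken from the PR
    }
  };
}
```

In a Node.js process this could be registered with something like `process.once("SIGTERM", makeShutdownHandler(core, process.exit, console.error))`. Clearing the timer matters mainly when the process keeps running after the race; when the handler exits immediately, as here, it is a tidiness measure.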

Verification

  • Code audit — traced every shutdown path, confirmed no sync blocking ops prevent the timeout from firing
  • Stress test — kill gateway with SIGKILL, verify bridges exit within 20s, zero orphans
  • Watchdog — cron job monitors bridge count across companions, alerts on accumulation
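
The code-audit item matters because a timer-based deadline can only fire when the event loop is free. The sketch below demonstrates this under stated assumptions: `blockSync` is a hypothetical stand-in for any synchronous blocking call, and `measureTimerDelay` is a hypothetical helper that reports how long a timer actually took to fire. Neither name comes from the PR.

```typescript
// Busy-waits synchronously; a stand-in for a hypothetical blocking call.
function blockSync(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: the event loop cannot run timers while this loop executes
  }
}

// Schedules a timer, then runs `work` synchronously, and resolves with the
// number of milliseconds that elapsed before the timer callback ran.
function measureTimerDelay(timerMs: number, work: () => void): Promise<number> {
  return new Promise<number>((resolve) => {
    const start = Date.now();
    setTimeout(() => resolve(Date.now() - start), timerMs);
    work(); // runs to completion before the timer can be serviced
  });
}
```

Calling `measureTimerDelay(10, () => blockSync(100))` resolves with a value of at least 100, even though the timer was set for 10 ms. A synchronous stall inside a shutdown path would delay a 20s deadline in the same way, which is why the audit checks that every async operation yields.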

Related

This PR addresses the bridge-side half of the shutdown problem. The adapter-side complement — moving the synchronous session.close HTTP call off the asyncio event loop thread to prevent Discord heartbeat stalls when the bridge is unresponsive — is in #1953 (fix(adapter): fire session.close in daemon thread to unblock event loop).

core.shutdown() drains the L2/L3/skill flush pipeline which can block
on hanging LLM calls. Without a deadline the bridge never exits after
stdin EOF when the Python parent is already gone, re-creating the
process-leak condition. Race all three shutdown sites (daemon SIGTERM,
non-daemon SIGTERM, headless stdin-EOF) against a 20s timeout so the
process always terminates within a bounded time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fayenix force-pushed the fix/bridge-shutdown-timeout branch from 73cce94 to 71b669d on May 24, 2026 20:37

Three sites missed by the original patch (56ebe7a3) were calling
core.shutdown() without withShutdownTimeout, leaving the bridge process
able to hang indefinitely if L2/L3/skill LLM calls stalled at shutdown:

  • bridge.cts: EADDRINUSE exit (×2) — daemon can't bind viewer port
  • bridge.cts: viewer-running keepalive path — stdin closes but viewer
    is still serving; core.shutdown fires from the interval callback

All six core.shutdown() call sites now go through withShutdownTimeout,
guaranteeing the bridge exits within 20s regardless of which path is
taken. Adds bridge-shutdown-audit.md documenting the full call chain and
confirming no blocking sync calls prevent the timeout from firing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Memtensor-AI changed the base branch from main to dev-20260604-v2.0.19 on June 10, 2026 15:40

Memtensor-AI and others added 5 commits on June 14, 2026 17:24

docs(memos-local-plugin): clarify install path and stale dir names (MemTensor#1540)

The README's 'Quick start' section told users to use install.sh instead
of npm install, but the warning was buried and users still tried
'npm install -g @memtensor/memos-local-plugin' first. The reporter in
MemTensor#1540 encountered this on a Hermes deployment.

This change:

- Promotes the 'do not run npm install -g' notice to a prominent
  IMPORTANT callout explaining why global install is wrong (no
  agent-home deploy, no config.yaml, no bridge/viewer) and that the
  tarball intentionally ships built artifacts only.
- Adds a Troubleshooting subsection covering the two specific symptoms
  in the bug report: the 'package not found' misread, and the stale
  web/ and site/ directory names (web/ is now viewer/, site/ was
  removed by commit 26e7e3d).
- Mentions install.ps1 for Windows alongside install.sh.
- CHANGELOG: record the docs fix and reference MemTensor#1540.

Documentation-only change; no code or runtime behavior touched.

Co-authored-by: MemOS AutoDev <autodev@memtensor.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…_() got an unexpected keyword a (MemTensor#1889)

fix: remove invalid chunker parameter from SystemParser test instantiation

- SystemParser.__init__() signature changed to (embedder, llm=None)
- Test was still passing chunker=None causing TypeError
- Fixes all 5 failing tests in test_system_parser.py

Fixes MemTensor#1888

Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…tributeError when given None (MemTensor#1884)

* test: add comprehensive tests for clean_json_response (issue MemTensor#1525)

- Add test suite in tests/mem_os/test_format_utils.py
- Cover None input ValueError with diagnostic message
- Cover markdown removal, whitespace stripping, edge cases
- Verify fix for AttributeError when LLM returns None

* style: format clean_json_response tests

---------

Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…date_cube_access — fails for ev (MemTensor#1903)

fix: validate current user not target in share_cube_with_user (MemTensor#1901)

share_cube_with_user(cube_id, target_user_id) called
_validate_cube_access(cube_id, target_user_id), but the validator
signature is (user_id, cube_id). The cube_id therefore landed in the
user_id slot and _validate_user_exists raised
"User '<cube_id>' does not exist or is inactive" for every well-formed
call, making the API unusable.

The in-code comment "Validate current user has access to this cube"
already documented the correct intent: the sharing user (self.user_id)
must have access to the cube being shared, not the target. Switch the
call to self._validate_cube_access(self.user_id, cube_id). The target
user's existence is independently checked on the next line via
validate_user(target_user_id), so that path is unchanged.

Add regression tests in tests/mem_os/test_memos_core.py that pin down:
- validate_user_cube_access is consulted with (self.user_id, cube_id),
- add_user_to_cube is called with (target_user_id, cube_id) on success,
- a missing target raises "Target user '<id>' does not exist".

Closes MemTensor#1901

Co-authored-by: MemOS AutoDev Bot <autodev@memtensor.local>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

Memtensor-AI changed the base branch from dev-20260604-v2.0.19 to dev-v2.0.22 on July 1, 2026 13:16

Memtensor-AI (Collaborator)

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully.

  • Run: tr-0c78cd87-439 on cloud test-engine 10011
  • memos_local_plugin/unit: 33 passed, 0 failed, 0 skipped

Manual code review is still required before merge.

CarltonXiang deleted the branch MemTensor:main on July 3, 2026 07:25
syzsunshine219 reopened this on Jul 3, 2026
syzsunshine219 added the needs-audit label (Requires manual audit before merge) on Jul 3, 2026
syzsunshine219 changed the base branch from dev-v2.0.22 to main on July 3, 2026 08:22
Labels

needs-audit Requires manual audit before merge

Development

Successfully merging this pull request may close these issues.

bridge: core.shutdown() hangs indefinitely when gateway dies abnormally (orphaned bridge)

5 participants