fix(bridge): add 20s timeout guard to core.shutdown() to prevent orphaned processes #1799
Open
chiefmojo wants to merge 7 commits into
core.shutdown() drains the L2/L3/skill flush pipeline, which can block on hanging LLM calls. Without a deadline, the bridge never exits after stdin EOF when the Python parent is already gone, re-creating the process-leak condition.
Race all three shutdown sites (daemon SIGTERM, non-daemon SIGTERM, headless stdin-EOF) against a 20s timeout so the process always terminates within a bounded time.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
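The race described in this commit message can be pictured with a minimal TypeScript sketch. It is not the code from this PR: the signature, the outcome type, and the default constant below are assumptions made for illustration. Only the name withShutdownTimeout and the 20-second deadline come from the description.

```typescript
// Illustrative sketch only; the real helper's signature is not shown on this page.
const SHUTDOWN_TIMEOUT_MS = 20_000; // 20s deadline named in the PR title

// Hypothetical outcome type, introduced here so callers can log what happened.
type ShutdownOutcome = "completed" | "failed" | "timed_out";

function withShutdownTimeout(
  shutdown: () => Promise<void>,
  timeoutMs: number = SHUTDOWN_TIMEOUT_MS,
): Promise<ShutdownOutcome> {
  return new Promise<ShutdownOutcome>((resolve) => {
    // Deadline side of the race: settles the promise if shutdown hangs.
    const timer = setTimeout(() => resolve("timed_out"), timeoutMs);

    // Shutdown side of the race. Starting from Promise.resolve() also turns a
    // synchronous throw inside shutdown() into a rejection.
    Promise.resolve()
      .then(shutdown)
      .then(
        () => "completed" as const,
        () => "failed" as const,
      )
      .then((outcome) => {
        clearTimeout(timer); // do not leave the timer pending after a clean exit
        resolve(outcome); // no-op if the deadline already won the race
      });
  });
}
```

A call site would presumably await the helper and then call process.exit() whatever the outcome, since a timed-out shutdown() may still hold open handles that keep the event loop alive.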
73cce94 to 71b669d
Three sites missed by the original patch (56ebe7a3) were calling
core.shutdown() without withShutdownTimeout, leaving the bridge process
able to hang indefinitely if L2/L3/skill LLM calls stalled at shutdown:
• bridge.cts: EADDRINUSE exit (×2) — daemon can't bind viewer port
• bridge.cts: viewer-running keepalive path — stdin closes but viewer
is still serving; core.shutdown fires from the interval callback
All six core.shutdown() call sites now go through withShutdownTimeout,
guaranteeing the bridge exits within 20s regardless of which path is
taken. Adds bridge-shutdown-audit.md documenting the full call chain and
confirming no blocking sync calls prevent the timeout from firing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
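The audit's point about blocking sync calls matters because a timer-based deadline can only fire when the event loop gets a turn. A small sketch shows a short timer being held back by synchronous work. Here measureTimerDelay is a hypothetical helper written for this illustration, not part of the bridge.

```typescript
// Hypothetical helper: reports how many milliseconds after scheduling a timer
// actually fires when synchronous work occupies the event loop.
function measureTimerDelay(timerMs: number, blockMs: number): Promise<number> {
  return new Promise<number>((resolve) => {
    const start = Date.now();

    // The timer is due after timerMs, but its callback needs a free event loop.
    setTimeout(() => resolve(Date.now() - start), timerMs);

    // Synchronous busy-wait: no callbacks, including the timer above, can run
    // until this loop returns.
    while (Date.now() - start < blockMs) {
      // spin
    }
  });
}
```

With a 10 ms timer and 100 ms of synchronous spinning, the reported delay cannot be below 100 ms. The same effect would defeat a shutdown deadline if any step on the shutdown path blocked synchronously, which is what the audit rules out.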
docs(memos-local-plugin): clarify install path and stale dir names (MemTensor#1540)
The README's 'Quick start' section told users to use install.sh instead of npm install, but the warning was buried and users still tried 'npm install -g @memtensor/memos-local-plugin' first. The reporter in MemTensor#1540 encountered this on a Hermes deployment.
This change:
- Promotes the 'do not run npm install -g' notice to a prominent IMPORTANT callout explaining why global install is wrong (no agent-home deploy, no config.yaml, no bridge/viewer) and that the tarball intentionally ships built artifacts only.
- Adds a Troubleshooting subsection covering the two specific symptoms in the bug report: the 'package not found' misread, and the stale web/ and site/ directory names (web/ is now viewer/, site/ was removed by commit 26e7e3d).
- Mentions install.ps1 for Windows alongside install.sh.
- CHANGELOG: record the docs fix and reference MemTensor#1540.
Documentation-only change; no code or runtime behavior touched.
Co-authored-by: MemOS AutoDev <autodev@memtensor.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…_() got an unexpected keyword a (MemTensor#1889)
fix: remove invalid chunker parameter from SystemParser test instantiation
- SystemParser.__init__() signature changed to (embedder, llm=None)
- Test was still passing chunker=None causing TypeError
- Fixes all 5 failing tests in test_system_parser.py
Fixes MemTensor#1888
Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…tributeError when given None (MemTensor#1884)
* test: add comprehensive tests for clean_json_response (issue MemTensor#1525)
- Add test suite in tests/mem_os/test_format_utils.py
- Cover None input ValueError with diagnostic message
- Cover markdown removal, whitespace stripping, edge cases
- Verify fix for AttributeError when LLM returns None
* style: format clean_json_response tests
---------
Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…date_cube_access — fails for ev (MemTensor#1903)
fix: validate current user not target in share_cube_with_user (MemTensor#1901)
share_cube_with_user(cube_id, target_user_id) called _validate_cube_access(cube_id, target_user_id), but the validator signature is (user_id, cube_id). The cube_id therefore landed in the user_id slot and _validate_user_exists raised "User '<cube_id>' does not exist or is inactive" for every well-formed call, making the API unusable.
The in-code comment "Validate current user has access to this cube" already documented the correct intent: the sharing user (self.user_id) must have access to the cube being shared, not the target. Switch the call to self._validate_cube_access(self.user_id, cube_id). The target user's existence is independently checked on the next line via validate_user(target_user_id), so that path is unchanged.
Add regression tests in tests/mem_os/test_memos_core.py that pin down:
- validate_user_cube_access is consulted with (self.user_id, cube_id),
- add_user_to_cube is called with (target_user_id, cube_id) on success,
- a missing target raises "Target user '<id>' does not exist".
Closes MemTensor#1901
Co-authored-by: MemOS AutoDev Bot <autodev@memtensor.local>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
Collaborator
Automated Test Results: PASSED
Cloud test-engine rerun against
Manual code review is still required before merge.
Problem
When the Hermes gateway dies abnormally (SIGKILL, OOM, crash), the --no-viewer bridge process calls core.shutdown(), which chains through flush() → L2/L3 LLM calls that can block indefinitely. With the parent gone, nobody remains to send SIGKILL. Over 36 hours, 19 orphaned bridges accumulated, consuming 299% CPU and writing duplicate traces (6,572 copies of a single turn).
Fixes #1798.
Changes
Adds a withShutdownTimeout() helper that races core.shutdown() against a 20-second deadline. Wraps all six core.shutdown() call sites, which live in bridge.cts and bridge/stdio.ts.
Also adds bridge-shutdown-audit.md documenting the full call chain through flush() → L2/L3/skill, confirming all async operations yield the event loop and the 20s timeout is effective on every path.
Verification
Related
This PR addresses the bridge-side half of the shutdown problem. The adapter-side complement, which moves the synchronous session.close HTTP call off the asyncio event loop thread to prevent Discord heartbeat stalls when the bridge is unresponsive, is in #1953 (fix(adapter): fire session.close in daemon thread to unblock event loop).