fix(bridge): detect silent pipeline init failure and configurable watchdog by chiefmojo · Pull Request #1849 · MemTensor/MemOS

chiefmojo · 2026-06-01T01:53:09Z

Summary

Two fixes to the daemon bridge startup path that prevent silent scoring outages when core.init() hangs.

Problem

When the bridge starts in daemon mode, core.init() runs synchronously and can take 90–180 seconds for large databases (Qwen 235B scoring 250+ trace episodes). During this window:

- The HTTP server is not yet serving, so the Python adapter's startup probe times out and spawns a replacement daemon; the original bridge is left running headless.
- If core.init() neither resolves nor rejects (silent hang, e.g. LLM provider unreachable), the bridge appears healthy but the scoring pipeline is never wired. All subsequent episodes are captured but never scored.

Fix 1: async `core.init()` with `pipelineReady` flag

core.init() is now kicked off in the background immediately after the HTTP server binds. Health endpoints report pipelineReady: false until init completes, allowing the adapter to distinguish "starting" from "stuck". The HTTP server stays responsive to probes throughout init.
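A minimal sketch of the readiness flag described above, under stated assumptions: `BridgeState`, `startInitInBackground`, and `healthPayload` are hypothetical names, not identifiers from bridge.cts, and the `"ok"` value for the healthy status is a guess (only `"degraded"` appears in this PR). HTTP wiring is omitted; the idea is that the health handler reads the state object while init is still running.

```typescript
// Hypothetical shapes for illustration; the real bridge state is not shown in this PR.
interface BridgeState {
  pipelineReady: boolean;   // flips to true only after init() has fully completed
  initError: string | null; // set if init() rejects
}

interface HealthPayload {
  pipelineReady: boolean;
  healthStatus: "ok" | "degraded"; // "ok" is an assumed name for the healthy value
}

// Kick off init without awaiting it, so the caller can keep serving probes.
// The returned promise settles when init settles; it never rejects.
function startInitInBackground(
  init: () => Promise<void>,
  state: BridgeState,
): Promise<void> {
  // Promise.resolve().then(init) also captures a synchronous throw from init().
  return Promise.resolve()
    .then(init)
    .then(
      () => {
        state.pipelineReady = true;
      },
      (err: unknown) => {
        state.initError = err instanceof Error ? err.message : String(err);
      },
    );
}

// What a health endpoint could report at any moment during or after init.
function healthPayload(state: BridgeState): HealthPayload {
  return {
    pipelineReady: state.pipelineReady,
    healthStatus: state.pipelineReady ? "ok" : "degraded",
  };
}
```

With this split, a probe that sees `pipelineReady: false` knows the process is alive but not yet scoring, which is the "starting" case; telling "starting" from "stuck" still needs a deadline, which is what the watchdog provides.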

Fix 2: init watchdog (`initWatchdogMs`)

A configurable timeout (default 120 s, config key bridge.initWatchdogMs) races against core.init(). If init does not complete within the window, the bridge force-exits so the gateway respawns a fresh process rather than silently dropping scoring for hours.

Note for operators: Production experience shows scoring for large databases (250+ traces) can take 90–180 s. The 120 s default is intentionally conservative and falls inside that range, so a slow but healthy init can trip it. Installations with large databases or slow LLM providers should set bridge.initWatchdogMs to at least 300000 (5 minutes).
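The race between init and the watchdog can be sketched as follows. This is an illustration under assumptions, not the code in bridge.cts: `raceInitAgainstWatchdog` and `InitOutcome` are hypothetical names, and the timeout action is injected as a callback so the sketch can run without terminating the process. In a daemon, that callback would presumably log and call `process.exit` with a non-zero code.

```typescript
// Hypothetical outcome labels for illustration.
type InitOutcome = "ready" | "failed" | "timed-out";

// Race init() against a deadline. Whichever settles first decides the outcome.
// A hung init() (a promise that never settles) is only ever caught by the timer.
function raceInitAgainstWatchdog(
  init: () => Promise<void>,
  watchdogMs: number,
  onTimeout: () => void, // e.g. () => process.exit(1) in a real daemon (assumption)
): Promise<InitOutcome> {
  return new Promise<InitOutcome>((resolve) => {
    const timer = setTimeout(() => {
      onTimeout();
      resolve("timed-out");
    }, watchdogMs);

    // Promise.resolve().then(init) also captures a synchronous throw from init().
    Promise.resolve()
      .then(init)
      .then(
        () => {
          clearTimeout(timer); // init won the race; disarm the watchdog
          resolve("ready");
        },
        () => {
          clearTimeout(timer); // an explicit rejection is not a hang
          resolve("failed");
        },
      );
  });
}
```

Clearing the timer on both success and rejection matters: the watchdog exists for the case where init neither resolves nor rejects, and an armed timer left behind would otherwise fire later or keep the event loop alive.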

Test plan

- Small DB: core.init() completes well before the watchdog fires; normal startup
- Simulated slow init: pipelineReady: false visible in /api/v1/health during init
- Simulated hung init: bridge exits after initWatchdogMs ms, gateway respawns

🤖 Generated with Claude Code

…hdog

- Track pipelineReady separately from initialized; set only after all pipeline bus subscribers are wired (end of init()), not at entry. Fixes health().ok falsely reporting true while init() is hung.
- Add 120s startup watchdog in daemon mode: if core.init() neither resolves nor rejects within the deadline, force-exit so the gateway respawns instead of running headless for hours with scoring dead.
- Wrap ensureHubRuntimeStarted() hub.start() in a 30s timeout so a stalled hub connection can't block init() indefinitely.
- Expose pipelineReady on GET /api/v1/health and GET /api/v1/ping so monitoring can distinguish "bridge alive" from "scoring alive".
- Viewer healthStatus degrades to "degraded" when pipelineReady=false.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
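The 30s bound on hub.start() mentioned in this commit could look roughly like the helper below. It is a sketch only: `withTimeout` is a hypothetical name, and the actual wrapper in ensureHubRuntimeStarted() is not shown in this PR. Unlike the startup watchdog, this helper rejects rather than exiting, so the caller decides how to handle a stalled step.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
// Note: the underlying operation is not cancelled; only the wait is bounded.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms} ms`));
    }, ms);
  });

  return Promise.race([work, deadline]).then(
    (value) => {
      clearTimeout(timer); // work finished first; disarm the deadline
      return value;
    },
    (err: unknown) => {
      clearTimeout(timer); // either work rejected or the deadline fired
      throw err;
    },
  );
}
```

A call in the spirit of the commit would be `await withTimeout(hub.start(), 30_000, "hub.start")`, where `hub` stands in for whatever object the bridge actually uses.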

Add bridge.initWatchdogMs to the config schema (default: 120 000 ms, minimum: 30 000 ms). bridge.cts reads the value from config instead of using the hardcoded constant. Operators running large episode histories (many traces, per-step scoring) can raise the timeout to survive a long startup recovery without disabling the watchdog entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
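How the default and minimum might be applied when reading the config can be sketched as below. The constants come from the commit message; everything else is an assumption, including the function name `resolveInitWatchdogMs`, the config object shape, and the choice to reject (rather than clamp) values under the minimum, which this PR does not specify.

```typescript
const DEFAULT_INIT_WATCHDOG_MS = 120_000; // default stated in the commit message
const MIN_INIT_WATCHDOG_MS = 30_000;      // minimum stated in the commit message

// Hypothetical config shape: only the one key this PR introduces.
interface BridgeConfigLike {
  bridge?: { initWatchdogMs?: unknown };
}

// Return the effective watchdog timeout in milliseconds.
// Assumption: invalid or too-small values are rejected instead of silently clamped.
function resolveInitWatchdogMs(config: BridgeConfigLike): number {
  const raw = config.bridge?.initWatchdogMs;
  if (raw === undefined) {
    return DEFAULT_INIT_WATCHDOG_MS;
  }
  if (typeof raw !== "number" || !Number.isFinite(raw)) {
    throw new Error("bridge.initWatchdogMs must be a finite number of milliseconds");
  }
  if (raw < MIN_INIT_WATCHDOG_MS) {
    throw new Error(`bridge.initWatchdogMs must be at least ${MIN_INIT_WATCHDOG_MS}`);
  }
  return raw;
}
```

For the large-database case in the operator note, `resolveInitWatchdogMs({ bridge: { initWatchdogMs: 300_000 } })` would yield a five-minute window.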

docs(memos-local-plugin): clarify install path and stale dir names (MemTensor#1540)

The README's 'Quick start' section told users to use install.sh instead of npm install, but the warning was buried and users still tried 'npm install -g @memtensor/memos-local-plugin' first. The reporter in MemTensor#1540 encountered this on a Hermes deployment.

This change:
- Promotes the 'do not run npm install -g' notice to a prominent IMPORTANT callout explaining why global install is wrong (no agent-home deploy, no config.yaml, no bridge/viewer) and that the tarball intentionally ships built artifacts only.
- Adds a Troubleshooting subsection covering the two specific symptoms in the bug report: the 'package not found' misread, and the stale web/ and site/ directory names (web/ is now viewer/, site/ was removed by commit 26e7e3d).
- Mentions install.ps1 for Windows alongside install.sh.
- CHANGELOG: record the docs fix and reference MemTensor#1540.

Documentation-only change; no code or runtime behavior touched.

Co-authored-by: MemOS AutoDev <autodev@memtensor.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…_() got an unexpected keyword a (MemTensor#1889)

fix: remove invalid chunker parameter from SystemParser test instantiation

- SystemParser.__init__() signature changed to (embedder, llm=None)
- Test was still passing chunker=None causing TypeError
- Fixes all 5 failing tests in test_system_parser.py

Fixes MemTensor#1888

Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…tributeError when given None (MemTensor#1884)

* test: add comprehensive tests for clean_json_response (issue MemTensor#1525)
  - Add test suite in tests/mem_os/test_format_utils.py
  - Cover None input ValueError with diagnostic message
  - Cover markdown removal, whitespace stripping, edge cases
  - Verify fix for AttributeError when LLM returns None
* style: format clean_json_response tests

---------

Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…date_cube_access — fails for ev (MemTensor#1903)

fix: validate current user not target in share_cube_with_user (MemTensor#1901)

share_cube_with_user(cube_id, target_user_id) called _validate_cube_access(cube_id, target_user_id), but the validator signature is (user_id, cube_id). The cube_id therefore landed in the user_id slot and _validate_user_exists raised "User '<cube_id>' does not exist or is inactive" for every well-formed call, making the API unusable.

The in-code comment "Validate current user has access to this cube" already documented the correct intent: the sharing user (self.user_id) must have access to the cube being shared, not the target. Switch the call to self._validate_cube_access(self.user_id, cube_id). The target user's existence is independently checked on the next line via validate_user(target_user_id), so that path is unchanged.

Add regression tests in tests/mem_os/test_memos_core.py that pin down:
- validate_user_cube_access is consulted with (self.user_id, cube_id),
- add_user_to_cube is called with (target_user_id, cube_id) on success,
- a missing target raises "Target user '<id>' does not exist".

Closes MemTensor#1901

Co-authored-by: MemOS AutoDev Bot <autodev@memtensor.local>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

Memtensor-AI · 2026-07-02T11:14:41Z

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully.

Run: tr-cbfca4de-0dd on cloud test-engine 10011
memos_local_plugin/unit: 136 passed, 0 failed, 0 skipped

Manual code review is still required before merge.

chiefmojo and others added 2 commits May 31, 2026 18:30

Memtensor-AI changed the base branch from main to dev-20260604-v2.0.19 June 10, 2026 15:41

Memtensor-AI and others added 5 commits June 14, 2026 17:24

Merge branch 'dev-20260604-v2.0.19' into pr/bridge-init-watchdog

7e2a48d

Memtensor-AI changed the base branch from dev-20260604-v2.0.19 to dev-v2.0.22 July 1, 2026 13:16

CarltonXiang deleted the branch MemTensor:main July 3, 2026 07:25

CarltonXiang closed this Jul 3, 2026

syzsunshine219 reopened this Jul 3, 2026

syzsunshine219 added the needs-audit Requires manual audit before merge label Jul 3, 2026

syzsunshine219 changed the base branch from dev-v2.0.22 to main July 3, 2026 08:22

chiefmojo wants to merge 7 commits into MemTensor:main from chiefmojo:pr/bridge-init-watchdog
