fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth by chiefmojo · Pull Request #1817 · MemTensor/MemOS

chiefmojo · 2026-05-28T04:52:38Z

Summary

Three targeted fixes from companion infrastructure stabilization:

1. Skip orphan trace insert during dirty-reward recovery (`fc7252fc`)

When dirtyFlag triggers a rescore on episodes with no matching turns, the capture path inserts a trace with r_reward=0 but no valid idParent. This creates orphan rows that inflate the database and break FTS indexing. This fix checks for the orphan condition and skips the trace insert entirely, logging a warning instead.

2. Bridge startup race (`ef311ebb`)

Hermes gateway restarts can race with the MemOS daemon warm-up (~30s). The default 15s poll deadline and 30s session.open timeout cause cascading failures when the bridge isn't ready. This extends the deadline to 45s, adds exponential backoff health probes, and ups the session timeout to 60s. Double-spawn prevention via PID file singleton guard ensures only one bridge process survives a race.

3. RPC loopback auth (`a459a0a5`)

The /api/v1/rpc session exemption was overly broad — any caller could bypass auth. This restricts the exemption to loopback callers only (127.0.0.1 / ::1).

Testing

All three fixes have been running on three companion instances (neuromancer, wintermute, case) for 48+ hours with zero bridge failures, no new orphan traces, and clean RPC auth.

Memtensor-AI

Let me review this PR carefully.

Needs changes (minor) — solid fixes overall, a few issues worth addressing

The three fixes are well-motivated and the 48h soak test is reassuring. A few things caught my eye:

1. `init.py` — Duplicated connection logic begs for extraction

The HTTP bridge connect-and-register pattern is copy-pasted 3 times (initial probe, late probe, reconnect). If the handshake ever changes (e.g., a second handler registration), you'll need to patch all three. Extract a helper like _try_http_bridge(session_id, timeout) -> bool.

2. `init.py` — `http_bridge` may be unbound on exception

http_bridge: MemosHttpClient | None = MemosHttpClient()

If MemosHttpClient() itself raises, http_bridge is never assigned, yet the except block calls http_bridge.close(). The type: ignore comment masks this. Safer:

http_bridge: MemosHttpClient | None = None
http_bridge = MemosHttpClient()

Same pattern applies to http_bridge_late.

3. `init.py` — `time.sleep(1.0)` in `initialize`

A hard 1s sleep on every cold start where the viewer port is free adds latency to the common "first launch" path. Consider making this conditional on evidence of a concurrent session (e.g., PID file exists but port not yet open), or at least document why 1s was chosen.

4. `daemon_manager.py` — Backoff heuristic is inverted

time.sleep(0.5 if (deadline - time.time()) > 30 else 2.0)

With a 45s deadline, deadline - time.time() > 30 is true only during the first ~15s. So you poll fast (0.5s) for the first 15s, then slow (2s) for the remaining 30s. That matches the PR description — just confirming the intent is correct. One nit: if the loop body takes non-trivial time, deadline - time.time() could go negative on the last iteration; not harmful, but a max(0, ...) would be clearer.

5. `auth.ts` — Loopback check is good but consider `X-Forwarded-For`

If the server ever sits behind a local reverse proxy (nginx, Caddy), remoteAddress will always be 127.0.0.1 regardless of the true client. This is fine today since the PR description says the binding is 0.0.0.0 directly, but worth a code comment noting the assumption. If a proxy is introduced later, this becomes an auth bypass.

6. `capture.ts` — Coupling to `meta.recoveryReason` magic string

if (input.episode.meta?.recoveryReason === "dirty_reward_rescore") {

If the recovery codepath ever changes the string or omits the field, the orphan loop silently resumes. Consider exporting the constant from the recovery module and importing it here, or at minimum adding a shared RECOVERY_REASONS enum/object.

7. `capture.ts` — Silent skip may hide real orphans during recovery

If a genuine orphan does appear during a dirty-reward rescore (e.g., a race between capture and recovery), it will be silently dropped. The warning log is good, but consider emitting a metric or structured event so monitoring can catch unexpected counts.

What looks good

PID-file singleton guard and zombie cleanup are the right call for the double-spawn problem.
The loopback auth restriction is a meaningful security improvement with minimal complexity.
The dirty-reward orphan fix correctly breaks the infinite recovery loop.
Error handling is consistent (suppress on cleanup, warn on fallback).

Ship it after addressing items 2 and 6; the rest are suggestions.

- auth.ts: add reverse-proxy assumption comment to loopback RPC exemption - capture.ts + memory-core.ts: extract magic string to RECOVERY_REASONS.DIRTY_REWARD_RESCORE in new recovery-constants.ts - memory-core.test.ts: update assertions to use shared constant - __init__.py: extract _connect_http_bridge() helper; fix unbound http_bridge on constructor exception; conditional cold-start sleep - daemon_manager.py: guard negative sleep in backoff loop; add probe_viewer_status() + startup_lock_active() helpers

docs(memos-local-plugin): clarify install path and stale dir names (MemTensor#1540) The README's 'Quick start' section told users to use install.sh instead of npm install, but the warning was buried and users still tried 'npm install -g @memtensor/memos-local-plugin' first. The reporter in MemTensor#1540 encountered this on a Hermes deployment. This change: - Promotes the 'do not run npm install -g' notice to a prominent IMPORTANT callout explaining why global install is wrong (no agent-home deploy, no config.yaml, no bridge/viewer) and that the tarball intentionally ships built artifacts only. - Adds a Troubleshooting subsection covering the two specific symptoms in the bug report: the 'package not found' misread, and the stale web/ and site/ directory names (web/ is now viewer/, site/ was removed by commit 26e7e3d). - Mentions install.ps1 for Windows alongside install.sh. - CHANGELOG: record the docs fix and reference MemTensor#1540. Documentation-only change; no code or runtime behavior touched. Co-authored-by: MemOS AutoDev <autodev@memtensor.ai> Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…_() got an unexpected keyword a (MemTensor#1889) fix: remove invalid chunker parameter from SystemParser test instantiation - SystemParser.__init__() signature changed to (embedder, llm=None) - Test was still passing chunker=None causing TypeError - Fixes all 5 failing tests in test_system_parser.py Fixes MemTensor#1888 Co-authored-by: MemOS AutoDev <autodev@memos.ai> Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…tributeError when given None (MemTensor#1884) * test: add comprehensive tests for clean_json_response (issue MemTensor#1525) - Add test suite in tests/mem_os/test_format_utils.py - Cover None input ValueError with diagnostic message - Cover markdown removal, whitespace stripping, edge cases - Verify fix for AttributeError when LLM returns None * style: format clean_json_response tests --------- Co-authored-by: MemOS AutoDev <autodev@memos.ai> Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

…date_cube_access — fails for ev (MemTensor#1903) fix: validate current user not target in share_cube_with_user (MemTensor#1901) share_cube_with_user(cube_id, target_user_id) called _validate_cube_access(cube_id, target_user_id), but the validator signature is (user_id, cube_id). The cube_id therefore landed in the user_id slot and _validate_user_exists raised "User '<cube_id>' does not exist or is inactive" for every well-formed call, making the API unusable. The in-code comment "Validate current user has access to this cube" already documented the correct intent: the sharing user (self.user_id) must have access to the cube being shared, not the target. Switch the call to self._validate_cube_access(self.user_id, cube_id). The target user's existence is independently checked on the next line via validate_user(target_user_id), so that path is unchanged. Add regression tests in tests/mem_os/test_memos_core.py that pin down: - validate_user_cube_access is consulted with (self.user_id, cube_id), - add_user_to_cube is called with (target_user_id, cube_id) on success, - a missing target raises "Target user '<id>' does not exist". Closes MemTensor#1901 Co-authored-by: MemOS AutoDev Bot <autodev@memtensor.local> Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>

runReflect's orphan-steps fallback was inserting new trace rows whenever a recovered episode's snapshot timestamps didn't match existing DB rows. This caused trace_ids_json to grow on every bridge restart, keeping reward.traceCount != traceIds.length and looping forever — generating 48k+ empty traces and continuous OpenRouter traffic. Guard: when meta.recoveryReason === "dirty_reward_rescore", log and skip the insert instead of writing new rows. Legitimate orphan handling (test paths, genuinely dropped events) is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

daemon_manager.py: extend ensure_viewer_daemon health-poll deadline from 15s to 45s to cover cold Node.js starts (tsx compile + SQLite open + FTS warmup). After the first 15s, back off probes from 0.5s to 2s to avoid hammering a slow-starting daemon. __init__.py: increase _open_session timeout in init from 30s to 60s so session.open doesn't time out before the daemon finishes starting. Add a 1s double-spawn guard: when probe_viewer_status() returns "free", wait 1s and re-probe before falling through to ensure_viewer_daemon(). Eliminates the race where a second gateway session spawns a second stdio bridge because the daemon from the first session hasn't bound the port yet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The RPC endpoint was unconditionally exempt from session-cookie auth to allow the local Python adapter to connect without a browser session. On hub-mode deployments (bindHost=0.0.0.0) this left the endpoint reachable from the network without any auth when password protection is enabled. Now the exemption only applies when remoteAddress is 127.0.0.1, ::1, or ::ffff:127.0.0.1. Standard installs (bindHost=127.0.0.1) are unaffected; the Python adapter always connects from loopback and continues to work. Network callers on hub deployments must hold a valid session cookie. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- auth.ts: add reverse-proxy assumption comment to loopback RPC exemption - capture.ts + memory-core.ts: extract magic string to RECOVERY_REASONS.DIRTY_REWARD_RESCORE in new recovery-constants.ts - memory-core.test.ts: update assertions to use shared constant - __init__.py: extract _connect_http_bridge() helper; fix unbound http_bridge on constructor exception; conditional cold-start sleep - daemon_manager.py: guard negative sleep in backoff loop; add probe_viewer_status() + startup_lock_active() helpers

Memtensor-AI · 2026-07-02T11:18:18Z

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully.

Run: tr-0feafc25-6e9 on cloud test-engine 10011
memos_local_plugin/changed-python-source: 3 passed, 0 failed, 0 skipped

Manual code review is still required before merge.

Memtensor-AI added the plugin Plugin/adapter/bridge layer (apps/ directory) | 插件/适配层 label May 28, 2026

Memtensor-AI reviewed May 28, 2026

View reviewed changes

chiefmojo mentioned this pull request May 28, 2026

fix(memos): address review feedback on PRs #1813 and #1817 #1830

Closed

fayenix deleted the pr/may27-fixes branch May 30, 2026 15:17

Memtensor-AI changed the base branch from main to dev-20260604-v2.0.19 June 10, 2026 15:43

Memtensor-AI and others added 8 commits June 14, 2026 17:24

Memtensor-AI changed the base branch from dev-20260604-v2.0.19 to dev-v2.0.22 July 1, 2026 13:16

CarltonXiang deleted the branch MemTensor:main July 3, 2026 07:25

CarltonXiang closed this Jul 3, 2026

syzsunshine219 reopened this Jul 3, 2026

syzsunshine219 added the needs-audit Requires manual audit before merge label Jul 3, 2026

syzsunshine219 changed the base branch from dev-v2.0.22 to main July 3, 2026 08:22

Merge branch 'main' into pr/may27-fixes

0398e0e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth#1817

fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth#1817
chiefmojo wants to merge 9 commits into
MemTensor:mainfrom
chiefmojo:pr/may27-fixes

chiefmojo commented May 28, 2026

Uh oh!

Memtensor-AI left a comment

Uh oh!

Memtensor-AI commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

chiefmojo commented May 28, 2026

Summary

1. Skip orphan trace insert during dirty-reward recovery (fc7252fc)

2. Bridge startup race (ef311ebb)

3. RPC loopback auth (a459a0a5)

Testing

Uh oh!

Memtensor-AI left a comment

Choose a reason for hiding this comment

Needs changes (minor) — solid fixes overall, a few issues worth addressing

1. __init__.py — Duplicated connection logic begs for extraction

2. __init__.py — http_bridge may be unbound on exception

3. __init__.py — time.sleep(1.0) in initialize

4. daemon_manager.py — Backoff heuristic is inverted

5. auth.ts — Loopback check is good but consider X-Forwarded-For

6. capture.ts — Coupling to meta.recoveryReason magic string

7. capture.ts — Silent skip may hide real orphans during recovery

What looks good

Uh oh!

Memtensor-AI commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

1. Skip orphan trace insert during dirty-reward recovery (`fc7252fc`)

2. Bridge startup race (`ef311ebb`)

3. RPC loopback auth (`a459a0a5`)

1. `init.py` — Duplicated connection logic begs for extraction

2. `init.py` — `http_bridge` may be unbound on exception

3. `init.py` — `time.sleep(1.0)` in `initialize`

4. `daemon_manager.py` — Backoff heuristic is inverted

5. `auth.ts` — Loopback check is good but consider `X-Forwarded-For`

6. `capture.ts` — Coupling to `meta.recoveryReason` magic string

7. `capture.ts` — Silent skip may hide real orphans during recovery