Skip to content

fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth#1817

Open
chiefmojo wants to merge 9 commits into
MemTensor:mainfrom
chiefmojo:pr/may27-fixes
Open

fix: bridge startup race, dirty-reward orphan traces, RPC loopback auth#1817
chiefmojo wants to merge 9 commits into
MemTensor:mainfrom
chiefmojo:pr/may27-fixes

Conversation

@chiefmojo

Copy link
Copy Markdown

Summary

Three targeted fixes from companion infrastructure stabilization:

1. Skip orphan trace insert during dirty-reward recovery (fc7252fc)

When dirtyFlag triggers a rescore on episodes with no matching turns, the capture path inserts a trace with r_reward=0 but no valid idParent. This creates orphan rows that inflate the database and break FTS indexing. This fix checks for the orphan condition and skips the trace insert entirely, logging a warning instead.

2. Bridge startup race (ef311ebb)

Hermes gateway restarts can race with the MemOS daemon warm-up (~30s). The default 15s poll deadline and 30s session.open timeout cause cascading failures when the bridge isn't ready. This extends the deadline to 45s, adds exponential backoff health probes, and ups the session timeout to 60s. Double-spawn prevention via PID file singleton guard ensures only one bridge process survives a race.

3. RPC loopback auth (a459a0a5)

The /api/v1/rpc session exemption was overly broad — any caller could bypass auth. This restricts the exemption to loopback callers only (127.0.0.1 / ::1).

Testing

All three fixes have been running on three companion instances (neuromancer, wintermute, case) for 48+ hours with zero bridge failures, no new orphan traces, and clean RPC auth.

@Memtensor-AI Memtensor-AI added the plugin Plugin/adapter/bridge layer (apps/ directory) | 插件/适配层 label May 28, 2026

@Memtensor-AI Memtensor-AI left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me review this PR carefully.

Needs changes (minor) — solid fixes overall, a few issues worth addressing

The three fixes are well-motivated and the 48h soak test is reassuring. A few things caught my eye:

1. __init__.py — Duplicated connection logic begs for extraction

The HTTP bridge connect-and-register pattern is copy-pasted 3 times (initial probe, late probe, reconnect). If the handshake ever changes (e.g., a second handler registration), you'll need to patch all three. Extract a helper like _try_http_bridge(session_id, timeout) -> bool.

2. __init__.pyhttp_bridge may be unbound on exception

http_bridge: MemosHttpClient | None = MemosHttpClient()

If MemosHttpClient() itself raises, http_bridge is never assigned, yet the except block calls http_bridge.close(). The type: ignore comment masks this. Safer:

http_bridge: MemosHttpClient | None = None
http_bridge = MemosHttpClient()

Same pattern applies to http_bridge_late.

3. __init__.pytime.sleep(1.0) in initialize

A hard 1s sleep on every cold start where the viewer port is free adds latency to the common "first launch" path. Consider making this conditional on evidence of a concurrent session (e.g., PID file exists but port not yet open), or at least document why 1s was chosen.

4. daemon_manager.py — Backoff heuristic is inverted

time.sleep(0.5 if (deadline - time.time()) > 30 else 2.0)

With a 45s deadline, deadline - time.time() > 30 is true only during the first ~15s. So you poll fast (0.5s) for the first 15s, then slow (2s) for the remaining 30s. That matches the PR description — just confirming the intent is correct. One nit: if the loop body takes non-trivial time, deadline - time.time() could go negative on the last iteration; not harmful, but a max(0, ...) would be clearer.

5. auth.ts — Loopback check is good but consider X-Forwarded-For

If the server ever sits behind a local reverse proxy (nginx, Caddy), remoteAddress will always be 127.0.0.1 regardless of the true client. This is fine today since the PR description says the binding is 0.0.0.0 directly, but worth a code comment noting the assumption. If a proxy is introduced later, this becomes an auth bypass.

6. capture.ts — Coupling to meta.recoveryReason magic string

if (input.episode.meta?.recoveryReason === "dirty_reward_rescore") {

If the recovery codepath ever changes the string or omits the field, the orphan loop silently resumes. Consider exporting the constant from the recovery module and importing it here, or at minimum adding a shared RECOVERY_REASONS enum/object.

7. capture.ts — Silent skip may hide real orphans during recovery

If a genuine orphan does appear during a dirty-reward rescore (e.g., a race between capture and recovery), it will be silently dropped. The warning log is good, but consider emitting a metric or structured event so monitoring can catch unexpected counts.


What looks good

  • PID-file singleton guard and zombie cleanup are the right call for the double-spawn problem.
  • The loopback auth restriction is a meaningful security improvement with minimal complexity.
  • The dirty-reward orphan fix correctly breaks the infinite recovery loop.
  • Error handling is consistent (suppress on cleanup, warn on fallback).

Ship it after addressing items 2 and 6; the rest are suggestions.

chiefmojo pushed a commit to chiefmojo/MemOS-contrib that referenced this pull request May 28, 2026
- auth.ts: add reverse-proxy assumption comment to loopback RPC exemption
- capture.ts + memory-core.ts: extract magic string to RECOVERY_REASONS.DIRTY_REWARD_RESCORE in new recovery-constants.ts
- memory-core.test.ts: update assertions to use shared constant
- __init__.py: extract _connect_http_bridge() helper; fix unbound http_bridge on constructor exception; conditional cold-start sleep
- daemon_manager.py: guard negative sleep in backoff loop; add probe_viewer_status() + startup_lock_active() helpers
@fayenix fayenix deleted the pr/may27-fixes branch May 30, 2026 15:17
@Memtensor-AI Memtensor-AI changed the base branch from main to dev-20260604-v2.0.19 June 10, 2026 15:43
Memtensor-AI and others added 8 commits June 14, 2026 17:24
docs(memos-local-plugin): clarify install path and stale dir names (MemTensor#1540)

The README's 'Quick start' section told users to use install.sh instead
of npm install, but the warning was buried and users still tried
'npm install -g @memtensor/memos-local-plugin' first. The reporter in
MemTensor#1540 encountered this on a Hermes deployment.

This change:

- Promotes the 'do not run npm install -g' notice to a prominent
  IMPORTANT callout explaining why global install is wrong (no
  agent-home deploy, no config.yaml, no bridge/viewer) and that the
  tarball intentionally ships built artifacts only.
- Adds a Troubleshooting subsection covering the two specific symptoms
  in the bug report: the 'package not found' misread, and the stale
  web/ and site/ directory names (web/ is now viewer/, site/ was
  removed by commit 26e7e3d).
- Mentions install.ps1 for Windows alongside install.sh.
- CHANGELOG: record the docs fix and reference MemTensor#1540.

Documentation-only change; no code or runtime behavior touched.

Co-authored-by: MemOS AutoDev <autodev@memtensor.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…_() got an unexpected keyword a (MemTensor#1889)

fix: remove invalid chunker parameter from SystemParser test instantiation

- SystemParser.__init__() signature changed to (embedder, llm=None)
- Test was still passing chunker=None causing TypeError
- Fixes all 5 failing tests in test_system_parser.py

Fixes MemTensor#1888

Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…tributeError when given None (MemTensor#1884)

* test: add comprehensive tests for clean_json_response (issue MemTensor#1525)

- Add test suite in tests/mem_os/test_format_utils.py
- Cover None input ValueError with diagnostic message
- Cover markdown removal, whitespace stripping, edge cases
- Verify fix for AttributeError when LLM returns None

* style: format clean_json_response tests

---------

Co-authored-by: MemOS AutoDev <autodev@memos.ai>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
…date_cube_access — fails for ev (MemTensor#1903)

fix: validate current user not target in share_cube_with_user (MemTensor#1901)

share_cube_with_user(cube_id, target_user_id) called
_validate_cube_access(cube_id, target_user_id), but the validator
signature is (user_id, cube_id). The cube_id therefore landed in the
user_id slot and _validate_user_exists raised
"User '<cube_id>' does not exist or is inactive" for every well-formed
call, making the API unusable.

The in-code comment "Validate current user has access to this cube"
already documented the correct intent: the sharing user (self.user_id)
must have access to the cube being shared, not the target. Switch the
call to self._validate_cube_access(self.user_id, cube_id). The target
user's existence is independently checked on the next line via
validate_user(target_user_id), so that path is unchanged.

Add regression tests in tests/mem_os/test_memos_core.py that pin down:
- validate_user_cube_access is consulted with (self.user_id, cube_id),
- add_user_to_cube is called with (target_user_id, cube_id) on success,
- a missing target raises "Target user '<id>' does not exist".

Closes MemTensor#1901

Co-authored-by: MemOS AutoDev Bot <autodev@memtensor.local>
Co-authored-by: Matthew <heimixiaozhuang@zju.edu.cn>
runReflect's orphan-steps fallback was inserting new trace rows whenever
a recovered episode's snapshot timestamps didn't match existing DB rows.
This caused trace_ids_json to grow on every bridge restart, keeping
reward.traceCount != traceIds.length and looping forever — generating
48k+ empty traces and continuous OpenRouter traffic.

Guard: when meta.recoveryReason === "dirty_reward_rescore", log and skip
the insert instead of writing new rows. Legitimate orphan handling (test
paths, genuinely dropped events) is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
daemon_manager.py: extend ensure_viewer_daemon health-poll deadline
from 15s to 45s to cover cold Node.js starts (tsx compile + SQLite
open + FTS warmup). After the first 15s, back off probes from 0.5s
to 2s to avoid hammering a slow-starting daemon.

__init__.py: increase _open_session timeout in init from 30s to 60s
so session.open doesn't time out before the daemon finishes starting.

Add a 1s double-spawn guard: when probe_viewer_status() returns
"free", wait 1s and re-probe before falling through to
ensure_viewer_daemon(). Eliminates the race where a second gateway
session spawns a second stdio bridge because the daemon from the
first session hasn't bound the port yet.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The RPC endpoint was unconditionally exempt from session-cookie auth
to allow the local Python adapter to connect without a browser session.
On hub-mode deployments (bindHost=0.0.0.0) this left the endpoint
reachable from the network without any auth when password protection
is enabled.

Now the exemption only applies when remoteAddress is 127.0.0.1, ::1,
or ::ffff:127.0.0.1. Standard installs (bindHost=127.0.0.1) are
unaffected; the Python adapter always connects from loopback and
continues to work. Network callers on hub deployments must hold a
valid session cookie.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- auth.ts: add reverse-proxy assumption comment to loopback RPC exemption
- capture.ts + memory-core.ts: extract magic string to RECOVERY_REASONS.DIRTY_REWARD_RESCORE in new recovery-constants.ts
- memory-core.test.ts: update assertions to use shared constant
- __init__.py: extract _connect_http_bridge() helper; fix unbound http_bridge on constructor exception; conditional cold-start sleep
- daemon_manager.py: guard negative sleep in backoff loop; add probe_viewer_status() + startup_lock_active() helpers
@Memtensor-AI Memtensor-AI changed the base branch from dev-20260604-v2.0.19 to dev-v2.0.22 July 1, 2026 13:16
@Memtensor-AI

Copy link
Copy Markdown
Collaborator

Automated Test Results: PASSED

Cloud test-engine rerun against dev-v2.0.22 completed successfully.

  • Run: tr-0feafc25-6e9 on cloud test-engine 10011
  • memos_local_plugin/changed-python-source: 3 passed, 0 failed, 0 skipped

Manual code review is still required before merge.

@CarltonXiang CarltonXiang deleted the branch MemTensor:main July 3, 2026 07:25
@syzsunshine219 syzsunshine219 reopened this Jul 3, 2026
@syzsunshine219 syzsunshine219 added the needs-audit Requires manual audit before merge label Jul 3, 2026
@syzsunshine219 syzsunshine219 changed the base branch from dev-v2.0.22 to main July 3, 2026 08:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-audit Requires manual audit before merge plugin Plugin/adapter/bridge layer (apps/ directory) | 插件/适配层

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants