Startup blocked for ~100s by synchronous orphan episode recovery + LLM timeout cascade #1776

Description

@34262315716

Summary

The memos-local-plugin blocks OpenClaw's readiness for ~100 seconds during startup while it processes orphan episodes from the previous session. During this time the LLM provider is called repeatedly with 45-second timeouts, and many of those calls fail (HTTP 400 or timeout), which lengthens the delay further. The HTTP server does not start until all recovery work is done.


Observed Behavior

Startup timeline from the logs:

14:26:36.496 INFO  [llm] init provider="openai_compatible" ...
14:26:36.731 INFO  [core.pipeline.memory-core] init.orphan_episodes.session_end_recover count=3
14:26:40.135 WARN  [llm.openai_compatible] http.non_ok status=400   ← reflection LLM call fails
... (many 400/timeout LLM calls for ~50 seconds) ...
14:27:34.787 WARN  [llm.openai_compatible] http.exception timedOut=true  ← 45s timeout hit
... (more timeouts and 400s) ...
14:27:48.049 INFO  [core.capture] capture.reflect.done              ← 3 episodes reflected
14:27:57.962 INFO  [core.reward.r-human] score.llm rHuman=0.75     ← reward scoring done
14:27:58.212 WARN  [core.memory.l2.induce] induce.llm_failed       ← L2 induction fails
... (L2 induce failures continue) ...
14:28:16.962 INFO  [server.http] server.started url="http://127.0.0.1:18799"

Total startup delay: ~100 seconds
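
The ~100 second figure can be checked directly from the first and last log timestamps above. A minimal sketch; `parseLogTime` and `elapsedSeconds` are hypothetical helpers written for this issue, not part of the plugin, and they assume both timestamps fall on the same day:

```typescript
// Parse an "HH:MM:SS.mmm" log timestamp into milliseconds since midnight.
function parseLogTime(stamp: string): number {
  const match = /^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$/.exec(stamp);
  if (!match) throw new Error(`unrecognized timestamp: ${stamp}`);
  const [, h, m, s, ms] = match;
  return ((Number(h) * 60 + Number(m)) * 60 + Number(s)) * 1000 + Number(ms);
}

// Elapsed wall-clock time between two same-day timestamps, in seconds.
function elapsedSeconds(start: string, end: string): number {
  return (parseLogTime(end) - parseLogTime(start)) / 1000;
}

// LLM provider init -> server.started
const startupDelay = elapsedSeconds("14:26:36.496", "14:28:16.962");
// Orphan recovery kicked off -> server.started
const recoveryToServer = elapsedSeconds("14:26:36.731", "14:28:16.962");

console.log(startupDelay.toFixed(3), recoveryToServer.toFixed(3));
```

Almost all of the delay sits between the start of orphan recovery and `server.started`, which matches the root cause analysis below.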

After the server starts, the L2 induction failures continue for a while longer, but at least OpenClaw is usable at that point.


Root Cause Analysis

1. Orphan episode recovery blocks server startup

The init.orphan_episodes.session_end_recover step (3 episodes from the previous session) runs the full pipeline (reflection → reward → L2 induce) before the HTTP server starts, so the entire recovery process blocks gateway readiness.

2. LLM timeouts dramatically amplify the delay

Each LLM call has timeoutMs=45000 (45 seconds) with maxRetries=3. When SiliconFlow returns 400 (non-transient), the call still waits for the full timeout before failing. With 3 orphan episodes × multiple LLM calls per episode, the stalls cascade:

  • Each reflect phase makes ~3-5 LLM calls per episode
  • When a call times out (45s) or returns 400 (retries exhausted), the next call starts only after the previous one has fully failed
  • With 3 orphan episodes, this can easily accumulate to 60-100 seconds
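
The cascade can be bounded with some back-of-the-envelope arithmetic. The sketch below is illustrative only: `RecoveryLoad`, `worstCaseStallMs`, and `stallFromTimeoutsMs` are hypothetical names, and it assumes that calls run strictly one after another and that `maxRetries=3` means up to three retries after the first attempt, which may not match the plugin's actual retry semantics.

```typescript
// Hypothetical description of the recovery workload (not a plugin type).
interface RecoveryLoad {
  episodes: number;
  llmCallsPerEpisode: number;
  timeoutMs: number;
  maxRetries: number;
}

// Upper bound on blocking time if every attempt of every call hit the timeout.
// Assumption: one initial attempt plus `maxRetries` retries per call.
function worstCaseStallMs(load: RecoveryLoad): number {
  const attemptsPerCall = 1 + load.maxRetries;
  return load.episodes * load.llmCallsPerEpisode * attemptsPerCall * load.timeoutMs;
}

// Blocking time contributed by a given number of attempts that ran to timeout.
function stallFromTimeoutsMs(timedOutAttempts: number, timeoutMs: number): number {
  return timedOutAttempts * timeoutMs;
}

// Values from this report, using the low end of "~3-5 LLM calls per episode".
const reported: RecoveryLoad = {
  episodes: 3,
  llmCallsPerEpisode: 3,
  timeoutMs: 45_000,
  maxRetries: 3,
};

console.log(worstCaseStallMs(reported) / 1000);      // 3 * 3 * 4 * 45 = 1620 seconds
console.log(stallFromTimeoutsMs(2, 45_000) / 1000);  // 2 * 45 = 90 seconds
```

Under these assumptions, just two attempts running to the full 45s timeout already account for 90 seconds, which is in line with the observed delay, while the theoretical worst case is far larger.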

3. No deferred processing

If orphan recovery were deferred to a background task (after the server starts), OpenClaw could become usable within 1-2 seconds while memos processes old sessions in parallel.


Proposed Fix

High priority: Make orphan episode recovery non-blocking

  • Process orphan episodes in a background task instead of blocking the startup sequence
  • The HTTP server should start immediately, allowing the user to interact with OpenClaw
  • Recovery results can be streamed to the viewer as they complete
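
A rough sketch of the ordering change proposed above: start the server first, then run recovery without awaiting it. Everything here is hypothetical (`Episode`, `StartupDeps`, `startNonBlocking`); the plugin's real startup code and pipeline entry points are not shown in this issue, so the dependencies are passed in as plain functions.

```typescript
// Hypothetical shapes; the real plugin types are not shown in this issue.
type Episode = { id: string };

interface StartupDeps {
  startServer: () => Promise<string>;              // resolves with the server URL
  recoverEpisode: (ep: Episode) => Promise<void>;  // reflection -> reward -> L2 induce
  log: (msg: string) => void;
}

// Start the HTTP server first, then process orphan episodes in the background.
// The returned `recovery` promise is for observers (tests, shutdown hooks);
// readiness does not wait on it.
async function startNonBlocking(
  orphans: Episode[],
  deps: StartupDeps,
): Promise<{ url: string; recovery: Promise<void> }> {
  const url = await deps.startServer();

  const recovery = (async () => {
    for (const ep of orphans) {
      try {
        await deps.recoverEpisode(ep);
      } catch (err) {
        // A failed episode must not take down the process or stop the rest.
        deps.log(`orphan recovery failed for ${ep.id}: ${String(err)}`);
      }
    }
  })();

  return { url, recovery };
}
```

Episodes are still processed one at a time here, so the LLM provider sees the same load as before; only the point at which the server becomes reachable changes. Catching errors inside the background task matters because nothing awaits it on the startup path, and an unhandled rejection would otherwise surface at the process level.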

Medium priority: Add a grace period / timeout cap for startup

  • During startup, cap per-LLM-call timeout to a shorter value (e.g., 10s) since these are non-critical recovery tasks
  • Normal LLM calls after startup continue to use the configured timeoutMs
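
One way to express the cap, as a sketch: clamp the configured timeout only while the plugin is still in its startup phase. `Phase`, `STARTUP_TIMEOUT_CAP_MS`, and `effectiveTimeoutMs` are hypothetical names, and the 10s default is simply the example value suggested above.

```typescript
// Hypothetical lifecycle phases for the plugin.
type Phase = "startup" | "running";

// Example cap from this proposal; not an existing config option.
const STARTUP_TIMEOUT_CAP_MS = 10_000;

// During startup, never wait longer than the cap; afterwards, honor the
// configured timeoutMs unchanged.
function effectiveTimeoutMs(
  configuredMs: number,
  phase: Phase,
  capMs: number = STARTUP_TIMEOUT_CAP_MS,
): number {
  return phase === "startup" ? Math.min(configuredMs, capMs) : configuredMs;
}

console.log(effectiveTimeoutMs(45_000, "startup")); // 10000
console.log(effectiveTimeoutMs(45_000, "running")); // 45000
```

Using `Math.min` rather than a fixed override means a user who has already configured a timeout shorter than the cap keeps that shorter value during startup.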

Low priority: Feedback signal during startup

  • Show a "processing N orphan episodes..." message so the user knows what's happening
  • Log approximate progress (e.g., "episode 2/3 reflect done")
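
The two messages above could be produced by small formatters like the following. This is only a sketch: `Stage`, `recoveryBanner`, and `progressLine` are hypothetical, and the stage names are taken loosely from the pipeline described in this issue (reflection → reward → L2 induce).

```typescript
// Hypothetical stage names, mirroring the pipeline order in this issue.
type Stage = "reflect" | "reward" | "induce";

// One-time message shown when recovery begins.
function recoveryBanner(total: number): string {
  return `processing ${total} orphan episode${total === 1 ? "" : "s"}...`;
}

// Per-stage progress line; `index` is 1-based.
function progressLine(index: number, total: number, stage: Stage): string {
  return `episode ${index}/${total} ${stage} done`;
}

console.log(recoveryBanner(3));               // processing 3 orphan episodes...
console.log(progressLine(2, 3, "reflect"));   // episode 2/3 reflect done
```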

Environment

  • Plugin version: v2.0.15 (commit e0ef84d)
  • LLM timeoutMs: 45000
  • maxRetries: 3
  • Orphan episodes: 3
  • Host: OpenClaw v2 (Linux x64, Node.js v24.14.1)

Metadata

Labels

  • ai-pr-ready: AutoDev tests passed and PR is ready for human review/merge.
  • bug: Something isn't working
  • plugin: Plugin/adapter/bridge layer (apps/ directory)
