Startup blocked for ~100s by synchronous orphan episode recovery + LLM timeout cascade

## Summary

The memos-local-plugin blocks OpenClaw's readiness for ~100 seconds during startup while processing orphan episodes from the previous session. During this time, the LLM provider is called repeatedly with 45-second timeouts — and many of those calls fail (400/timeout), making the delay even worse. The HTTP server does not start until all recovery work is done.

---

## Observed Behavior

Startup timeline from the logs:

```
14:26:36.496 INFO  [llm] init provider="openai_compatible" ...
14:26:36.731 INFO  [core.pipeline.memory-core] init.orphan_episodes.session_end_recover count=3
14:26:40.135 WARN  [llm.openai_compatible] http.non_ok status=400   ← reflection LLM call fails
... (many 400/timeout LLM calls for ~50 seconds) ...
14:27:34.787 WARN  [llm.openai_compatible] http.exception timedOut=true  ← 45s timeout hit
... (more timeouts and 400s) ...
14:27:48.049 INFO  [core.capture] capture.reflect.done              ← 3 episodes reflected
14:27:57.962 INFO  [core.reward.r-human] score.llm rHuman=0.75     ← reward scoring done
14:27:58.212 WARN  [core.memory.l2.induce] induce.llm_failed       ← L2 induction fails
... (L2 induce failures continue) ...
14:28:16.962 INFO  [server.http] server.started url="http://127.0.0.1:18799"

Total startup delay: ~100 seconds
```

After the server starts, the L2 induction failures continue for a while longer, but at least OpenClaw is usable at that point.

---

## Root Cause Analysis

### 1. Orphan episode recovery blocks server startup

The `init.orphan_episodes.session_end_recover` (3 episodes from previous session) triggers the full pipeline (reflection → reward → L2 induce) **before** the HTTP server is started. This means the entire recovery process blocks gateway readiness.

### 2. LLM timeouts dramatically amplify the delay

Each LLM call has `timeoutMs=45000` (45 seconds) with `maxRetries=3`. When SiliconFlow returns 400 (non-transient), the call still waits for the full timeout before failing. With 3 orphan episodes × multiple LLM calls per episode, the stalls cascade:

- Each reflect phase makes ~3-5 LLM calls per episode
- When a call times out (45s) or gets 400 (retry exhausted), the next call only starts after the previous one fully fails
- With 3 orphan episodes, this can easily accumulate to 60-100 seconds

### 3. No deferred processing

If orphan recovery were deferred to a background task (after the server starts), OpenClaw could become usable within 1-2 seconds while memos processes old sessions in parallel.

---

## Proposed Fix

### High priority: Make orphan episode recovery non-blocking

- Process orphan episodes in a background task instead of blocking the startup sequence
- The HTTP server should start immediately, allowing the user to interact with OpenClaw
- Recovery results can be streamed to the viewer as they complete

### Medium priority: Add a grace period / timeout cap for startup

- During startup, cap per-LLM-call timeout to a shorter value (e.g., 10s) since these are non-critical recovery tasks
- Normal LLM calls after startup continue to use the configured `timeoutMs`

### Low priority: Feedback signal during startup

- Show a "processing N orphan episodes..." message so the user knows what's happening
- Log approximate progress (e.g., "episode 2/3 reflect done")

---

## Environment

- **Plugin version**: v2.0.15 (commit e0ef84dd)
- **LLM timeoutMs**: 45000
- **maxRetries**: 3
- **Orphan episodes**: 3
- **Host**: OpenClaw v2 (Linux x64, Node.js v24.14.1)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Startup blocked for ~100s by synchronous orphan episode recovery + LLM timeout cascade #1776

Summary

Observed Behavior

Root Cause Analysis

1. Orphan episode recovery blocks server startup

2. LLM timeouts dramatically amplify the delay

3. No deferred processing

Proposed Fix

High priority: Make orphan episode recovery non-blocking

Medium priority: Add a grace period / timeout cap for startup

Low priority: Feedback signal during startup

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Startup blocked for ~100s by synchronous orphan episode recovery + LLM timeout cascade #1776

Description

Summary

Observed Behavior

Root Cause Analysis

1. Orphan episode recovery blocks server startup

2. LLM timeouts dramatically amplify the delay

3. No deferred processing

Proposed Fix

High priority: Make orphan episode recovery non-blocking

Medium priority: Add a grace period / timeout cap for startup

Low priority: Feedback signal during startup

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions