Fix #1929: Bug: /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on #2038
…sor#1929)

`GET /api/v1/embeddings/maintenance` used to paginate every trace / policy / world_model / skill row through `repos.*.list()`, which reads the full BLOB vector columns and decodes them into Float32Array on the JS heap purely so the maintenance path could inspect each vector's length. On a production deployment (~93K traces × 2 vector columns × 1536 dims × 4 bytes) that meant ≈ 1.1 GB of BLOB pread64 traffic and ~270 MB of JS heap allocations per request, all on the synchronous better-sqlite3 path. The entire OpenClaw gateway event loop was starved for 4+ minutes at 100% CPU while the stats call ran, as confirmed by strace (99.96% pread64) and the observed `eventLoopDelayMaxMs=285883` / `durationMs=292731`.

The maintenance endpoint only needs counts, not the vector bodies. This change adds `embeddingMaintenanceCounts()` to `core/storage/repos/`, which issues five `SELECT COUNT(*) + SUM(CASE WHEN ...)` queries, one per `(table, vec column)` slot, using `LENGTH(vec)` for the dimension check. SQLite's `LENGTH()` on a BLOB column returns the byte length recorded in the record header without copying the buffer, so the stats path never leaves SQLite.

The two pre-fix semantic filters (`shouldTraceHaveEmbeddings` and `isLightweightMemoryTrace`) are preserved verbatim in the WHERE clauses so per-bucket counts do not shift for already-installed users. The public `EmbeddingMaintenanceStats` JSON shape is unchanged.

- Add `core/storage/repos/embedding_maintenance.ts` with SQL-only `embeddingMaintenanceCounts()` + `inferStoredEmbeddingByteLen()` helpers.
- Re-export them (and `FLOAT32_BYTES` / `EmbeddingCounts`) from `core/storage/repos/index.ts`.
- Rewire `core/pipeline/memory-core.ts::computeEmbeddingMaintenanceStats()` to the SQL fast path; drop the dead `inferStoredEmbeddingDimension(slots)` and `emptyEmbeddingStatsByKind()` helpers.
- New `tests/unit/storage/embedding-maintenance.test.ts` (4 cases) pins the bucket semantics, lightweight-memory carveout, short-text filter, dim-mismatch detection, empty-DB safety, `expectedByteLen=0` fallback, and the mode-based byte-length inference.

The tier-2 `scanAndTopK` bounding that the reporter flagged as an "Additional Fix" is out of scope for this PR: the title and the OpenClaw event-loop-block log both point at the maintenance endpoint, and keeping the surface tight makes the fix easy to review and revert.

Verification: 4/4 new unit tests pass, 28/28 memory-core façade tests pass, and `npx vitest run` shows 1048 passing / 3 pre-existing failures (v7 e2e / namespace-visibility migrator regression / traces-count > 500) that all reproduce on the base branch after `git stash` and are unrelated to this change. `tsc -p tsconfig.json --noEmit` and `tsc -p tsconfig.build.json` are both clean.

Fixes MemTensor#1929

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
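As a sanity check on the ≈ 1.1 GB figure quoted in the commit message, the arithmetic can be reproduced in a few lines. This is only a back-of-the-envelope sketch: `estimateVectorBlobBytes` is a hypothetical helper, not part of the PR, and it treats "~93K" as exactly 93,000 trace rows while ignoring the policy / world_model / skill tables.

```typescript
// Bytes per Float32 element (the PR re-exports a constant with this name).
const FLOAT32_BYTES = 4;

// Hypothetical helper: raw payload size of all vector BLOBs in one table,
// assuming every row stores a full vector in every vector column.
function estimateVectorBlobBytes(rows: number, vecColumns: number, dims: number): number {
  return rows * vecColumns * dims * FLOAT32_BYTES;
}

// Figures from the commit message: ~93K traces, 2 vector columns, 1536 dims.
const traceBlobBytes = estimateVectorBlobBytes(93_000, 2, 1536);

console.log(traceBlobBytes);                              // 1142784000
console.log(`${(traceBlobBytes / 1e9).toFixed(2)} GB`);   // 1.14 GB
console.log(`${(traceBlobBytes / 2 ** 30).toFixed(2)} GiB`); // 1.06 GiB
```

Either unit rounds to the ≈ 1.1 GB of read traffic reported per request, which is consistent with the old path reading every vector BLOB in full.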
🤖 Open Code Review
Target: PR #2038
🔍 OpenCodeReview found 2 issue(s) in this PR.
…in addNeedsRepair
Return type was pinned to `EmbeddingMaintenanceStats["byKind"]["trace"]`,
but the helper is reused for all four bucket kinds (trace, policy,
world_model, skill). If any bucket shape ever diverged, TypeScript would
silently accept a structurally compatible but semantically wrong type on
the non-trace buckets.
Switch the return type to `EmbeddingCountsBucket & { needsRepair: number }`
— the shared bucket shape from `embedding_maintenance.ts` — which is what
the helper actually produces and does not implicitly couple to the trace
slot.
Fixes Open Code Review finding on PR MemTensor#2038.
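The typing change described in this commit can be illustrated with a small sketch. The field names `totalSlots`, `ready`, `missing`, `dimMismatch` and the return type `EmbeddingCountsBucket & { needsRepair: number }` come from the PR text; the body of `addNeedsRepair` below, in particular the rule `needsRepair = missing + dimMismatch`, is an assumption made for illustration and may differ from the real helper.

```typescript
// Shared bucket shape, as named in the PR (declared here so the sketch is self-contained).
interface EmbeddingCountsBucket {
  totalSlots: number;
  ready: number;
  missing: number;
  dimMismatch: number;
}

// The return type is expressed in terms of the shared bucket shape rather than
// one specific slot of the stats object, so it reads the same for every kind.
function addNeedsRepair(
  bucket: EmbeddingCountsBucket,
): EmbeddingCountsBucket & { needsRepair: number } {
  // Assumption: a slot needs repair when its vector is missing or has the wrong size.
  return { ...bucket, needsRepair: bucket.missing + bucket.dimMismatch };
}

// The same helper serves all four bucket kinds without referencing "trace".
const traceBucket = addNeedsRepair({ totalSlots: 10, ready: 7, missing: 2, dimMismatch: 1 });
const skillBucket = addNeedsRepair({ totalSlots: 3, ready: 3, missing: 0, dimMismatch: 0 });
console.log(traceBucket.needsRepair, skillBucket.needsRepair);
```

Typing against the shared shape means that if the trace bucket later gains a field, the other three kinds are not silently typed as if they had it too.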
🤖 Open Code Review
Target: PR #2038
🔍 OpenCodeReview found 14 issue(s) in this PR.
…nd reuse EmbeddingCountsBucket

Follow-up to PR MemTensor#2038 Open Code Review:

1. `core/storage/repos/index.ts`: the `embedding_maintenance` symbols were imported and then re-exported in two separate statements, while every other barrel entry in this file uses the direct `export { … } from "…"` form (lines 77–93). Collapse to the same shape:

       export {
         embeddingMaintenanceCounts,
         inferStoredEmbeddingByteLen,
         FLOAT32_BYTES,
       } from "./embedding_maintenance.js";
       export type { EmbeddingCounts, EmbeddingCountsBucket } from "./embedding_maintenance.js";

2. `core/pipeline/memory-core.ts`: `addNeedsRepair()` declared its `bucket` parameter as an inline literal `{ totalSlots; ready; missing; dimMismatch }` that is byte-for-byte identical to the already-imported `EmbeddingCountsBucket`. Replace the inline literal with the named type so the structural contract lives in one place.

Behaviour unchanged: pure type / re-export tidy-up.
Automated Test Results: PASSED

Cloud test-engine rerun after resolving the dev-v2.0.22 merge conflict.

Run: tr-d0d77e53-48e
Scope: memos_local_plugin
Result: 34/34 tests passed
Command group: memos_local_plugin/unit
Duration: 29s

Local pre-push verification also passed: npm run build, plus focused vitest for embedding-maintenance and memory-core.

Status: merge conflict resolved; automated scope test passed. Manual code review is still required before merge.
Description
Fixes issue #1929: `GET /api/v1/embeddings/maintenance` blocking the Node.js event loop for 4+ minutes at 100% CPU on databases with large `traces` tables. The root cause was `computeEmbeddingMaintenanceStats()` → `collectEmbeddingSlots()` paginating every trace/policy/world_model/skill row through `repos.*.list()`, which reads and decodes the BLOB vector columns (`vec_summary` / `vec_action` / `vec`) purely so the stats path could inspect each vector's length. On the reporter's DB that meant ~1.1 GB of pread64 traffic and ~270 MB of JS heap allocations per request, all on the synchronous better-sqlite3 path.

The fix introduces `embeddingMaintenanceCounts()` in `apps/memos-local-plugin/core/storage/repos/embedding_maintenance.ts`, which issues five `SELECT COUNT(*) + SUM(CASE WHEN ...)` queries, one per `(table, vec column)` slot, using `LENGTH(vec)` for the dimension check. SQLite's `LENGTH()` returns the BLOB byte length recorded in the record header without copying the buffer, so the stats path never leaves SQLite. `computeEmbeddingMaintenanceStats()` in `core/pipeline/memory-core.ts` is rewired to the new helper; the pre-fix semantic filters (the `shouldTraceHaveEmbeddings` short-text skip and the `isLightweightMemoryTrace` action-vec carveout) are preserved verbatim in the SQL WHERE clauses so per-bucket counts do not shift for already-installed users. The public `EmbeddingMaintenanceStats` JSON shape is unchanged: the HTTP route, JSON-RPC bridge, viewer, and existing tests see the same response.

The tier-2 `scanAndTopK` bounding that the reporter flagged as an "Additional Fix" is intentionally out of scope for this PR; the title and the OpenClaw event-loop-block log both point at the maintenance endpoint, and keeping the surface tight makes the fix easy to review and revert.

Verification: 4/4 new unit tests pass (`tests/unit/storage/embedding-maintenance.test.ts`), 28/28 memory-core façade tests pass (`tests/unit/pipeline/memory-core.test.ts`, including the pre-existing `repairs missing and wrong-dimension imported trace embeddings` and `does not require action vectors for lightweight memory traces` regressions). `npx vitest run` across the whole suite shows 1048 passing / 3 pre-existing failures (`e2e/v7-full-chain`, `migrator::namespace-visibility`, `traces-count::> 500`) that all reproduce on the base branch `dev-20260624-v2.0.22` after `git stash` and are unrelated to this change. `tsc -p tsconfig.json --noEmit` and `tsc -p tsconfig.build.json` both exit 0.

Related Issue (Required): Fixes #1929
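To make the "one aggregate query per `(table, vec column)` slot" idea concrete, here is a minimal sketch of how such SQL could be generated. It is not the PR's actual query text: `buildSlotCountSql`, `SLOTS`, and the table names other than `traces` are hypothetical, the definition of `ready` is an assumption, and the `shouldTraceHaveEmbeddings` / `isLightweightMemoryTrace` WHERE filters are left out because their conditions are not spelled out here.

```typescript
// Bytes per Float32 element (the PR re-exports a constant with this name).
const FLOAT32_BYTES = 4;

// One (table, vec column) slot.
interface Slot {
  table: string;
  vecColumn: string;
}

// Five slots in total. Only `traces`, `vec_summary`, `vec_action` and `vec`
// appear in the PR text; the other three table names are placeholders.
const SLOTS: readonly Slot[] = [
  { table: "traces", vecColumn: "vec_summary" },
  { table: "traces", vecColumn: "vec_action" },
  { table: "policies", vecColumn: "vec" },    // assumed table name
  { table: "world_model", vecColumn: "vec" }, // assumed table name
  { table: "skills", vecColumn: "vec" },      // assumed table name
];

// Build a single aggregate query for one slot. LENGTH() is evaluated inside
// SQLite, so only four integers per slot ever reach JavaScript.
function buildSlotCountSql(slot: Slot): string {
  const identifier = /^[A-Za-z_][A-Za-z0-9_]*$/;
  if (!identifier.test(slot.table) || !identifier.test(slot.vecColumn)) {
    throw new Error(`unsafe identifier: ${slot.table}.${slot.vecColumn}`);
  }
  const col = slot.vecColumn;
  // COALESCE guards the empty-table case, where SUM() yields NULL.
  return [
    "SELECT",
    "  COUNT(*) AS totalSlots,",
    `  COALESCE(SUM(CASE WHEN ${col} IS NOT NULL AND LENGTH(${col}) = @expectedByteLen THEN 1 ELSE 0 END), 0) AS ready,`,
    `  COALESCE(SUM(CASE WHEN ${col} IS NULL THEN 1 ELSE 0 END), 0) AS missing,`,
    `  COALESCE(SUM(CASE WHEN ${col} IS NOT NULL AND LENGTH(${col}) <> @expectedByteLen THEN 1 ELSE 0 END), 0) AS dimMismatch`,
    `FROM ${slot.table}`,
  ].join("\n");
}

// Expected BLOB size for a given embedding dimension, e.g. 1536 dims -> 6144 bytes.
function expectedByteLenFor(dims: number): number {
  return dims * FLOAT32_BYTES;
}

const queries = SLOTS.map(buildSlotCountSql);
console.log(queries[0]);
console.log(expectedByteLenFor(1536)); // 6144
```

With better-sqlite3, a statement like this would typically be executed as `db.prepare(sql).get({ expectedByteLen: expectedByteLenFor(dims) })`, which returns one small row of integers per slot instead of the vectors themselves.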
How Has This Been Tested?
Automated tests are pending.
@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.