
[BUG] Concurrent memory searches fail with deadlock #2135

Description

@Mayoengin

📋 Prerequisites

  • I have searched the existing issues to avoid creating a duplicate
  • By submitting this issue, you agree to follow our Code of Conduct
  • I am using the latest version of the software
  • I have tried to clear cache/cookies or used incognito mode (if ui-related)
  • I can consistently reproduce this issue

🎯 Affected Service(s)

Controller Service

🚦 Impact/Severity

Blocker

🐛 Bug Description

Agents with spec.memory configured intermittently fail memory searches with HTTP 500, and turns are slow because the client retries with backoff. The controller returns:

500 search failed: failed to increment access count: ERROR: deadlock detected (SQLSTATE 40P01)

The root cause is a non-deterministic row-lock order in IncrementMemoryAccessCount. It is triggered whenever multiple memory searches run concurrently over an overlapping set of rows, which PrefetchMemoryTool causes on every first turn by issuing one search per sentence. When the deadlock hits, the search returns 500 and the agent loses memory recall for that turn.

🔄 Steps To Reproduce

  1. Create an Agent with spec.memory set (an embedding ModelConfig) and store a handful of memories for it.
  2. Send a multi-sentence first message so PrefetchMemoryTool fans out several concurrent search_memory calls over the same top rows.
  3. Make sure at least two of these searches run concurrently and return overlapping rows (a multi-sentence prompt is enough; two turns at once also work).
  4. Observe SQLSTATE 40P01 (deadlock detected) in the controller logs and HTTP 500 on /api/memories/search; the client retries with backoff and often ends with "No memories found".

🤔 Expected Behavior

Concurrent memory searches update access_count without deadlocking. Memory search always returns its results (a best-effort access-count bookkeeping failure should never fail the search itself).

📱 Actual Behavior

Concurrent searches deadlock in Postgres. Root cause:

For each search, SearchAgentMemory runs the following statement as its own autocommit transaction:

UPDATE memory SET access_count = access_count + 1 WHERE id = ANY($1::text[]);

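Postgres acquires the row locks in scan order, which can differ between two concurrent statements over an overlapping id set. For example, one statement can lock row A and wait for row B while the other already holds B and waits for A. This circular wait raises SQLSTATE 40P01: Postgres aborts one statement as the deadlock victim, and that search returns 500. Controller log (repeats, with client backoff 1.16s → 2.16s → 3.17s → 4.20s → 5.20s):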

{"level":"info","logger":"http-helpers","msg":"Responding with error","statusCode":500,"message":"search failed: failed to increment access count: ERROR: deadlock detected (SQLSTATE 40P01)"}

Compounding factors:

  • PrefetchMemoryTool searches memory once per sentence in parallel, which manufactures the concurrent overlapping searches.
  • The task store re-POSTs the same task_id on every streaming event, amplifying write contention on the same Postgres.

💻 Environment

  • OS and version: N/A (server-side controller + Python runtime, Linux containers)
  • Kubernetes version: 1.34 (AWS EKS)
  • Kubernetes provider: AWS EKS
  • Browser (if applicable): N/A
  • Application version: kagent controller from current main (reproduced on our v1.5.0 build; the code path is unchanged on main)
  • Database: external Postgres (reproducible on any Postgres provider — the deadlock is in the SQL, not the provider)

🔧 CLI Bug Report

N/A — this is a server-side controller/agent-runtime bug, not a CLI issue, so kagent bug-report does not apply.

🔍 Additional Context

📋 Logs

Controller (repeats rapidly during a burst of concurrent memory searches):

{"level":"info","ts":"2026-06-30T13:54:56Z","logger":"http-helpers","msg":"Responding with error","statusCode":500,"message":"search failed: failed to increment access count: ERROR: deadlock detected (SQLSTATE 40P01)"}
{"level":"info","ts":"2026-06-30T13:54:57Z","logger":"http-helpers","msg":"Responding with error","statusCode":500,"message":"search failed: failed to increment access count: ERROR: deadlock detected (SQLSTATE 40P01)"}
... (12 in ~12s in our capture)

Agent-runtime side (client retries with linear backoff, then gives up):
  retry in 1.16s → 2.16s → 3.17s → 4.20s → 5.20s
  WARNING  No memories found

📷 Screenshots

No response

🙋 Are you willing to contribute?

  • I am willing to submit a PR to fix this issue
