Skip to content

fix(streams): cut SSE error-log volume and add Redis pool headroom#340

Merged
danielmillerp merged 1 commit into
mainfrom
dm/sse-redis-pool-headroom-log-backoff
Jun 25, 2026
Merged

fix(streams): cut SSE error-log volume and add Redis pool headroom#340
danielmillerp merged 1 commit into
mainfrom
dm/sse-redis-pool-headroom-log-backoff

Conversation

@danielmillerp

@danielmillerp danielmillerp commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

What

Short-term mitigations for unbounded log volume produced by the task-event SSE endpoint when concurrent streams exceed the Redis connection pool size.

Background

The SSE streaming path (StreamsUseCase.stream_task_events) holds one blocking XREAD connection per connected client. On a high enough number of concurrent streams the pool is exhausted, so every client's read fails on each cycle. Two things turned that into a log firehose:

  1. The ~20× multiplier was a log-format problem, not a code problem. The error handler already used exc_info=True (it always has). Under a plain-text formatter, exc_info=True renders the traceback as ~20 physical lines, and the cluster log collector ingests one entry per line — so every error became ~20 entries. "Switch to exc_info=True" would have been a no-op; the real cause was the formatter.
  2. The handler retried with a flat sleep(1), so failures repeated ~once/sec per client.

Changes

utils/logging.py — structured JSON logs by default (this removes the 20×)

  • Default to the (already-existing) CustomJSONFormatter whenever ENVIRONMENT != "development". With JSON, the same always-on exc_info=True traceback is captured as a single log entry — the newlines live inside the quoted exc_info field — instead of one entry per line. This is what eliminates the ~20× multiplier; exc_info itself is unchanged.
  • The JSON formatter already existed but was gated behind Datadog configuration, so non-Datadog clusters (e.g. plain-text Azure) silently fell back to per-line tracebacks. Local development still uses plain text for readable console output.
  • Verified: a full traceback emits 1 physical line under JSON vs ~20 under plain text; env matrix confirmed (development → plain text, production/unset → JSON).

streams_use_case.py — throttle the error loop

  • Replace the flat 1s retry with capped exponential backoff (1→2→4→8→16→30s), reset on a healthy read, so a tight per-client loop can't hammer Redis or flood logs.
  • Full tracebacks are still logged on every failure (nothing swallowed), with a failure counter for context. Volume is bounded by the backoff and by the single-entry JSON logging above.

environment_variables.py — pool headroom

  • Bump in-code REDIS_MAX_CONNECTIONS default 50 → 200.

What this is NOT

A mitigation, not a root-cause fix. Connections still scale 1:1 with clients, so a large enough concurrent-stream count will still exhaust the pool — these changes keep that from becoming a log flood and raise the threshold. The durable fix (tracked separately) is a shared per-pod reader that fans out to in-process queues, so connection count becomes O(distinct streams) instead of O(clients). OTel duplicate-handler de-dup is also a separate follow-up.

Testing

  • ruff check on all changed files — passes.
  • Verified JSON formatter emits a full traceback as 1 physical line; env matrix confirmed.
  • Backoff sequence verified (1, 2, 4, 8, 16, 30, 30… cap).
  • No unit test covers the streaming path; the integration test requires testcontainers. The streaming integration tests (00-sync-020-streaming, 10-async-00-base-020-streaming) pass in CI.

🤖 Generated with Claude Code

Greptile Summary

This PR reduces SSE error-log volume and raises Redis pool headroom. The main changes are:

  • Production and unset environments now use JSON logging by default.
  • JSON log fields are capped to limit oversized structured entries.
  • SSE stream read failures now use capped exponential backoff.
  • The default Redis connection pool size is raised to 200.

Confidence Score: 3/5

The logging mitigation is useful, but request payload fields still need attention before merging because structured JSON output can retain small request data.

The changed files are narrow and the Redis/backoff behavior is straightforward, while the logging change leaves an important data-exposure path active for non-development environments.

agentex/src/utils/logging.py

T-Rex T-Rex Logs

What T-Rex did

  • The focused runtime harness was executed to reproduce logging behavior using the real log_request path and CustomJSONFormatter with ENVIRONMENT unset for JSON logging.
  • Tests showed plain-text logging emitted multi-line output for unset/production/dev, and after enabling CustomJSONFormatter for unset and production, head logs emitted JSON lines, while development head stayed plain text.
  • The SSE error backoff and logging were exercised: the before artifact showed flat sleeps and eight SSE error events with exc_info on every error, and the after artifact showed capped exponential sleeps after a healthy read, with the eighth error indicating a reset and all logger calls still using exc_info.
  • Defaults and env-loading behavior were compared: before, base class/direct unset default was 50, env-loading unset default 100, and refresh override 123; after, head class/direct unset default and refresh/env-loading unset default both became 200, with the refresh override still 123, and direct construction with an env var set continues to return the class default since os.environ is only read by the env-loading path via refresh.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (2)

  1. agentex/src/config/environment_variables.py, line 172-174 (link)

    P1 Default is bypassed

    The new REDIS_MAX_CONNECTIONS = 200 class default is not used by the normal app startup path. GlobalDependencies calls EnvironmentVariables.refresh(), and this constructor argument still falls back to "100" when the env var is absent. A deployment without an explicit REDIS_MAX_CONNECTIONS still initializes the Redis pool with 100 connections, so the intended pool-headroom mitigation does not apply.

    Artifacts

    Repro: focused EnvironmentVariables.refresh harness

    • Contains supporting evidence from the run (text/x-python; charset=utf-8).

    Repro: refresh output showing effective Redis max connections is 100

    • Keeps the command output available without making the summary code-heavy.

    View artifacts

    T-Rex Ran code and verified through T-Rex

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: agentex/src/config/environment_variables.py
    Line: 172-174
    
    Comment:
    **Default is bypassed**
    
    The new `REDIS_MAX_CONNECTIONS = 200` class default is not used by the normal app startup path. `GlobalDependencies` calls `EnvironmentVariables.refresh()`, and this constructor argument still falls back to `"100"` when the env var is absent. A deployment without an explicit `REDIS_MAX_CONNECTIONS` still initializes the Redis pool with 100 connections, so the intended pool-headroom mitigation does not apply.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

  2. General comment

    P1 Repeated SSE stream read failures still emit full tracebacks

    • Bug
    • Cause
      • In agentex/src/domain/use_cases/streams_use_case.py, the stream error handler always passes exc_info=True to logger.error for every consecutive failure rather than conditioning traceback logging on the first failure after a healthy read.
    • Fix
      • Change the error handler to pass exc_info=True only when consecutive_errors == 1; for subsequent consecutive failures, log a compact message with the failure/repeat count and exc_info=False or omit exc_info. Keep the existing reset of consecutive_errors = 0 after a successful read cycle.

    T-Rex Ran code and verified through T-Rex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
agentex/src/utils/logging.py:165-168
**Small bodies still log**

This only caps oversized structured fields; it does not stop JSON logging from serializing request `extra` fields. `LoggedAPIRoute.log_request` still passes `body`, `headers`, and `query_params`, and `strip_sensitive_items` only removes blacklisted keys. In non-development environments, any non-blacklisted request body under 4096 characters is still emitted into production logs, which can expose request payload data and keeps per-request body logging enabled. Please drop or allowlist these structured request fields before JSON serialization rather than only truncating large values.

Reviews (5): Last reviewed commit: "fix(streams): cut SSE error-log volume a..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

Context used:

  • Rule used - What: Never log full response bodies, request bodi... (source)

@danielmillerp danielmillerp requested a review from a team as a code owner June 25, 2026 14:34
@danielmillerp danielmillerp force-pushed the dm/sse-redis-pool-headroom-log-backoff branch from a39f825 to c279c97 Compare June 25, 2026 14:57
@danielmillerp danielmillerp changed the title fix(streams): throttle SSE error loop and add Redis pool headroom fix(streams): cut SSE error-log volume and add Redis pool headroom Jun 25, 2026
@danielmillerp danielmillerp force-pushed the dm/sse-redis-pool-headroom-log-backoff branch 2 times, most recently from 45b91ff to 8e925c4 Compare June 25, 2026 15:08
@danielmillerp

Copy link
Copy Markdown
Collaborator Author

Addressing the two Greptile P1s:

P1 — "Default is bypassed" (environment_variables.py): fixed. Good catch. GlobalDependencies builds config via EnvironmentVariables.refresh(), which read os.environ.get(REDIS_MAX_CONNECTIONS, "100") — so the class-attribute default never applied on the real startup path and an env-var-less deployment would still get 100. Updated the refresh() fallback to "200" to match the class default. Verified: refresh(force_refresh=True) now returns 200 with no env var set, and an explicit REDIS_MAX_CONNECTIONS still overrides.

P1 — "Repeated failures still emit full tracebacks": intended, not a regression. This flags the PR against an earlier version of its own description. The design was deliberately changed: we now log the full traceback on every failure (nothing swallowed) rather than first-failure-only. The ~20× volume problem is solved structurally by the JSON logging change (each traceback is a single log entry, regardless of how many failures), and the repeat rate is bounded by the exponential backoff. So full diagnostics are retained without re-introducing the multiplier. The PR description has been updated to match. Leaving the exc_info=True on every failure intentionally.

Comment thread agentex/src/utils/logging.py
@danielmillerp danielmillerp force-pushed the dm/sse-redis-pool-headroom-log-backoff branch from 8e925c4 to e3718ba Compare June 25, 2026 15:26
@danielmillerp

Copy link
Copy Markdown
Collaborator Author

Capacity sanity-check for the REDIS_MAX_CONNECTIONS 50→200 bump (via Datadog, SGP clusters that report):

  • Redis server cap (redis.net.maxclients): 10,000 on every reporting SGP cluster (self-hosted in-cluster Redis, default cap).
  • Peak connected clients (redis.net.clients / custom redis.connections.current): ~90–143 total — roughly 1% utilization.
  • Rejections (redis.connections.rejected): 0 historically.

So pods × 200 would need to exceed 10,000 (50+ pods all maxed at once) to be a problem — not realistic. 200/pod is safe with large margin on the standard self-hosted-Redis setup. (Side note: observed peak of 143 is already above the old 50/100 defaults, so the bump is evidence-backed.)

Caveat: this assumes the deployment uses the standard in-cluster Redis (maxclients 10000). A managed Redis on a small tier with a low per-tier connection cap would be the only place to double-check before relying on 200 there.

The task-event SSE endpoint holds one blocking XREAD connection per
connected client. When concurrent streams exceed the Redis connection
pool size, every client's read fails on each cycle. Combined with
plain-text logging that splits each traceback into one log entry per
line, and a flat 1s retry, a sustained pool stall becomes a
log-ingestion firehose.

Changes:

- Streaming error path: replace the flat 1s retry with capped
  exponential backoff (1->2->4->8->16->30s), reset on a healthy read,
  so a tight per-client loop can't hammer Redis or flood logs. Full
  tracebacks are still logged on every failure (nothing swallowed) with
  a failure counter for context; volume is bounded by the backoff and by
  single-entry JSON logging below.

- Default to structured JSON logs in all deployed environments
  (ENVIRONMENT != "development"). JSON keeps a multi-line traceback as a
  single log entry (newlines live inside the quoted exc_info field)
  instead of one cluster-log entry per line. Local development keeps
  plain text for readable console output. The JSON formatter already
  existed but was gated behind Datadog configuration.

- Bound per-field size in the JSON formatter. The JSON formatter
  serializes `extra` (the plain formatter dropped it), and request
  logging attaches body/headers/query_params as `extra`, so an unbounded
  payload could create large per-request entries. Cap each field at 4KB
  (exempting exc_info) so large payloads can't reintroduce volume.

- Bump the in-code REDIS_MAX_CONNECTIONS default (50 -> 200) and the
  refresh() fallback (the real startup path) for peak-concurrency
  headroom. This is a mitigation, not a fix: connections still scale 1:1
  with clients. The durable fix is a shared per-pod reader that fans out
  to in-process queues. Deployed environments set the real cap via the
  REDIS_MAX_CONNECTIONS env var.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@danielmillerp danielmillerp force-pushed the dm/sse-redis-pool-headroom-log-backoff branch from e3718ba to e9e9e94 Compare June 25, 2026 15:42
@danielmillerp

Copy link
Copy Markdown
Collaborator Author

Greptile P1 "JSON emits request bodies" — addressed (volume), with context on exposure.

You're right that the JSON formatter serializes extra while the plain formatter dropped it, so LoggedAPIRoute.log_request (body/headers/query_params) now reaches structured fields on non-Datadog clusters. Fixed the volume risk: CustomJSONFormatter now caps each field at 4KB (_MAX_JSON_FIELD_CHARS), exempting exc_info. A large payload is truncated with a ...[truncated N chars] marker instead of creating an oversized per-request entry. Verified: a 20KB body and 9KB headers each truncate to ~4KB, the message (which interpolates the body) is also capped, small fields and full tracebacks are untouched.

Two clarifications on the exposure angle:

  • Request bodies were already logged on these clusters today — log_request interpolates the decoded body into the message string (not just extra), so plain-text clusters already emit it. This change doesn't newly expose the body; it adds headers/query_params as fields and (now-bounded) duplicates the body.
  • All three (body, headers, query_params) pass through strip_sensitive_items (key blacklist: api_key/token/password/secret/cookie/authorization/acting-user), and messages additionally pass the value-level redact_sensitive_text filter. So sensitive-key data is stripped before logging.

Net: volume is now bounded per field; sensitive-key scrubbing already applied. Deeper structured-field redaction (value-level redaction on extra, not just keys) is worth a follow-up but is pre-existing Datadog-cluster behavior, out of scope here.

Comment on lines +165 to +168
return {
k: v if k in _UNCAPPED_JSON_FIELDS else _truncate_log_value(v)
for k, v in extra.items()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Small bodies still log

This only caps oversized structured fields; it does not stop JSON logging from serializing request extra fields. LoggedAPIRoute.log_request still passes body, headers, and query_params, and strip_sensitive_items only removes blacklisted keys. In non-development environments, any non-blacklisted request body under 4096 characters is still emitted into production logs, which can expose request payload data and keeps per-request body logging enabled. Please drop or allowlist these structured request fields before JSON serialization rather than only truncating large values.

Rule Used: What: Never log full response bodies, request bodi... (source)

Artifacts

Repro: focused runtime harness exercising log_request and JSON formatting

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: execution output showing serialized body headers and query_params in JSON log

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/utils/logging.py
Line: 165-168

Comment:
**Small bodies still log**

This only caps oversized structured fields; it does not stop JSON logging from serializing request `extra` fields. `LoggedAPIRoute.log_request` still passes `body`, `headers`, and `query_params`, and `strip_sensitive_items` only removes blacklisted keys. In non-development environments, any non-blacklisted request body under 4096 characters is still emitted into production logs, which can expose request payload data and keeps per-request body logging enabled. Please drop or allowlist these structured request fields before JSON serialization rather than only truncating large values.

**Rule Used:** What: Never log full response bodies, request bodi... ([source](https://app.greptile.com/scale-ai/-/custom-context?memory=fa8d684f-4686-4f3e-b1ef-c27453f614ea))

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — this is a fair principle, but it's pre-existing behavior rather than something this PR introduces, so handling it as a scoped follow-up:

  • log_request already logs the request body today (it interpolates the decoded body into the message string, not just extra), so request-body logging predates this change on every cluster. This PR adds a now-bounded structured copy; it doesn't turn body logging on.
  • Sensitive keys are already stripped (strip_sensitive_items: api_key/token/password/secret/cookie/authorization/acting-user), and messages also run value-level redact_sensitive_text.
  • Per-field volume is now capped at 4KB.

Fully satisfying the "never log request bodies" rule means dropping/allowlisting body+headers+query_params in log_request, which changes established request-logging behavior on all clusters (incl. Datadog) — out of scope for a log-volume mitigation. Filed as a follow-up: AGX1-405. Merging this PR (approved + green) and addressing body logging there.

@danielmillerp danielmillerp merged commit 352eaaa into main Jun 25, 2026
31 checks passed
@danielmillerp danielmillerp deleted the dm/sse-redis-pool-headroom-log-backoff branch June 25, 2026 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants