fix(streams): cut SSE error-log volume and add Redis pool headroom by danielmillerp · Pull Request #340 · scaleapi/scale-agentex

danielmillerp · 2026-06-25T14:34:37Z

What

Short-term mitigations for unbounded log volume produced by the task-event SSE endpoint when concurrent streams exceed the Redis connection pool size.

Background

The SSE streaming path (StreamsUseCase.stream_task_events) holds one blocking XREAD connection per connected client. On a high enough number of concurrent streams the pool is exhausted, so every client's read fails on each cycle. Two things turned that into a log firehose:

The ~20× multiplier was a log-format problem, not a code problem. The error handler already used exc_info=True (it always has). Under a plain-text formatter, exc_info=True renders the traceback as ~20 physical lines, and the cluster log collector ingests one entry per line — so every error became ~20 entries. "Switch to exc_info=True" would have been a no-op; the real cause was the formatter.
The handler retried with a flat sleep(1), so failures repeated ~once/sec per client.

Changes

utils/logging.py — structured JSON logs by default (this removes the 20×)

Default to the (already-existing) CustomJSONFormatter whenever ENVIRONMENT != "development". With JSON, the same always-on exc_info=True traceback is captured as a single log entry — the newlines live inside the quoted exc_info field — instead of one entry per line. This is what eliminates the ~20× multiplier; exc_info itself is unchanged.
The JSON formatter already existed but was gated behind Datadog configuration, so non-Datadog clusters (e.g. plain-text Azure) silently fell back to per-line tracebacks. Local development still uses plain text for readable console output.
Verified: a full traceback emits 1 physical line under JSON vs ~20 under plain text; env matrix confirmed (development → plain text, production/unset → JSON).

streams_use_case.py — throttle the error loop

Replace the flat 1s retry with capped exponential backoff (1→2→4→8→16→30s), reset on a healthy read, so a tight per-client loop can't hammer Redis or flood logs.
Full tracebacks are still logged on every failure (nothing swallowed), with a failure counter for context. Volume is bounded by the backoff and by the single-entry JSON logging above.

environment_variables.py — pool headroom

Bump in-code REDIS_MAX_CONNECTIONS default 50 → 200.

What this is NOT

A mitigation, not a root-cause fix. Connections still scale 1:1 with clients, so a large enough concurrent-stream count will still exhaust the pool — these changes keep that from becoming a log flood and raise the threshold. The durable fix (tracked separately) is a shared per-pod reader that fans out to in-process queues, so connection count becomes O(distinct streams) instead of O(clients). OTel duplicate-handler de-dup is also a separate follow-up.

Testing

ruff check on all changed files — passes.
Verified JSON formatter emits a full traceback as 1 physical line; env matrix confirmed.
Backoff sequence verified (1, 2, 4, 8, 16, 30, 30… cap).
No unit test covers the streaming path; the integration test requires testcontainers. The streaming integration tests (00-sync-020-streaming, 10-async-00-base-020-streaming) pass in CI.

🤖 Generated with Claude Code

Greptile Summary

This PR reduces SSE error-log volume and raises Redis pool headroom. The main changes are:

Production and unset environments now use JSON logging by default.
JSON log fields are capped to limit oversized structured entries.
SSE stream read failures now use capped exponential backoff.
The default Redis connection pool size is raised to 200.

Confidence Score: 3/5

The logging mitigation is useful, but request payload fields still need attention before merging because structured JSON output can retain small request data.

The changed files are narrow and the Redis/backoff behavior is straightforward, while the logging change leaves an important data-exposure path active for non-development environments.

agentex/src/utils/logging.py

T-Rex Logs

What T-Rex did

The focused runtime harness was executed to reproduce logging behavior using the real log_request path and CustomJSONFormatter with ENVIRONMENT unset for JSON logging.
Tests showed plain-text logging emitted multi-line output for unset/production/dev, and after enabling CustomJSONFormatter for unset and production, head logs emitted JSON lines, while development head stayed plain text.
The SSE error backoff and logging were exercised: the before artifact showed flat sleeps and eight SSE error events with exc_info on every error, and the after artifact showed capped exponential sleeps after a healthy read, with the eighth error indicating a reset and all logger calls still using exc_info.
Defaults and env-loading behavior were compared: before, base class/direct unset default was 50, env-loading unset default 100, and refresh override 123; after, head class/direct unset default and refresh/env-loading unset default both became 200, with the refresh override still 123, and direct construction with an env var set continues to return the class default since os.environ is only read by the env-loading path via refresh.

_{Ran code and verified through T-Rex}

Comments Outside Diff (2)

agentex/src/config/environment_variables.py, line 172-174 (link)

Default is bypassed

The new REDIS_MAX_CONNECTIONS = 200 class default is not used by the normal app startup path. GlobalDependencies calls EnvironmentVariables.refresh(), and this constructor argument still falls back to "100" when the env var is absent. A deployment without an explicit REDIS_MAX_CONNECTIONS still initializes the Redis pool with 100 connections, so the intended pool-headroom mitigation does not apply.
Artifacts

Repro: focused EnvironmentVariables.refresh harness
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: refresh output showing effective Redis max connections is 100
- Keeps the command output available without making the summary code-heavy.
_{Ran code and verified through T-Rex}
Prompt To Fix With AI
```
This is a comment left during a code review.
Path: agentex/src/config/environment_variables.py
Line: 172-174

Comment:
**Default is bypassed**

The new `REDIS_MAX_CONNECTIONS = 200` class default is not used by the normal app startup path. `GlobalDependencies` calls `EnvironmentVariables.refresh()`, and this constructor argument still falls back to `"100"` when the env var is absent. A deployment without an explicit `REDIS_MAX_CONNECTIONS` still initializes the Redis pool with 100 connections, so the intended pool-headroom mitigation does not apply.

How can I resolve this? If you propose a fix, please make it concise.
```
General comment

Repeated SSE stream read failures still emit full tracebacks
- Bug
  - The PR contract says stream_task_events should emit a full traceback only on the first failure of a stream and log repeated failures as compact single-line records with repeat/failure count. Executed head evidence shows failure Bump next from 15.2.3 to 15.5.6 in /agentex-web #1, Bump cross-spawn from 7.0.3 to 7.0.6 in /agentex-web #2, Bump brace-expansion in /agentex-web #3, and the post-reset failure all call logger.error(..., exc_info=True), so repeated failures still include traceback data. Backoff and reset behavior worked in the harness (1.0, 2.0, 4.0, reset to failure #1 after a healthy read), but traceback throttling did not.
- Cause
  - In agentex/src/domain/use_cases/streams_use_case.py, the stream error handler always passes exc_info=True to logger.error for every consecutive failure rather than conditioning traceback logging on the first failure after a healthy read.
- Fix
  - Change the error handler to pass exc_info=True only when consecutive_errors == 1; for subsequent consecutive failures, log a compact message with the failure/repeat count and exc_info=False or omit exc_info. Keep the existing reset of consecutive_errors = 0 after a successful read cycle.
_{Ran code and verified through T-Rex}

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
agentex/src/utils/logging.py:165-168
**Small bodies still log**

This only caps oversized structured fields; it does not stop JSON logging from serializing request `extra` fields. `LoggedAPIRoute.log_request` still passes `body`, `headers`, and `query_params`, and `strip_sensitive_items` only removes blacklisted keys. In non-development environments, any non-blacklisted request body under 4096 characters is still emitted into production logs, which can expose request payload data and keeps per-request body logging enabled. Please drop or allowlist these structured request fields before JSON serialization rather than only truncating large values.

_{Reviews (5): Last reviewed commit: "fix(streams): cut SSE error-log volume a..." | Re-trigger Greptile}

Greptile also left 1 inline comment on this PR.

Context used:

Rule used - What: Never log full response bodies, request bodi... (source)

danielmillerp · 2026-06-25T15:08:48Z

Addressing the two Greptile P1s:

P1 — "Default is bypassed" (environment_variables.py): fixed. Good catch. GlobalDependencies builds config via EnvironmentVariables.refresh(), which read os.environ.get(REDIS_MAX_CONNECTIONS, "100") — so the class-attribute default never applied on the real startup path and an env-var-less deployment would still get 100. Updated the refresh() fallback to "200" to match the class default. Verified: refresh(force_refresh=True) now returns 200 with no env var set, and an explicit REDIS_MAX_CONNECTIONS still overrides.

P1 — "Repeated failures still emit full tracebacks": intended, not a regression. This flags the PR against an earlier version of its own description. The design was deliberately changed: we now log the full traceback on every failure (nothing swallowed) rather than first-failure-only. The ~20× volume problem is solved structurally by the JSON logging change (each traceback is a single log entry, regardless of how many failures), and the repeat rate is bounded by the exponential backoff. So full diagnostics are retained without re-introducing the multiplier. The PR description has been updated to match. Leaving the exc_info=True on every failure intentionally.

danielmillerp · 2026-06-25T15:33:11Z

Capacity sanity-check for the REDIS_MAX_CONNECTIONS 50→200 bump (via Datadog, SGP clusters that report):

Redis server cap (redis.net.maxclients): 10,000 on every reporting SGP cluster (self-hosted in-cluster Redis, default cap).
Peak connected clients (redis.net.clients / custom redis.connections.current): ~90–143 total — roughly 1% utilization.
Rejections (redis.connections.rejected): 0 historically.

So pods × 200 would need to exceed 10,000 (50+ pods all maxed at once) to be a problem — not realistic. 200/pod is safe with large margin on the standard self-hosted-Redis setup. (Side note: observed peak of 143 is already above the old 50/100 defaults, so the bump is evidence-backed.)

Caveat: this assumes the deployment uses the standard in-cluster Redis (maxclients 10000). A managed Redis on a small tier with a low per-tier connection cap would be the only place to double-check before relying on 200 there.

The task-event SSE endpoint holds one blocking XREAD connection per connected client. When concurrent streams exceed the Redis connection pool size, every client's read fails on each cycle. Combined with plain-text logging that splits each traceback into one log entry per line, and a flat 1s retry, a sustained pool stall becomes a log-ingestion firehose. Changes: - Streaming error path: replace the flat 1s retry with capped exponential backoff (1->2->4->8->16->30s), reset on a healthy read, so a tight per-client loop can't hammer Redis or flood logs. Full tracebacks are still logged on every failure (nothing swallowed) with a failure counter for context; volume is bounded by the backoff and by single-entry JSON logging below. - Default to structured JSON logs in all deployed environments (ENVIRONMENT != "development"). JSON keeps a multi-line traceback as a single log entry (newlines live inside the quoted exc_info field) instead of one cluster-log entry per line. Local development keeps plain text for readable console output. The JSON formatter already existed but was gated behind Datadog configuration. - Bound per-field size in the JSON formatter. The JSON formatter serializes `extra` (the plain formatter dropped it), and request logging attaches body/headers/query_params as `extra`, so an unbounded payload could create large per-request entries. Cap each field at 4KB (exempting exc_info) so large payloads can't reintroduce volume. - Bump the in-code REDIS_MAX_CONNECTIONS default (50 -> 200) and the refresh() fallback (the real startup path) for peak-concurrency headroom. This is a mitigation, not a fix: connections still scale 1:1 with clients. The durable fix is a shared per-pod reader that fans out to in-process queues. Deployed environments set the real cap via the REDIS_MAX_CONNECTIONS env var. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

danielmillerp · 2026-06-25T15:42:34Z

Greptile P1 "JSON emits request bodies" — addressed (volume), with context on exposure.

You're right that the JSON formatter serializes extra while the plain formatter dropped it, so LoggedAPIRoute.log_request (body/headers/query_params) now reaches structured fields on non-Datadog clusters. Fixed the volume risk: CustomJSONFormatter now caps each field at 4KB (_MAX_JSON_FIELD_CHARS), exempting exc_info. A large payload is truncated with a ...[truncated N chars] marker instead of creating an oversized per-request entry. Verified: a 20KB body and 9KB headers each truncate to ~4KB, the message (which interpolates the body) is also capped, small fields and full tracebacks are untouched.

Two clarifications on the exposure angle:

Request bodies were already logged on these clusters today — log_request interpolates the decoded body into the message string (not just extra), so plain-text clusters already emit it. This change doesn't newly expose the body; it adds headers/query_params as fields and (now-bounded) duplicates the body.
All three (body, headers, query_params) pass through strip_sensitive_items (key blacklist: api_key/token/password/secret/cookie/authorization/acting-user), and messages additionally pass the value-level redact_sensitive_text filter. So sensitive-key data is stripped before logging.

Net: volume is now bounded per field; sensitive-key scrubbing already applied. Deeper structured-field redaction (value-level redaction on extra, not just keys) is worth a follow-up but is pre-existing Datadog-cluster behavior, out of scope here.

greptile-apps · 2026-06-25T15:48:03Z

+        return {
+            k: v if k in _UNCAPPED_JSON_FIELDS else _truncate_log_value(v)
+            for k, v in extra.items()
+        }


Small bodies still log

This only caps oversized structured fields; it does not stop JSON logging from serializing request extra fields. LoggedAPIRoute.log_request still passes body, headers, and query_params, and strip_sensitive_items only removes blacklisted keys. In non-development environments, any non-blacklisted request body under 4096 characters is still emitted into production logs, which can expose request payload data and keeps per-request body logging enabled. Please drop or allowlist these structured request fields before JSON serialization rather than only truncating large values.

Rule Used: What: Never log full response bodies, request bodi... (source)

Artifacts

Repro: focused runtime harness exercising log_request and JSON formatting

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: execution output showing serialized body headers and query_params in JSON log

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: agentex/src/utils/logging.py Line: 165-168 Comment: **Small bodies still log** This only caps oversized structured fields; it does not stop JSON logging from serializing request `extra` fields. `LoggedAPIRoute.log_request` still passes `body`, `headers`, and `query_params`, and `strip_sensitive_items` only removes blacklisted keys. In non-development environments, any non-blacklisted request body under 4096 characters is still emitted into production logs, which can expose request payload data and keeps per-request body logging enabled. Please drop or allowlist these structured request fields before JSON serialization rather than only truncating large values. **Rule Used:** What: Never log full response bodies, request bodi... ([source](https://app.greptile.com/scale-ai/-/custom-context?memory=fa8d684f-4686-4f3e-b1ef-c27453f614ea)) How can I resolve this? If you propose a fix, please make it concise.

Acknowledged — this is a fair principle, but it's pre-existing behavior rather than something this PR introduces, so handling it as a scoped follow-up:

log_request already logs the request body today (it interpolates the decoded body into the message string, not just extra), so request-body logging predates this change on every cluster. This PR adds a now-bounded structured copy; it doesn't turn body logging on.

Sensitive keys are already stripped (strip_sensitive_items: api_key/token/password/secret/cookie/authorization/acting-user), and messages also run value-level redact_sensitive_text.

Per-field volume is now capped at 4KB.

Fully satisfying the "never log request bodies" rule means dropping/allowlisting body+headers+query_params in log_request, which changes established request-logging behavior on all clusters (incl. Datadog) — out of scope for a log-volume mitigation. Filed as a follow-up: AGX1-405. Merging this PR (approved + green) and addressing body logging there.

danielmillerp requested a review from a team as a code owner June 25, 2026 14:34

danielmillerp force-pushed the dm/sse-redis-pool-headroom-log-backoff branch from a39f825 to c279c97 Compare June 25, 2026 14:57

danielmillerp changed the title ~~fix(streams): throttle SSE error loop and add Redis pool headroom~~ fix(streams): cut SSE error-log volume and add Redis pool headroom Jun 25, 2026

danielmillerp force-pushed the dm/sse-redis-pool-headroom-log-backoff branch 2 times, most recently from 45b91ff to 8e925c4 Compare June 25, 2026 15:08

greptile-apps Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread agentex/src/utils/logging.py

danielmillerp force-pushed the dm/sse-redis-pool-headroom-log-backoff branch from 8e925c4 to e3718ba Compare June 25, 2026 15:26

smoreinis approved these changes Jun 25, 2026

View reviewed changes

gergelyhunyady approved these changes Jun 25, 2026

View reviewed changes

danielmillerp force-pushed the dm/sse-redis-pool-headroom-log-backoff branch from e3718ba to e9e9e94 Compare June 25, 2026 15:42

greptile-apps Bot reviewed Jun 25, 2026

View reviewed changes

danielmillerp merged commit 352eaaa into main Jun 25, 2026
31 checks passed

danielmillerp deleted the dm/sse-redis-pool-headroom-log-backoff branch June 25, 2026 15:51

github-actions Bot mentioned this pull request Jun 25, 2026

chore(main): release 0.2.0 #326

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(streams): cut SSE error-log volume and add Redis pool headroom#340

fix(streams): cut SSE error-log volume and add Redis pool headroom#340
danielmillerp merged 1 commit into
mainfrom
dm/sse-redis-pool-headroom-log-backoff

danielmillerp commented Jun 25, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

danielmillerp commented Jun 25, 2026

Uh oh!

Uh oh!

danielmillerp commented Jun 25, 2026

Uh oh!

danielmillerp commented Jun 25, 2026

Uh oh!

greptile-apps Bot Jun 25, 2026

Uh oh!

danielmillerp Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

danielmillerp commented Jun 25, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Background

Changes

What this is NOT

Testing

Greptile Summary

Confidence Score: 3/5

T-Rex Logs

Comments Outside Diff (2)

Uh oh!

danielmillerp commented Jun 25, 2026

Uh oh!

Uh oh!

danielmillerp commented Jun 25, 2026

Uh oh!

danielmillerp commented Jun 25, 2026

Uh oh!

greptile-apps Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

danielmillerp Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

danielmillerp commented Jun 25, 2026 •

edited by greptile-apps Bot

Loading