fix(streams): cut SSE error-log volume and add Redis pool headroom#340
Conversation
a39f825 to
c279c97
Compare
45b91ff to
8e925c4
Compare
|
Addressing the two Greptile P1s: P1 — "Default is bypassed" (environment_variables.py): fixed. Good catch. P1 — "Repeated failures still emit full tracebacks": intended, not a regression. This flags the PR against an earlier version of its own description. The design was deliberately changed: we now log the full traceback on every failure (nothing swallowed) rather than first-failure-only. The ~20× volume problem is solved structurally by the JSON logging change (each traceback is a single log entry, regardless of how many failures), and the repeat rate is bounded by the exponential backoff. So full diagnostics are retained without re-introducing the multiplier. The PR description has been updated to match. Leaving the |
8e925c4 to
e3718ba
Compare
|
Capacity sanity-check for the
So Caveat: this assumes the deployment uses the standard in-cluster Redis ( |
The task-event SSE endpoint holds one blocking XREAD connection per connected client. When concurrent streams exceed the Redis connection pool size, every client's read fails on each cycle. Combined with plain-text logging that splits each traceback into one log entry per line, and a flat 1s retry, a sustained pool stall becomes a log-ingestion firehose. Changes: - Streaming error path: replace the flat 1s retry with capped exponential backoff (1->2->4->8->16->30s), reset on a healthy read, so a tight per-client loop can't hammer Redis or flood logs. Full tracebacks are still logged on every failure (nothing swallowed) with a failure counter for context; volume is bounded by the backoff and by single-entry JSON logging below. - Default to structured JSON logs in all deployed environments (ENVIRONMENT != "development"). JSON keeps a multi-line traceback as a single log entry (newlines live inside the quoted exc_info field) instead of one cluster-log entry per line. Local development keeps plain text for readable console output. The JSON formatter already existed but was gated behind Datadog configuration. - Bound per-field size in the JSON formatter. The JSON formatter serializes `extra` (the plain formatter dropped it), and request logging attaches body/headers/query_params as `extra`, so an unbounded payload could create large per-request entries. Cap each field at 4KB (exempting exc_info) so large payloads can't reintroduce volume. - Bump the in-code REDIS_MAX_CONNECTIONS default (50 -> 200) and the refresh() fallback (the real startup path) for peak-concurrency headroom. This is a mitigation, not a fix: connections still scale 1:1 with clients. The durable fix is a shared per-pod reader that fans out to in-process queues. Deployed environments set the real cap via the REDIS_MAX_CONNECTIONS env var. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
e3718ba to
e9e9e94
Compare
|
Greptile P1 "JSON emits request bodies" — addressed (volume), with context on exposure. You're right that the JSON formatter serializes Two clarifications on the exposure angle:
Net: volume is now bounded per field; sensitive-key scrubbing already applied. Deeper structured-field redaction (value-level redaction on |
| return { | ||
| k: v if k in _UNCAPPED_JSON_FIELDS else _truncate_log_value(v) | ||
| for k, v in extra.items() | ||
| } |
There was a problem hiding this comment.
This only caps oversized structured fields; it does not stop JSON logging from serializing request extra fields. LoggedAPIRoute.log_request still passes body, headers, and query_params, and strip_sensitive_items only removes blacklisted keys. In non-development environments, any non-blacklisted request body under 4096 characters is still emitted into production logs, which can expose request payload data and keeps per-request body logging enabled. Please drop or allowlist these structured request fields before JSON serialization rather than only truncating large values.
Rule Used: What: Never log full response bodies, request bodi... (source)
Artifacts
Repro: focused runtime harness exercising log_request and JSON formatting
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: execution output showing serialized body headers and query_params in JSON log
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/utils/logging.py
Line: 165-168
Comment:
**Small bodies still log**
This only caps oversized structured fields; it does not stop JSON logging from serializing request `extra` fields. `LoggedAPIRoute.log_request` still passes `body`, `headers`, and `query_params`, and `strip_sensitive_items` only removes blacklisted keys. In non-development environments, any non-blacklisted request body under 4096 characters is still emitted into production logs, which can expose request payload data and keeps per-request body logging enabled. Please drop or allowlist these structured request fields before JSON serialization rather than only truncating large values.
**Rule Used:** What: Never log full response bodies, request bodi... ([source](https://app.greptile.com/scale-ai/-/custom-context?memory=fa8d684f-4686-4f3e-b1ef-c27453f614ea))
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Acknowledged — this is a fair principle, but it's pre-existing behavior rather than something this PR introduces, so handling it as a scoped follow-up:
log_requestalready logs the request body today (it interpolates the decoded body into the message string, not justextra), so request-body logging predates this change on every cluster. This PR adds a now-bounded structured copy; it doesn't turn body logging on.- Sensitive keys are already stripped (
strip_sensitive_items: api_key/token/password/secret/cookie/authorization/acting-user), and messages also run value-levelredact_sensitive_text. - Per-field volume is now capped at 4KB.
Fully satisfying the "never log request bodies" rule means dropping/allowlisting body+headers+query_params in log_request, which changes established request-logging behavior on all clusters (incl. Datadog) — out of scope for a log-volume mitigation. Filed as a follow-up: AGX1-405. Merging this PR (approved + green) and addressing body logging there.
What
Short-term mitigations for unbounded log volume produced by the task-event SSE endpoint when concurrent streams exceed the Redis connection pool size.
Background
The SSE streaming path (
StreamsUseCase.stream_task_events) holds one blockingXREADconnection per connected client. On a high enough number of concurrent streams the pool is exhausted, so every client's read fails on each cycle. Two things turned that into a log firehose:exc_info=True(it always has). Under a plain-text formatter,exc_info=Truerenders the traceback as ~20 physical lines, and the cluster log collector ingests one entry per line — so every error became ~20 entries. "Switch to exc_info=True" would have been a no-op; the real cause was the formatter.sleep(1), so failures repeated ~once/sec per client.Changes
utils/logging.py— structured JSON logs by default (this removes the 20×)CustomJSONFormatterwheneverENVIRONMENT != "development". With JSON, the same always-onexc_info=Truetraceback is captured as a single log entry — the newlines live inside the quotedexc_infofield — instead of one entry per line. This is what eliminates the ~20× multiplier;exc_infoitself is unchanged.streams_use_case.py— throttle the error loopenvironment_variables.py— pool headroomREDIS_MAX_CONNECTIONSdefault 50 → 200.What this is NOT
A mitigation, not a root-cause fix. Connections still scale 1:1 with clients, so a large enough concurrent-stream count will still exhaust the pool — these changes keep that from becoming a log flood and raise the threshold. The durable fix (tracked separately) is a shared per-pod reader that fans out to in-process queues, so connection count becomes O(distinct streams) instead of O(clients). OTel duplicate-handler de-dup is also a separate follow-up.
Testing
ruff checkon all changed files — passes.00-sync-020-streaming,10-async-00-base-020-streaming) pass in CI.🤖 Generated with Claude Code
Greptile Summary
This PR reduces SSE error-log volume and raises Redis pool headroom. The main changes are:
Confidence Score: 3/5
The logging mitigation is useful, but request payload fields still need attention before merging because structured JSON output can retain small request data.
The changed files are narrow and the Redis/backoff behavior is straightforward, while the logging change leaves an important data-exposure path active for non-development environments.
agentex/src/utils/logging.py
What T-Rex did
Comments Outside Diff (2)
agentex/src/config/environment_variables.py, line 172-174 (link)The new
REDIS_MAX_CONNECTIONS = 200class default is not used by the normal app startup path.GlobalDependenciescallsEnvironmentVariables.refresh(), and this constructor argument still falls back to"100"when the env var is absent. A deployment without an explicitREDIS_MAX_CONNECTIONSstill initializes the Redis pool with 100 connections, so the intended pool-headroom mitigation does not apply.Artifacts
Repro: focused EnvironmentVariables.refresh harness
Repro: refresh output showing effective Redis max connections is 100
Prompt To Fix With AI
General comment
stream_task_eventsshould emit a full traceback only on the first failure of a stream and log repeated failures as compact single-line records with repeat/failure count. Executed head evidence shows failure Bump next from 15.2.3 to 15.5.6 in /agentex-web #1, Bump cross-spawn from 7.0.3 to 7.0.6 in /agentex-web #2, Bump brace-expansion in /agentex-web #3, and the post-reset failure all calllogger.error(..., exc_info=True), so repeated failures still include traceback data. Backoff and reset behavior worked in the harness (1.0, 2.0, 4.0, reset tofailure #1after a healthy read), but traceback throttling did not.agentex/src/domain/use_cases/streams_use_case.py, the stream error handler always passesexc_info=Truetologger.errorfor every consecutive failure rather than conditioning traceback logging on the first failure after a healthy read.exc_info=Trueonly whenconsecutive_errors == 1; for subsequent consecutive failures, log a compact message with the failure/repeat count andexc_info=Falseor omitexc_info. Keep the existing reset ofconsecutive_errors = 0after a successful read cycle.Prompt To Fix All With AI
Reviews (5): Last reviewed commit: "fix(streams): cut SSE error-log volume a..." | Re-trigger Greptile
Context used: