Skip to content

Add cooldown after metrics upload failure to prevent memory spikes#1700

Open
Siddhant-K-code wants to merge 4 commits into
mainfrom
fix/metrics-delivery-failure-memory-spike
Open

Add cooldown after metrics upload failure to prevent memory spikes#1700
Siddhant-K-code wants to merge 4 commits into
mainfrom
fix/metrics-delivery-failure-memory-spike

Conversation

@Siddhant-K-code

@Siddhant-K-code Siddhant-K-code commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

When metrics delivery fails (unreachable server, non-prod URL configured), the flush loop was attempting uploads every 3s — each blocking up to 30s on HTTP timeout while holding ~300MB of deserialized event data (Vec<MetricRecord> + Vec<MetricEvent> + serialized JSON POST body) in memory. With METRICS_UPLOAD_AVAILABLE staying true, the stream worker's backpressure also counted undrainable pending DB rows, defeating the throttle.

This adds a 5-minute cooldown (METRICS_UPLOAD_LAST_FAILURE_AT) after delivery failure:

// Before: flush_pending_metrics always attempted upload if metrics_upload_allowed() == true
// After: skips upload during cooldown, sets METRICS_UPLOAD_AVAILABLE = false
let last_failure = METRICS_UPLOAD_LAST_FAILURE_AT.load(Ordering::Relaxed);
if now_secs.saturating_sub(last_failure) < METRICS_UPLOAD_FAILURE_COOLDOWN.as_secs() {
    METRICS_UPLOAD_AVAILABLE.store(false, Ordering::Relaxed);
    return; // events still persist to SQLite, just no upload attempt
}

Both flush_metrics (buffer-driven path) and flush_pending_metrics (periodic drain path) respect the cooldown. On successful delivery after cooldown expires, the failure state is cleared. Row-level retry (mark_records_failed with exponential backoff) still handles eventual delivery once the server is reachable again.


Open in Devin Review

When metrics delivery fails (e.g., unreachable server, non-prod URL),
the flush loop previously kept attempting uploads every 3s, each blocking
for up to 30s on HTTP timeout while holding ~300MB of deserialized event
data in memory.

This adds a 5-minute cooldown after a delivery failure:
- METRICS_UPLOAD_LAST_FAILURE_AT tracks the last failure timestamp
- During cooldown, flush_pending_metrics and flush_metrics skip upload
  attempts entirely (events are still persisted to SQLite)
- METRICS_UPLOAD_AVAILABLE is set to false during cooldown so the
  stream worker backpressure check doesn't count pending DB rows
  that can't be drained anyway
- After cooldown expires, a single retry determines if the server
  is back; on success the failure state is cleared

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@Siddhant-K-code Siddhant-K-code self-assigned this Jun 30, 2026
@devin-ai-integration

Copy link
Copy Markdown
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

@Siddhant-K-code Siddhant-K-code marked this pull request as draft June 30, 2026 16:47
devin-ai-integration[bot]

This comment was marked as resolved.

@Siddhant-K-code Siddhant-K-code marked this pull request as ready for review June 30, 2026 17:20
@Siddhant-K-code

Copy link
Copy Markdown
Collaborator Author

With this^, it starts with ~300MB, and gradually slows down to ~20 MB or so in 10-15s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants