Upper Bound on attempted_by / Dangers of Snoozing
#972
jackHedaya started this conversation in General
Hey @jackHedaya, thanks for reporting this, and sorry you ran into this issue. I think your issue highlights that it is likely prudent for us to put limits on all jsonb arrays to prevent unbounded growth. I think we should be able to cleanly support ~indefinite snoozing with a few minor tweaks like this. Thoughts @brandur?
Hi all!
I want to start by acknowledging that this issue resulted from our misuse of RiverQueue, not a bug in the library itself. However, I'm sharing this experience in case there's an opportunity to add safeguards against similar misuse patterns or to document them somewhere.
Context
My team uses RiverQueue for jobs that wait for incoming webhooks. When a job runs and the webhook hasn't been received yet (checked via database state), the job snoozes itself to retry later.
The Problem
We set an (admittedly too aggressive) 10-second snooze duration, expecting webhooks to arrive quickly after requests. This worked fine at first, but when staging was misconfigured, jobs snoozed indefinitely in tight loops.
While endless retrying is conceptually problematic on its own, the real impact was much worse: River appends to the attempted_by field on every execution, without bounds, so our job records silently grew to enormous sizes. In a single month we incurred 51 TB of inter-AZ traffic on our AWS staging account before discovering the issue.
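Back-of-the-envelope arithmetic shows how a handful of jobs can produce traffic at this scale. These are illustrative assumptions, not measurements from the incident: one job snoozing every 10 seconds for 30 days, appending one ~20-byte entry to attempted_by per execution, with the whole array rewritten each time.

```go
package main

import "fmt"

// Illustrative assumptions (not measured values from the incident).
const (
	snoozeSecs = 10 // snooze interval
	days       = 30 // duration of the tight loop
	entryBytes = 20 // assumed size of one attempted_by entry
)

// attempts returns how many executions the loop performs.
func attempts() int64 {
	return int64(days * 24 * 3600 / snoozeSecs)
}

// finalArrayBytes is the size attempted_by reaches on the last write.
func finalArrayBytes(n int64) int64 { return n * entryBytes }

// cumulativeBytes sums the array size over every rewrite: execution i
// writes i entries, so total traffic grows quadratically, n*(n+1)/2
// entries in all.
func cumulativeBytes(n int64) int64 { return n * (n + 1) / 2 * entryBytes }

func main() {
	n := attempts()
	fmt.Printf("executions: %d\n", n)                                                        // 259200
	fmt.Printf("final attempted_by size: ~%.1f MB\n", float64(finalArrayBytes(n))/1e6)       // ~5.2 MB
	fmt.Printf("cumulative bytes written: ~%.0f GB per job\n", float64(cumulativeBytes(n))/1e9) // ~672 GB
}
```

Under these assumptions a single runaway job accounts for roughly 672 GB of rewrites over the month, because the traffic is quadratic in the number of executions, not linear. A few dozen such jobs would plausibly reach the 51 TB we observed.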
Ideas for Safeguards
I'm also wondering: are there other implementation details in River that make endless snoozing dangerous?
cc @magaldima @themaxgoldman for vis