Fix race condition adding background run after adding the job to BullMQ #1815
Conversation
Force-pushed from 3dc9378 to 04132cb
run is set in Redis
Force-pushed from 72c3da9 to 5eb7ca7
const repository = new RunsRepository(workspace.id, project.id)
const creating = await repository.create({ runUuid, queuedAt: new Date() })
if (creating.error) return Result.error(creating.error)
const run = creating.value
This was the issue: by the time we added this to the active runs, the BullMQ job had already run and `start` failed.
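A minimal sketch of the corrected ordering, assuming the `RunsRepository`/`Result` helpers from the snippet above and a hypothetical BullMQ queue named `backgroundRuns`: the run is written to the active runs store in Redis before the job is added to the queue, so a fast worker can no longer start before the run exists.

```typescript
import { Queue } from 'bullmq'

// Assumed from the PR's codebase; declared here only so the sketch type-checks.
declare class RunsRepository {
  constructor(workspaceId: number, projectId: number)
  create(args: {
    runUuid: string
    queuedAt: Date
  }): Promise<{ error?: Error; value?: unknown }>
}
declare const Result: { error(e: Error): unknown; ok(v: unknown): unknown }

const runsQueue = new Queue('backgroundRuns') // hypothetical queue name

async function enqueueBackgroundRun(
  workspace: { id: number },
  project: { id: number },
  runUuid: string,
) {
  const repository = new RunsRepository(workspace.id, project.id)

  // 1. Register the run in active runs (Redis) first...
  const creating = await repository.create({ runUuid, queuedAt: new Date() })
  if (creating.error) return Result.error(creating.error)

  // 2. ...and only then enqueue the BullMQ job that will consume it.
  await runsQueue.add('backgroundRun', { runUuid })

  return Result.ok(creating.value)
}
```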
packages/core/src/cache/index.ts
Outdated
export const withCacheLock = async <T>({
  lockKey,
  callbackFn,
  timeout = 5000,
Could we reduce this timeout? 5 seconds is huge for Redis; it should be at most 1 second, and even that is stretching it.
Co-authored-by: Alex Rodríguez <[email protected]>
packages/core/src/cache/index.ts
Outdated
export const withCacheLock = async <T>({
  lockKey,
  callbackFn,
  timeout = 5000,
Suggested change:
-  timeout = 5000,
+  timeout = 1000,
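For context, a minimal sketch of what a Redis-backed `withCacheLock` with the suggested 1-second default could look like, using ioredis `SET ... PX ... NX`; the actual helper in `packages/core/src/cache/index.ts` may be implemented differently.

```typescript
import Redis from 'ioredis'

const redis = new Redis() // assumes the shared cache connection

// Sketch only: take a short-lived Redis lock, run the callback, release the lock.
export const withCacheLock = async <T>({
  lockKey,
  callbackFn,
  timeout = 1000, // 1s, per the review suggestion
}: {
  lockKey: string
  callbackFn: () => Promise<T>
  timeout?: number
}): Promise<T> => {
  // SET ... PX <timeout> NX only succeeds if the key does not exist yet,
  // and the key expires on its own if the holder crashes.
  const acquired = await redis.set(lockKey, '1', 'PX', timeout, 'NX')
  if (!acquired) throw new Error(`Could not acquire lock: ${lockKey}`)

  try {
    return await callbackFn()
  } finally {
    await redis.del(lockKey)
  }
}
```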
What?
A background run job finished, but the UI didn't notice and tried to attach to it.
What was the problem?
I ran a stress test with multiple background tasks running simultaneously. Doing that, I saw that in the enqueue phase we enqueue the job in BullMQ first and only later add it to active runs.
I also added retry with backoff to the lock mechanism so it copes better with contention from many concurrent writes. But the main fix is the race condition: the job must be enqueued only after the run has been added to active runs, not before.
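A rough sketch of the backoff idea, with illustrative attempt counts and delays (not necessarily the values used in this PR):

```typescript
// Illustrative only: retry lock acquisition with exponential backoff so
// concurrent writers wait their turn instead of failing immediately.
async function acquireWithBackoff(
  tryAcquire: () => Promise<boolean>, // e.g. the SET NX call inside withCacheLock
  { attempts = 5, baseDelayMs = 50 } = {},
): Promise<boolean> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    if (await tryAcquire()) return true
    // 50ms, 100ms, 200ms, ... gives the current lock holder time to finish.
    const delayMs = baseDelayMs * 2 ** attempt
    await new Promise((resolve) => setTimeout(resolve, delayMs))
  }
  return false
}
```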