andresgutgon
Contributor

@andresgutgon andresgutgon commented Oct 15, 2025

What?

A background run job finished, but the UI didn't notice and tried to attach to it.

What was the problem?

I ran a stress test with multiple background tasks running simultaneously. It showed that in the enqueue phase we enqueue the job in BullMQ first and only afterwards add the run to the active runs.

I also added backoff retry to the lock mechanism so it copes better with contention when there are many concurrent writes. The main fix, though, is for the race condition: we were enqueuing the job first instead of adding the run to the active runs first.
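
To illustrate the ordering problem, here is a minimal sketch that assumes an in-memory store and a queue whose worker picks the job up immediately (the worst case for the race). `ActiveRuns`, `ImmediateQueue`, and `simulate` are hypothetical names for illustration only; they are not the repository's code or the BullMQ API.

```typescript
// Hypothetical in-memory stand-ins; none of these names come from the PR.
type Outcome = 'started' | 'start-failed'

class ActiveRuns {
  private runs = new Map<string, { queuedAt: Date }>()

  create(runUuid: string): void {
    this.runs.set(runUuid, { queuedAt: new Date() })
  }

  has(runUuid: string): boolean {
    return this.runs.has(runUuid)
  }
}

// A queue whose worker runs as soon as a job is added.
class ImmediateQueue {
  private worker: (runUuid: string) => void

  constructor(worker: (runUuid: string) => void) {
    this.worker = worker
  }

  add(runUuid: string): void {
    this.worker(runUuid)
  }
}

function simulate(order: 'enqueue-first' | 'register-first'): Outcome {
  const activeRuns = new ActiveRuns()
  const result: { outcome: Outcome } = { outcome: 'start-failed' }

  // The worker can only start a run that is already registered as active.
  const queue = new ImmediateQueue((runUuid) => {
    result.outcome = activeRuns.has(runUuid) ? 'started' : 'start-failed'
  })

  const runUuid = 'run-1'
  if (order === 'enqueue-first') {
    queue.add(runUuid) // worker runs before the run is registered
    activeRuns.create(runUuid)
  } else {
    activeRuns.create(runUuid) // register first, then enqueue
    queue.add(runUuid)
  }
  return result.outcome
}
```

In this sketch, `simulate('enqueue-first')` ends in `'start-failed'`, while `simulate('register-first')` ends in `'started'`, which is the ordering this PR moves to.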

@andresgutgon andresgutgon added the 🚧 wip Work in progress label Oct 15, 2025
@andresgutgon andresgutgon force-pushed the fix/check-complete-event-on-background-run branch from 3dc9378 to 04132cb Compare October 16, 2025 11:07
@andresgutgon andresgutgon force-pushed the fix/check-complete-event-on-background-run branch from 72c3da9 to 5eb7ca7 Compare October 16, 2025 16:26
@andresgutgon andresgutgon removed the 🚧 wip Work in progress label Oct 16, 2025
const repository = new RunsRepository(workspace.id, project.id)
const creating = await repository.create({ runUuid, queuedAt: new Date() })
if (creating.error) return Result.error(creating.error)
const run = creating.value
Contributor Author

This was the issue: by the time we added the run to the active runs, the BullMQ job had already run and the start had failed.

@andresgutgon andresgutgon changed the title from "Fix missing background run job by defending the non-existence of the job on the UI" to "Fix race condition adding background run after adding the job to Bullmq" Oct 17, 2025
export const withCacheLock = async <T>({
lockKey,
callbackFn,
timeout = 5000,
Collaborator

Could we reduce this timeout? 5 seconds is huge for Redis; it should be at most 1 second, and that's already stretching it.

geclos
geclos previously approved these changes Oct 17, 2025
neoxelox
neoxelox previously approved these changes Oct 17, 2025
@andresgutgon andresgutgon dismissed stale reviews from neoxelox and geclos via 0737fc0 October 17, 2025 09:55
export const withCacheLock = async <T>({
lockKey,
callbackFn,
timeout = 5000,
Contributor Author

Suggested change
- timeout = 5000,
+ timeout = 1000,
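
For context on how a 1 second timeout interacts with backoff retry, here is a rough sketch. It assumes the timeout bounds the total time spent waiting to acquire the lock, which may not match what `withCacheLock` actually does. An in-memory set stands in for Redis, and `acquire`, `release`, `backoffDelay`, `sleep`, and `withLockRetry` are hypothetical names, not the real implementation.

```typescript
// Hypothetical sketch: an in-memory set stands in for Redis.
const heldLocks = new Set<string>()

function acquire(lockKey: string): boolean {
  if (heldLocks.has(lockKey)) return false
  heldLocks.add(lockKey)
  return true
}

function release(lockKey: string): void {
  heldLocks.delete(lockKey)
}

// Exponential backoff capped at maxDelayMs: 25, 50, 100, 200, 200, ...
function backoffDelay(attempt: number, baseMs = 25, maxDelayMs = 200): number {
  return Math.min(baseMs * 2 ** attempt, maxDelayMs)
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

async function withLockRetry<T>({
  lockKey,
  callbackFn,
  timeout = 1000, // assumption: total budget for acquiring the lock, in ms
}: {
  lockKey: string
  callbackFn: () => Promise<T>
  timeout?: number
}): Promise<T> {
  const deadline = Date.now() + timeout
  for (let attempt = 0; ; attempt++) {
    if (acquire(lockKey)) {
      try {
        return await callbackFn()
      } finally {
        release(lockKey) // always release, even if the callback throws
      }
    }
    const remaining = deadline - Date.now()
    if (remaining <= 0) throw new Error(`Timed out acquiring lock ${lockKey}`)
    // Wait before retrying, but never past the deadline.
    await sleep(Math.min(backoffDelay(attempt), remaining))
  }
}
```

With these assumed defaults, a caller that keeps losing the lock retries after 25, 50, 100, and then 200 ms intervals, so several attempts still fit inside the 1 second budget.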

@andresgutgon andresgutgon merged commit c38aded into main Oct 17, 2025
7 checks passed
@andresgutgon andresgutgon deleted the fix/check-complete-event-on-background-run branch October 17, 2025 10:14
@github-actions github-actions bot locked and limited conversation to collaborators Oct 17, 2025
3 participants