gateway: reduce potential lock contention in gateway forwarder #6741

Merged
tonistiigi merged 1 commit into moby:master from jsternberg:forwarding-ping-no-job-id on May 6, 2026

@jsternberg
Collaborator

The gateway forwarder's logic is prone to lock contention. The previous
implementation kept a global map of build IDs and, when no forwarder
existed for a build ID, waited up to 3 seconds for the build to
register.

The contention arises during that wait. Rather than signaling on a
channel when a specific build was ready, the forwarder woke every
waiting goroutine each time any build was registered. Each woken
goroutine took a read lock to check whether its build was present, and
each subsequent registration took a write lock, so starting many builds
at the same time easily led to lock contention. It was then easy to hit
the 3-second timeout, especially when the machine itself was under load.

This change makes the notification per build. Looking up a build ID
creates a forwarder registrar with a channel that signals when the
registration is complete. A forwarder is then woken by the Go runtime
only when its specific build ID is ready, rather than by a broadcast on
the sync condition.

Potentially reduces how often #5171 occurs.

Member

@tonistiigi tonistiigi left a comment

I think this should be done with a generics-based utility.

@jsternberg
Collaborator Author

@tonistiigi any ideas on a good name for the part you want me to split out? I'm attempting to look at splitting it out but I'd be removing almost the entirety of the gateway forwarder. Maybe we can defer splitting this out until it's needed somewhere else?

@tonistiigi
Member

@jsternberg Something like util/registrar.

> Maybe we can defer splitting this out until it's needed somewhere else?

I think it would still be much cleaner with this separation, but if you want, you can leave the generic mechanism private for now instead of adding a public package for it (although I think the session registration is probably a very similar mechanism that we could look at in a follow-up).

Signed-off-by: Jonathan A. Sternberg <jonathan.sternberg@docker.com>
@jsternberg jsternberg force-pushed the forwarding-ping-no-job-id branch from 686d666 to 4b9488b May 6, 2026 19:21
@jsternberg
Collaborator Author

Thanks for the clarification. I've broken out the logic into its own package.

select {
case <-reg.notifyCh:
	return
case <-time.After(3 * time.Second):

Member

Shouldn't this just be passed via Get(ctx) with context.WithTimeout()?

Collaborator Author

I don't think this is generally needed anymore. I think time.After() ends up being easier to track, and it also cleans itself up automatically. I also made the timer start only when the Get call is what creates the registration; if Register happens first, no timer is created. The outer section (the part that doesn't run in a goroutine) only considers the passed-in context, in case the gRPC call is canceled. The timeout is contained entirely in the spawned goroutine and only starts after the registrar has been created and is waiting. This also keeps the timer from inadvertently waiting on a busy global lock.

@tonistiigi tonistiigi merged commit 5dc04eb into moby:master May 6, 2026
239 of 241 checks passed
@jsternberg jsternberg deleted the forwarding-ping-no-job-id branch May 6, 2026 20:58