gateway: reduce potential lock contention in gateway forwarder #6741
Conversation
Force-pushed from d8f1eb7 to 686d666
tonistiigi left a comment:
I think this should be done with a generics-based utility.
@tonistiigi any ideas on a good name for the part you want me to split out? I've been looking at splitting it out, but I'd be removing almost the entirety of the gateway forwarder. Maybe we can defer splitting this out until it's needed somewhere else?
@jsternberg Something like…
I think it would still be much cleaner with this separation, but if you want, you can leave the generic mechanism private for now instead of adding a public pkg for it (although I think the session registration is probably a very similar mechanism that we could look at in a follow-up).
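For illustration, here is a minimal sketch of the kind of generics-based registration utility being discussed. All names (Registrar, entry, Register, Get, New) are hypothetical, not the identifiers the PR actually adds, and this version waits on a plain context, whereas the discussion below lands on a time.After-based timeout inside the forwarder:

```go
package registrar

import (
	"context"
	"sync"
)

// entry pairs a registered value with a channel that is closed once
// the value arrives, so waiters for this key block on it directly.
type entry[V any] struct {
	value V
	done  chan struct{}
}

// Registrar maps keys to values and lets callers wait for a specific
// key to be registered, without a broadcast that wakes every waiter.
type Registrar[K comparable, V any] struct {
	mu      sync.Mutex
	entries map[K]*entry[V]
}

func New[K comparable, V any]() *Registrar[K, V] {
	return &Registrar[K, V]{entries: make(map[K]*entry[V])}
}

// lookup returns the entry for key, creating a placeholder if none
// exists yet. The global lock is held only for the map access.
func (r *Registrar[K, V]) lookup(key K) *entry[V] {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[key]
	if !ok {
		e = &entry[V]{done: make(chan struct{})}
		r.entries[key] = e
	}
	return e
}

// Register stores value under key and wakes only the waiters for that
// key by closing its channel. Assumes each key is registered at most
// once.
func (r *Registrar[K, V]) Register(key K, value V) {
	e := r.lookup(key)
	e.value = value
	close(e.done)
}

// Get blocks until key has been registered or ctx is done.
func (r *Registrar[K, V]) Get(ctx context.Context, key K) (V, error) {
	e := r.lookup(key)
	select {
	case <-e.done:
		return e.value, nil
	case <-ctx.Done():
		var zero V
		return zero, ctx.Err()
	}
}
```

Because each waiter blocks on its own key's channel, registering one build wakes only the goroutines interested in that build, and the global lock is held only long enough to touch the map.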
Force-pushed from 686d666 to 4b9488b
Thanks for the clarification. I've broken out the logic into its own package.
select {
case <-reg.notifyCh:
	return
case <-time.After(3 * time.Second):
Shouldn't this just be passed via Get(ctx) with context.WithTimeout()?
I don't think this is generally needed anymore. I think the time.After() ends up being easier to track and also automatically cleans itself up. I also chose to make it so the timer only gets started if the Get call is the reason the registration was created; if Register happens first, no timer gets created. The outer section (the part that doesn't run in a goroutine) only considers the passed-in context, just in case the grpc call got canceled, while the timeout is contained in the spawned goroutine and only starts after the registrar is created and is waiting. This also prevents the timer from counting down while the goroutine is still waiting on a busy global lock.
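As a sketch of the waiting behavior described in this comment, assuming hypothetical names (registration, notifyCh, wait, created) rather than the PR's actual identifiers:

```go
package forward

import (
	"context"
	"errors"
	"time"
)

// registration is hypothetical: notifyCh is closed by Register when
// the build becomes available.
type registration struct {
	notifyCh chan struct{}
}

// wait blocks until the registration is notified. created reports
// whether this Get call was the one that created the registration;
// only then is a timeout armed, matching the behavior described above.
func wait(ctx context.Context, reg *registration, created bool) error {
	errCh := make(chan error, 1)
	go func() {
		if !created {
			// Register ran first, so the notification is already
			// pending; no timer is started.
			<-reg.notifyCh
			errCh <- nil
			return
		}
		// The timer starts only once the registration exists and this
		// goroutine is actually waiting, so time spent blocked on a
		// busy global lock cannot eat into the 3-second budget.
		select {
		case <-reg.notifyCh:
			errCh <- nil
		case <-time.After(3 * time.Second):
			errCh <- errors.New("timed out waiting for build to register")
		}
	}()
	// The outer select honors only the caller's context, in case the
	// grpc call itself is canceled.
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

The error channel is buffered so the spawned goroutine can always deliver its result and exit, even when the caller's context wins the outer select.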
There's a large potential for a lock contention issue in the gateway
forwarder's logic. The previous iteration would keep a global mapping
of build ids and, when a forwarder for a build id didn't exist, the
forwarder would wait 3 seconds for the build to register.

The issue with lock contention comes after this. Instead of having a
notification channel signaling that a specific build was ready, the
forwarder would wake up all waiting goroutines each time any build was
registered. Since each of those waiters took a read lock to check
whether its build was present, and registering subsequent builds took
a write lock, it was very easy to end up in a lock contention scenario
when starting many builds at the same time, and then easy to hit the
3-second timeout, especially when the machine itself was under load.

This changes the notification mechanism so the notification happens
per build. Looking up a build id creates a forwarder registrar with a
channel that can be polled for completion of the registration. A
forwarder is then woken by the Go runtime only when its specific build
id is ready, rather than by a broadcast on a sync condition.
Potentially alleviates how often #5171 will happen.

Signed-off-by: Jonathan A. Sternberg <jonathan.sternberg@docker.com>
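To make the failure mode concrete, here is a sketch of the broadcast-style pattern the commit message describes as the previous behavior. The names (broadcastRegistry, register, get) are hypothetical, and this is an illustration of the pattern, not the old code itself:

```go
package forward

import "sync"

type forwarder struct{}

// broadcastRegistry demonstrates the contended design: a read lock to
// check the map, a write lock to register, and a condition variable
// whose Broadcast wakes every waiter on every registration.
type broadcastRegistry struct {
	mu     sync.RWMutex
	cond   *sync.Cond // waits while holding the read lock
	builds map[string]*forwarder
}

func newBroadcastRegistry() *broadcastRegistry {
	r := &broadcastRegistry{builds: make(map[string]*forwarder)}
	r.cond = sync.NewCond(r.mu.RLocker())
	return r
}

func (r *broadcastRegistry) register(id string, f *forwarder) {
	r.mu.Lock() // write lock contends with every woken reader
	r.builds[id] = f
	r.mu.Unlock()
	r.cond.Broadcast() // wakes ALL waiters, not just the one for id
}

func (r *broadcastRegistry) get(id string) *forwarder {
	r.mu.RLock()
	defer r.mu.RUnlock()
	for r.builds[id] == nil {
		// Every Broadcast wakes this goroutine, which must re-acquire
		// the read lock just to re-check the map, contending with all
		// other waiters and with new registrations.
		r.cond.Wait()
	}
	return r.builds[id]
}
```

With N concurrent builds, each registration wakes all remaining waiters, each of which re-takes the read lock while new registrations queue for the write lock, so the number of wakeups grows roughly quadratically and the 3-second budget is quickly consumed on a loaded machine.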