Deadlock in sotw v3 server #503
Probably worth adding that we have 19 different …
Not sure I'm seeing all the problems here yet, but …
One possible fix is to drop the cache lock when sending responses. While holding the lock in `SetSnapshot`, generate all the watch responses into a slice, then drop the lock and send them all. That's simpler than trying to fix the channel management in the server, IMHO (though that would be a more general fix). Looking at `SetSnapshot`, it *mostly* doesn't need to hold the write mutex (except where it does). It only needs the write mutex when it adds the snapshot to the cache, and when it calls `ConstructVersionMap`. The first case is easy: we only need to hold the lock while we set the map entry. The second case we can refactor away, but that's a larger change. Unfortunately, to fully fix the deadlock with this approach, all calls to …
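To make the shape of that fix concrete, here is a minimal sketch of the "collect under the lock, send after unlocking" pattern. The types and fields (`snapshotCache`, `watch`, `responses`) are simplified stand-ins for illustration, not the actual go-control-plane code:

```go
package cache

import "sync"

// response and watch are simplified stand-ins for the cache's per-watch state.
type response struct {
	version string
}

type watch struct {
	responses chan response // bounded channel, like values.responses (capacity 5)
}

type snapshotCache struct {
	mu      sync.RWMutex
	watches map[int64]watch
}

// SetSnapshot computes a response for every open watch while holding the
// lock, then releases the lock before sending on the (possibly full)
// response channels. A blocked send no longer holds up other callers that
// need cache.mu, such as the cancel path.
func (c *snapshotCache) SetSnapshot(version string) {
	type pending struct {
		w    watch
		resp response
	}

	c.mu.Lock()
	out := make([]pending, 0, len(c.watches))
	for _, w := range c.watches {
		out = append(out, pending{w: w, resp: response{version: version}})
	}
	c.mu.Unlock()

	// Sends happen outside the critical section. If a channel is full we
	// still block here, but without holding cache.mu.
	for _, p := range out {
		p.w.responses <- p.resp
	}
}
```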
Since canceling a watch only modifies the corresponding `statusInfo`, we don't need to hold a write lock on the snapshot cache. Holding a read lock is sufficient to ensure that the status entry is stable. This updates envoyproxy#503. Signed-off-by: James Peach <[email protected]>
Holding a read lock is sufficient to ensure that the status entry is stable when cancelling watches. This updates #503. Signed-off-by: James Peach <[email protected]>
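For context, the change described in those commits amounts to taking the cache mutex in read mode inside the cancel function, since cancellation only mutates the per-node status entry. A rough sketch, using assumed simplified types rather than the real library internals:

```go
package cache

import "sync"

// statusInfo mirrors the idea of per-node watch bookkeeping with its own lock.
type statusInfo struct {
	mu      sync.Mutex
	watches map[int64]chan struct{} // watch ID -> (simplified) response channel
}

type snapshotCache struct {
	mu     sync.RWMutex
	status map[string]*statusInfo
}

// cancelWatch returns the cancel function handed back when a watch is
// created. Holding cache.mu as a read lock keeps the status map stable;
// the actual mutation happens under the statusInfo's own mutex, so no
// write lock on the cache is required.
func (c *snapshotCache) cancelWatch(nodeID string, watchID int64) func() {
	return func() {
		c.mu.RLock()
		defer c.mu.RUnlock()
		if info, ok := c.status[nodeID]; ok {
			info.mu.Lock()
			delete(info.watches, watchID)
			info.mu.Unlock()
		}
	}
}
```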
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
not stale
Hi @jpeach, I just ran into the same problem. I also wonder whether overriding each …
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
The problem occurs in our testing after the update to the latest go-control-plane version, but I believe it might happen in real-world use cases.
The stack trace for the two goroutines looks like this:
While the first goroutine tries to call `SetSnapshot` and update all watchers (link), the server goroutine receives a `DiscoveryRequest` and tries to call `cancel`. Both `SetSnapshot` and `cancel` call `cache.mu.Lock()`.
The first goroutine, inside `SetSnapshot`, can't update the watchers because the `values.responses` channel is full (it has capacity 5), so it blocks until the server goroutine reads something from this channel. But the server goroutine can't read anything from `values.responses` because it is inside `cancel`, waiting for the `cache.mu` lock to be released.
cc: @jpeach
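The same deadlock shape can be reproduced in isolation. The program below is only an illustration of the interaction described above (a mutex held across a send on a full bounded channel, while the goroutine that would drain the channel needs that same mutex), not the actual go-control-plane code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	responses := make(chan string, 5) // bounded, like values.responses

	// Pre-fill the channel so the next send blocks, as when several
	// responses are already queued for a slow client.
	for i := 0; i < 5; i++ {
		responses <- "queued"
	}

	// Goroutine A plays the role of SetSnapshot: it takes the lock, then
	// blocks forever on the send because nobody is draining the channel.
	go func() {
		mu.Lock()
		defer mu.Unlock()
		responses <- "new snapshot"
	}()

	// Goroutine B plays the role of the sotw server handling a
	// DiscoveryRequest: it must acquire the lock (to cancel the old
	// watch) before it ever gets back to reading the response channel.
	go func() {
		time.Sleep(10 * time.Millisecond) // let A take the lock first
		mu.Lock()                         // blocks forever: A never unlocks
		fmt.Println(<-responses)
		mu.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are still blocked: deadlock")
}
```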