Deadlock in sotw v3 server #503
Probably worth adding that we have 19 different …
Not sure I'm seeing all the problems here yet, but …
One possible fix is to drop the cache lock when sending responses. While holding the lock in `SetSnapshot`, generate all the watch responses into a slice, then drop the lock and send them all. That's simpler than trying to fix the channel management in the server, IMHO (though that would be a more general fix). Looking at `SetSnapshot`, it *mostly* doesn't need to hold the write mutex (except where it does). It only needs the write mutex when it adds the snapshot to the cache, and when it calls `ConstructVersionMap`. The first case is easy: we only need to hold the lock while we set the map entry. The second case we can refactor away, but that's a larger change. Unfortunately, to fully fix the deadlock with this approach, all calls to …
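To make the shape of that fix concrete, here is a minimal sketch of the "collect under the lock, send after unlocking" pattern. The types and fields (`snapshotCache`, `watch`, `responses`) are simplified stand-ins for illustration, not the actual go-control-plane code:

```go
package cache

import "sync"

// response and watch are simplified stand-ins for the cache's per-watch state.
type response struct {
	version string
}

type watch struct {
	responses chan response // bounded channel, like values.responses (capacity 5)
}

type snapshotCache struct {
	mu      sync.RWMutex
	watches map[int64]watch
}

// SetSnapshot computes a response for every open watch while holding the
// lock, then releases the lock before sending on the (possibly full)
// response channels. A blocked send no longer holds up other callers that
// need cache.mu, such as the cancel path.
func (c *snapshotCache) SetSnapshot(version string) {
	type pending struct {
		w    watch
		resp response
	}

	c.mu.Lock()
	out := make([]pending, 0, len(c.watches))
	for _, w := range c.watches {
		out = append(out, pending{w: w, resp: response{version: version}})
	}
	c.mu.Unlock()

	// Sends happen outside the critical section. If a channel is full we
	// still block here, but without holding cache.mu.
	for _, p := range out {
		p.w.responses <- p.resp
	}
}
```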
Since canceling a watch only modifies the corresponding `statusInfo`, we don't need to hold a write lock on the snapshot cache. Holding a read lock is sufficient to ensure that the status entry is stable. This updates envoyproxy#503. Signed-off-by: James Peach <[email protected]>
Holding a read lock is sufficient to ensure that the status entry is stable when cancelling watches. This updates #503. Signed-off-by: James Peach <[email protected]>
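For context, the change described in those commits amounts to taking the cache mutex in read mode inside the cancel function, since cancellation only mutates the per-node status entry. A rough sketch, using assumed simplified types rather than the real library internals:

```go
package cache

import "sync"

// statusInfo mirrors the idea of per-node watch bookkeeping with its own lock.
type statusInfo struct {
	mu      sync.Mutex
	watches map[int64]chan struct{} // watch ID -> (simplified) response channel
}

type snapshotCache struct {
	mu     sync.RWMutex
	status map[string]*statusInfo
}

// cancelWatch returns the cancel function handed back when a watch is
// created. Holding cache.mu as a read lock keeps the status map stable;
// the actual mutation happens under the statusInfo's own mutex, so no
// write lock on the cache is required.
func (c *snapshotCache) cancelWatch(nodeID string, watchID int64) func() {
	return func() {
		c.mu.RLock()
		defer c.mu.RUnlock()
		if info, ok := c.status[nodeID]; ok {
			info.mu.Lock()
			delete(info.watches, watchID)
			info.mu.Unlock()
		}
	}
}
```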
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
not stale
Hi @jpeach, I just ran into the same problem. I also wonder whether overriding each …
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
The problem occurs in our testing after the update to the latest go-control-plane version, but I believe it might happen in real-world use cases.
The stack trace for the two goroutines looks like this:
While the first goroutine tries to call `SetSnapshot` and update all watchers (link), the server goroutine receives a `DiscoveryRequest` and tries to call `cancel`. Both `SetSnapshot` and `cancel` call `cache.mu.Lock()`.
The first goroutine, inside `SetSnapshot`, can't update the watchers because the `values.responses` channel is full (it has capacity 5), so it blocks until the server goroutine reads something from this channel. But the server goroutine can't read anything from `values.responses` because it is inside `cancel`, waiting for the `cache.mu` lock to be released.
cc: @jpeach
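The same deadlock shape can be reproduced in isolation. The program below is only an illustration of the interaction described above (a mutex held across a send on a full bounded channel, while the goroutine that would drain the channel needs that same mutex), not the actual go-control-plane code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	responses := make(chan string, 5) // bounded, like values.responses

	// Pre-fill the channel so the next send blocks, as when several
	// responses are already queued for a slow client.
	for i := 0; i < 5; i++ {
		responses <- "queued"
	}

	// Goroutine A plays the role of SetSnapshot: it takes the lock, then
	// blocks forever on the send because nobody is draining the channel.
	go func() {
		mu.Lock()
		defer mu.Unlock()
		responses <- "new snapshot"
	}()

	// Goroutine B plays the role of the sotw server handling a
	// DiscoveryRequest: it must acquire the lock (to cancel the old
	// watch) before it ever gets back to reading the response channel.
	go func() {
		time.Sleep(10 * time.Millisecond) // let A take the lock first
		mu.Lock()                         // blocks forever: A never unlocks
		fmt.Println(<-responses)
		mu.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are still blocked: deadlock")
}
```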