
Event loop got stuck when requests timed out #1088

Open
leplatrem opened this issue Sep 22, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@leplatrem
Contributor

From Slack

We apparently had an issue with the name service in the Kubernetes cluster for Telescope. The name service returned stale IP addresses for settings.prod.mozaws.net inside the Telescope pod, which somehow caused the whole service to stall and not respond to requests anymore. We could fix it after a lot of debugging by getting the name service fixed, but Telescope shouldn't lock up just because it times out on a URL. It kind of looks like it was doing blocking calls to establish connections, eventually stalling the event loop because all threads were blocking.

Here is one of the errors we saw in Sentry: https://sentry.prod.mozaws.net/operations/poucave-prod/issues/18284097/

The traceback doesn't really look like the code should be blocking, but Telescope was completely unreachable, so something must have been blocking.
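
If that hypothesis is right, the stall would look something like the minimal sketch below (not Telescope code, just an illustration of the failure mode): a synchronous socket.getaddrinfo call made from inside a coroutine freezes every other task on the loop for as long as the name service takes to answer.

import asyncio
import socket
import time

async def heartbeat():
    # An unrelated task that should tick every 0.5s while the loop is healthy.
    while True:
        print(f"tick {time.monotonic():.1f}")
        await asyncio.sleep(0.5)

async def blocking_resolve(host):
    # Calling getaddrinfo synchronously inside a coroutine stalls the whole
    # loop until the name service answers (or the OS resolver gives up).
    return socket.getaddrinfo(host, 80)

async def main():
    asyncio.create_task(heartbeat())
    # With a slow or stale name service, the ticks above stop for the full
    # duration of this call, even though the caller looks asynchronous.
    await blocking_resolve("settings.prod.mozaws.net")
    await asyncio.sleep(2)

asyncio.run(main())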

@leplatrem leplatrem added the bug Something isn't working label Sep 22, 2022
@leplatrem
Contributor Author

I tried to simulate a slow server with:

docker run --publish 0.0.0.0:8888:8888 alpine nc -lk -p 8888 -e sleep 60s

(source)

Using:

[checks.test.hb]
description = ""
module = "checks.core.heartbeat"
params.url = "http://localhost:8888"

I obtain consistent timeout exceptions, which are retried via backoff.
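
For reference, the failing path can be reproduced outside Telescope with a few lines of aiohttp plus the backoff library. This is a hedged sketch: the function name, timeout value, and retry policy are illustrative, not the actual checks.core.heartbeat code.

import asyncio
import aiohttp
import backoff

@backoff.on_exception(backoff.expo, (asyncio.TimeoutError, aiohttp.ClientError), max_tries=4)
async def fetch_heartbeat(url: str) -> int:
    # Short total timeout so the netcat server above triggers the retry path.
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as resp:
            return resp.status

async def main():
    print(await fetch_heartbeat("http://localhost:8888"))

asyncio.run(main())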

The next step would be to identify whether the event loop is actually blocked during the connection, or whether the issue comes from the single thread executor being overloaded when many calls are timing out.
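
One way to tell those two apart (a debugging sketch, not something in the codebase): run a small watchdog task that measures how late the loop wakes it up while the timeouts are happening. Large lag would point at blocking calls on the loop itself; near-zero lag would point at the thread executor being saturated instead. asyncio's debug mode (asyncio.run(..., debug=True)) can also log callbacks that exceed loop.slow_callback_duration.

import asyncio
import time

async def loop_lag_watchdog(interval: float = 0.5):
    # Reports whenever the loop wakes us up noticeably later than requested,
    # i.e. whenever something blocked the loop between two iterations.
    while True:
        before = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - before - interval
        if lag > 0.1:
            print(f"event loop lag: {lag:.3f}s")

async def main():
    asyncio.create_task(loop_lag_watchdog())
    # ... run the checks here while watching the lag output ...
    await asyncio.sleep(30)

asyncio.run(main())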
