
Event loop got stuck when requests timed out #1088

Open
leplatrem opened this issue Sep 22, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@leplatrem
Contributor

From Slack

We apparently had an issue with the name service in the Kubernetes cluster for Telescope. The name service returned stale IP addresses for settings.prod.mozaws.net inside the Telescope pod, which somehow caused the whole service to stall and not respond to requests anymore. We could fix it after a lot of debugging by getting the name service fixed, but Telescope shouldn't lock up just because it times out on a URL. It kind of looks like it was doing blocking calls to establish connections, eventually stalling the event loop because all threads were blocking.

Here is one of the errors we saw in Sentry: https://sentry.prod.mozaws.net/operations/poucave-prod/issues/18284097/

The traceback doesn't really look like the code should be blocking, but Telescope was completely unreachable, so something must have been blocking.
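
If that hypothesis is right, the stall would look something like the minimal sketch below (not Telescope code, just an illustration of the failure mode): a synchronous socket.getaddrinfo call made from inside a coroutine freezes every other task on the loop for as long as the name service takes to answer.

import asyncio
import socket
import time

async def heartbeat():
    # An unrelated task that should tick every 0.5s while the loop is healthy.
    while True:
        print(f"tick {time.monotonic():.1f}")
        await asyncio.sleep(0.5)

async def blocking_resolve(host):
    # Calling getaddrinfo synchronously inside a coroutine stalls the whole
    # loop until the name service answers (or the OS resolver gives up).
    return socket.getaddrinfo(host, 80)

async def main():
    asyncio.create_task(heartbeat())
    # With a slow or stale name service, the ticks above stop for the full
    # duration of this call, even though the caller looks asynchronous.
    await blocking_resolve("settings.prod.mozaws.net")
    await asyncio.sleep(2)

asyncio.run(main())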

@leplatrem leplatrem added the bug Something isn't working label Sep 22, 2022
@leplatrem
Contributor Author

I tried to simulate a slow server with:

docker run --publish 0.0.0.0:8888:8888 alpine nc -lk -p 8888 -e sleep 60s

(source)

Using:

[checks.test.hb]
description = ""
module = "checks.core.heartbeat"
params.url = "http://localhost:8888"

I obtain consistent timeout exceptions, which are retried via backoff.
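
For reference, the failing path can be reproduced outside Telescope with a few lines of aiohttp plus the backoff library. This is a hedged sketch: the function name, timeout value, and retry policy are illustrative, not the actual checks.core.heartbeat code.

import asyncio
import aiohttp
import backoff

@backoff.on_exception(backoff.expo, (asyncio.TimeoutError, aiohttp.ClientError), max_tries=4)
async def fetch_heartbeat(url: str) -> int:
    # Short total timeout so the netcat server above triggers the retry path.
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as resp:
            return resp.status

async def main():
    print(await fetch_heartbeat("http://localhost:8888"))

asyncio.run(main())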

The next step would be to identify whether the event loop is actually blocked during the connection, or whether the issue comes from the single thread executor being overloaded when many calls are timing out.
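
One way to tell those two apart (a debugging sketch, not something in the codebase): run a small watchdog task that measures how late the loop wakes it up while the timeouts are happening. Large lag would point at blocking calls on the loop itself; near-zero lag would point at the thread executor being saturated instead. asyncio's debug mode (asyncio.run(..., debug=True)) can also log callbacks that exceed loop.slow_callback_duration.

import asyncio
import time

async def loop_lag_watchdog(interval: float = 0.5):
    # Reports whenever the loop wakes us up noticeably later than requested,
    # i.e. whenever something blocked the loop between two iterations.
    while True:
        before = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - before - interval
        if lag > 0.1:
            print(f"event loop lag: {lag:.3f}s")

async def main():
    asyncio.create_task(loop_lag_watchdog())
    # ... run the checks here while watching the lag output ...
    await asyncio.sleep(30)

asyncio.run(main())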
