
time out long-running requests more aggressively #10833

Open · wants to merge 1 commit into master

Conversation

chris48s (Member) commented:

We set a hard limit on how long we'll attempt to serve a request before we give up. At the moment this is set to 20 seconds in production. I think we should make this timeout shorter. In this PR, I propose: if we've held a connection open for 8 seconds trying to serve a request and we don't have a badge yet, it's time to serve a 408 and move on.
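For illustration, here is a minimal Express-style middleware sketch of that idea: a hard per-request deadline that answers 408 if no response has started in time. The 8-second constant comes from the proposal; the middleware, names, and wiring are hypothetical and not Shields' actual implementation.

```typescript
import express, { Request, Response, NextFunction } from "express";

// Hypothetical constant mirroring the proposed limit.
const REQUEST_TIMEOUT_MS = 8_000;

// Arm a hard deadline for every request: if no response has started
// by the time it fires, reply with 408 Request Timeout.
function hardTimeout(req: Request, res: Response, next: NextFunction): void {
  const timer = setTimeout(() => {
    if (!res.headersSent) {
      res.status(408).send("Request Timeout");
    }
  }, REQUEST_TIMEOUT_MS);

  // Clear the timer once the response has gone out (or the client
  // disconnected) so it can't fire after the fact.
  res.on("finish", () => clearTimeout(timer));
  res.on("close", () => clearTimeout(timer));
  next();
}

const app = express();
app.use(hardTimeout);
```

The `headersSent` check matters: once a badge has started streaming, it's too late to swap in a 408, so the timer simply does nothing.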

@chris48s added the "operations" label (Hosting, monitoring, and reliability for the production badge servers) on Jan 23, 2025.

jNullj (Member) commented on Jan 24, 2025:

I think 20 is too much, but is there a particular reason to change this from 20 to 8?

chris48s (Member, Author) commented:

The most common place where shields badges are viewed is on GitHub. All images on GitHub are served via an instance of camo (GitHub's image proxy). Camo will only wait 4 seconds for a response from an upstream (like shields.io) before returning an error, so any badge that takes more than 4 seconds to return a response won't display for most users.

We should set our hard limit a bit longer than that: partly because Shields badges are sometimes viewed in other contexts, and partly because there's sometimes value in letting a slightly long-running request run to completion so future requests can be served from Cloudflare.

In normal circumstances, we aim for all our badges to render in under 4 seconds and the vast majority do.

One of the failure modes I'm trying to prevent here is where a service that represents a large chunk of our traffic (e.g. NPM, PyPI) has a performance problem and we end up with loads of open connections tied up waiting on requests to their API, which causes a full service outage for us (this has actually happened before).
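As a sketch of how bounding the wait on upstream calls limits that pile-up: cancel the upstream request once the deadline passes so the connection is freed instead of sitting open. This is hypothetical code (Node 18+ global `fetch`), not how Shields actually performs upstream requests, and the function name and URL parameter are illustrative only.

```typescript
// Bound how long we wait on a slow service API (e.g. npm's or PyPI's)
// so a stuck upstream can't tie up open connections on our side.
async function fetchUpstream(url: string, timeoutMs = 8_000): Promise<Response> {
  const controller = new AbortController();
  // Abort the in-flight request when the budget is exhausted.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```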

8 is a bit of an arbitrary choice. I'd be happy to set this to any number greater than 5 and less than or equal to 10 as a next step and see how it goes.
