Skip to content

Double-check runner status when it is free with different API using GHA API runner id #6564

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jeanschmidt
Copy link
Contributor

@jeanschmidt jeanschmidt commented Apr 24, 2025

We recently had a CI:SEV in our infra due to what we suspect to be outdated information present in gha api.

During the discussion we evaluated the very small risk of having slightly outdated information (< 50 seconds) for the status of some runners and how this could potentially cause in rare edge cases the termination of busy workers. This was not the cause of the issue we experienced, but the edge case bug exists and during the discussion we concluded that this change could potentially be another thing that might prevent similar issues in the future.

This change introduces the following behaviour change for scaleDown:

Before terminate a runner, the ones that are free are double-checked by performing a GHA API request by runner id. This is ignored in case we're running low in GHA API quotas. If we can't perform the request and double-check, we assume the runner to be busy.

A quick analysis of the numbers concludes that we're probably OK if we use up to 75% of the quota for this check (what would be very unlikely to happen). We decided to play safe and consider a 60% margin just in case.

Copy link

vercel bot commented Apr 24, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Updated (UTC)
torchci ⬜️ Ignored (Inspect) Visit Preview Apr 24, 2025 2:29pm

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 24, 2025
Copy link
Collaborator

@zxiiro zxiiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool to see that this API has a new use case!

Comment on lines +339 to +340
` it should not be busy, but the flag dontTrustIdleFromList is set to true, so we will not trust it ` +
`and try to grab it directly if we have GHA quotas for it.`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the intention clearer...

Suggested change
` it should not be busy, but the flag dontTrustIdleFromList is set to true, so we will not trust it ` +
`and try to grab it directly if we have GHA quotas for it.`,
`The cached GitHub list runners api said it not busy, but the flag dontTrustIdleFromList is set to true, so we will get a fresh ` +
`status from GitHub in case the value has become stale`

export async function getGHRunnerOrg(
ec2runner: RunnerInfo,
metrics: ScaleDownMetrics,
dontTrustIdleFromList = true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a friendlier name :)

Suggested change
dontTrustIdleFromList = true,
verifyCachedIdleStatus = true,

ghRunner = await getRunnerOrg(ec2runner.org as string, ec2runner.ghRunnerId, metrics);
const ghLimitInfo = await getGitHubRateLimit({ owner: org, repo: '' }, metrics);
metrics.gitHubRateLimitStats(ghLimitInfo.limit, ghLimitInfo.remaining, ghLimitInfo.used);
if (ghLimitInfo.remaining > ghLimitInfo.limit * 0.4) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you please extract the 0.4 into a constant so that it's not a magic number in the code?

} else {
console.warn(
`Runner '${ec2runner.instanceId}' [${ec2runner.runnerType}](${org}) - We DON'T have enough GHA API quotas` +
` to call the API and double-check runner status by grabbing it directly. Assuming it is busy. Remaning: ` +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking it would be better to have the fall back behavior be to trust the cached data. Otherwise we risk having our costs go through the roof if we get rate limited by github.

If we fall back to trusting the cache, we would basically be replicating today's behavior in the case of a rate limit shortage, which is fine since normally the github api is supposed to have our back here and protect against erroneous runner shutdowns anyways.

if (ghRunner === undefined) {
if (ec2runner.ghRunnerId === undefined) {
console.warn(
`Runner '${ec2runner.instanceId}' [${ec2runner.runnerType}](${org}) was neither found in ` +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding these!

` to call the API and double-check runner status by grabbing it directly. ` +
`Remaning: ${ghLimitInfo.remaining} / Limit: ${ghLimitInfo.limit} / Used: ${ghLimitInfo.used}`,
);
safeToCallGHApi = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider renaming to something like "verifyCachedIdleStatus" given the behavior change I'm suggesting

const ghRunnerDirect = await getRunnerOrg(org, ec2runner.ghRunnerId, metrics);
if (ghRunnerDirect !== undefined) {
ghRunner = ghRunnerDirect;
console.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not debug?

Copy link
Contributor

@ZainRizvi ZainRizvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level, given that based on the current status of the investigation this change may not have affected the sev (since github itself seems to have been returning invalid data), I'm thinking let's hold off on this PR for now.

After we hear back from GH and get better clarity on the root cause, we may find a better path forward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants