Double-check runner status when it is free with different API using GHA API runner id #6564

Open

jeanschmidt wants to merge 2 commits into main from jeanschmidt/double_check_terminate_runner

Contributor

jeanschmidt commented Apr 24, 2025 •

We recently had a CI:SEV in our infra that we suspect was caused by outdated information in the GHA API.

During the discussion we evaluated the very small risk that slightly outdated runner status information (< 50 seconds old) could, in rare edge cases, cause busy workers to be terminated. This was not the cause of the issue we experienced, but the edge-case bug exists, and we concluded that this change might help prevent similar issues in the future.

This change introduces the following behaviour change for scaleDown:

Before terminating a runner, we double-check the ones reported as free by performing a GHA API request by runner id. The double-check is skipped when we're running low on GHA API quota. If we can't perform the request and double-check, we assume the runner is busy.

A quick analysis of the numbers suggests that we're probably OK if we use up to 75% of the quota for this check (which would be very unlikely to happen). We decided to play it safe and use a 60% margin just in case.
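The decision described above can be sketched as a small pure function. This is only an illustration under assumptions: the `DoubleCheck` type and `canTerminate` are hypothetical names that do not appear in the PR, and the sketch treats "low quota" and "request failed" identically, as the description suggests.

```typescript
// Hypothetical model of the double-check result; these names are illustrative
// and are not taken from the PR.
type DoubleCheck =
  | { kind: 'fresh'; busy: boolean } // the by-id request succeeded
  | { kind: 'unavailable' }; // low quota, or the request could not be performed

// Returns true only when the runner list reports the runner as idle AND a
// fresh by-id lookup confirms it. Any uncertainty is resolved as "busy".
function canTerminate(listedBusy: boolean, check: DoubleCheck): boolean {
  if (listedBusy) {
    return false; // the list already says busy; never terminate
  }
  if (check.kind === 'unavailable') {
    return false; // could not double-check, so assume the runner is busy
  }
  return !check.busy; // trust the fresh status
}
```

With this shape, a runner is terminated only in the single case where both sources agree it is idle.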


Double-check runner sattus when it is free with different API using GHA API runner id

8e8dbb1

facebook-github-bot added the CLA Signed label


Tests and lint

9b9f2fa

zxiiro approved these changes

Collaborator

zxiiro left a comment

Cool to see that this API has a new use case!

ZainRizvi requested changes

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-down.ts

Comment on lines +339 to +340

		` it should not be busy, but the flag dontTrustIdleFromList is set to true, so we will not trust it ` +
		`and try to grab it directly if we have GHA quotas for it.`,

Contributor

ZainRizvi Apr 24, 2025

Making the intention clearer...

Suggested change

      
-                    ` it should not be busy, but the flag dontTrustIdleFromList is set to true, so we will not trust it ` +
-                    `and try to grab it directly if we have GHA quotas for it.`,
+                    `The cached GitHub list runners api said it is not busy, but the flag dontTrustIdleFromList is set to true, so we will get a fresh ` +
+                    `status from GitHub in case the value has become stale`

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-down.ts

+              export async function getGHRunnerOrg(
+                ec2runner: RunnerInfo,
+                metrics: ScaleDownMetrics,
+                dontTrustIdleFromList = true,

Contributor

ZainRizvi Apr 24, 2025

Perhaps a friendlier name :)

Suggested change

      
-              dontTrustIdleFromList = true,
+              verifyCachedIdleStatus = true,

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-down.ts

-                    ghRunner = await getRunnerOrg(ec2runner.org as string, ec2runner.ghRunnerId, metrics);
+                    const ghLimitInfo = await getGitHubRateLimit({ owner: org, repo: '' }, metrics);
+                    metrics.gitHubRateLimitStats(ghLimitInfo.limit, ghLimitInfo.remaining, ghLimitInfo.used);
+                    if (ghLimitInfo.remaining > ghLimitInfo.limit * 0.4) {

Contributor

ZainRizvi Apr 24, 2025

nit: can you please extract the 0.4 into a constant so that it's not a magic number in the code?
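One possible shape for this, as a sketch only: the constant name `MIN_REMAINING_QUOTA_FRACTION`, the `RateLimitInfo` interface, and the helper `hasQuotaForDoubleCheck` are hypothetical and not part of the PR. The comparison itself mirrors the `remaining > limit * 0.4` check in the diff.

```typescript
// Hypothetical shape mirroring the limit/remaining/used fields used in the diff.
interface RateLimitInfo {
  limit: number;
  remaining: number;
  used: number;
}

// Assumed name. Fraction of the GHA API quota that must still be available
// before spending requests on the double-check; 0.4 remaining corresponds to
// the 60% usage margin from the PR description.
const MIN_REMAINING_QUOTA_FRACTION = 0.4;

// True when enough quota remains to afford a by-id runner lookup.
function hasQuotaForDoubleCheck(
  info: RateLimitInfo,
  minRemainingFraction: number = MIN_REMAINING_QUOTA_FRACTION,
): boolean {
  return info.remaining > info.limit * minRemainingFraction;
}
```

Keeping the threshold as a default parameter also makes it easy to exercise other margins in tests without touching the constant.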

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-down.ts

+                    } else {
+                      console.warn(
+                        `Runner '${ec2runner.instanceId}' [${ec2runner.runnerType}](${org}) - We DON'T have enough GHA API quotas` +
+                          ` to call the API and double-check runner status by grabbing it directly. Assuming it is busy. Remaning: ` +

Contributor

ZainRizvi Apr 24, 2025

I'm thinking it would be better for the fallback behavior to be trusting the cached data. Otherwise we risk having our costs go through the roof if we get rate limited by GitHub.

If we fall back to trusting the cache, we would basically be replicating today's behavior in the case of a rate limit shortage, which is fine since normally the GitHub API is supposed to have our back here and protect against erroneous runner shutdowns anyway.
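To make the trade-off concrete, here is a minimal sketch contrasting the two fallback policies. The `Fallback` type and `resolveBusy` function are hypothetical and exist only for illustration. `freshBusy` is `undefined` when the by-id lookup was skipped or failed.

```typescript
// Hypothetical policy names: 'assume-busy' is the PR's current behavior,
// 'trust-cache' is the alternative suggested in this comment.
type Fallback = 'assume-busy' | 'trust-cache';

// Decides whether a runner should be treated as busy.
// cachedBusy: status from the (possibly stale) list runners response.
// freshBusy: status from the by-id lookup, or undefined if it was unavailable.
function resolveBusy(
  cachedBusy: boolean,
  freshBusy: boolean | undefined,
  fallback: Fallback,
): boolean {
  if (freshBusy !== undefined) {
    return freshBusy; // a fresh answer always wins
  }
  // No fresh answer: either keep the runner alive (costly under sustained
  // rate limiting) or replicate the pre-PR behavior and trust the list.
  return fallback === 'trust-cache' ? cachedBusy : true;
}
```

Under `'trust-cache'`, an idle runner is still scaled down during a rate limit shortage, whereas `'assume-busy'` keeps it running until quota recovers.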

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-down.ts

+                if (ghRunner === undefined) {
+                  if (ec2runner.ghRunnerId === undefined) {
+                    console.warn(
+                      `Runner '${ec2runner.instanceId}' [${ec2runner.runnerType}](${org}) was neither found in ` +

Contributor

ZainRizvi Apr 24, 2025

thanks for adding these!

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-down.ts

+                          ` to call the API and double-check runner status by grabbing it directly. ` +
+                          `Remaning: ${ghLimitInfo.remaining} / Limit: ${ghLimitInfo.limit} / Used: ${ghLimitInfo.used}`,
+                      );
+                      safeToCallGHApi = true;

Contributor

ZainRizvi Apr 24, 2025

nit: consider renaming to something like "verifyCachedIdleStatus" given the behavior change I'm suggesting

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-down.ts

+                      const ghRunnerDirect = await getRunnerOrg(org, ec2runner.ghRunnerId, metrics);
+                      if (ghRunnerDirect !== undefined) {
+                        ghRunner = ghRunnerDirect;
+                        console.warn(

Contributor

ZainRizvi Apr 24, 2025

why not debug?

ZainRizvi reviewed

Contributor

ZainRizvi left a comment

At a high level: based on the current status of the investigation, this change may not have affected the SEV (since GitHub itself seems to have been returning invalid data), so I'm thinking we should hold off on this PR for now.

After we hear back from GH and get better clarity on the root cause, we may find a better path forward.
