Double-check runner status when it is free with different API using GHA API runner id #6564
base: main
Conversation
Cool to see that this API has a new use case!
` it should not be busy, but the flag dontTrustIdleFromList is set to true, so we will not trust it ` +
  `and try to grab it directly if we have GHA quotas for it.`,
Making the intention clearer...
Suggested change:
- ` it should not be busy, but the flag dontTrustIdleFromList is set to true, so we will not trust it ` +
-   `and try to grab it directly if we have GHA quotas for it.`,
+ `The cached GitHub list runners api said it is not busy, but the flag dontTrustIdleFromList is set to true, so we will get a fresh ` +
+   `status from GitHub in case the value has become stale`,
export async function getGHRunnerOrg(
  ec2runner: RunnerInfo,
  metrics: ScaleDownMetrics,
  dontTrustIdleFromList = true,
Perhaps a friendlier name :)
Suggested change:
- dontTrustIdleFromList = true,
+ verifyCachedIdleStatus = true,
ghRunner = await getRunnerOrg(ec2runner.org as string, ec2runner.ghRunnerId, metrics);
const ghLimitInfo = await getGitHubRateLimit({ owner: org, repo: '' }, metrics);
metrics.gitHubRateLimitStats(ghLimitInfo.limit, ghLimitInfo.remaining, ghLimitInfo.used);
if (ghLimitInfo.remaining > ghLimitInfo.limit * 0.4) {
nit: can you please extract the 0.4 into a constant so that it's not a magic number in the code?
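For illustration, a minimal sketch of that extraction; the constant name and its placement are assumptions, not the PR's actual code:

```ts
// Hypothetical constant name: the fraction of the hourly GHA API quota that
// must still be available before we spend extra calls double-checking an
// apparently idle runner.
const GH_QUOTA_FRACTION_FOR_DOUBLE_CHECK = 0.4;

// Usage, mirroring the condition in the diff above:
// if (ghLimitInfo.remaining > ghLimitInfo.limit * GH_QUOTA_FRACTION_FOR_DOUBLE_CHECK) { ... }
```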
} else {
  console.warn(
    `Runner '${ec2runner.instanceId}' [${ec2runner.runnerType}](${org}) - We DON'T have enough GHA API quotas` +
      ` to call the API and double-check runner status by grabbing it directly. Assuming it is busy. Remaining: ` +
I'm thinking it would be better to have the fallback behavior be to trust the cached data. Otherwise we risk having our costs go through the roof if we get rate limited by GitHub.
If we fall back to trusting the cache, we would basically be replicating today's behavior in the case of a rate-limit shortage, which is fine since normally the GitHub API is supposed to have our back here and protect against erroneous runner shutdowns anyway.
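For clarity, here is a small sketch of the fallback being suggested; the type and helper names are illustrative, not the PR's actual identifiers:

```ts
interface RunnerStatus {
  busy: boolean;
}

// Sketch: when there is not enough GHA API quota for a per-runner double-check,
// fall back to trusting the cached list-runners result (today's behavior)
// instead of assuming the runner is busy.
async function resolveRunnerStatus(
  cached: RunnerStatus,
  haveQuotaForDoubleCheck: boolean,
  fetchFreshStatus: () => Promise<RunnerStatus | undefined>,
): Promise<RunnerStatus> {
  if (!haveQuotaForDoubleCheck) {
    // Rate-limit shortage: trust the cache rather than keeping every runner
    // alive, which could drive costs up if GitHub rate-limits us.
    return cached;
  }
  const fresh = await fetchFreshStatus();
  return fresh ?? cached; // fall back to the cache if the direct lookup fails
}
```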
if (ghRunner === undefined) {
  if (ec2runner.ghRunnerId === undefined) {
    console.warn(
      `Runner '${ec2runner.instanceId}' [${ec2runner.runnerType}](${org}) was neither found in ` +
thanks for adding these!
    ` to call the API and double-check runner status by grabbing it directly. ` +
    `Remaining: ${ghLimitInfo.remaining} / Limit: ${ghLimitInfo.limit} / Used: ${ghLimitInfo.used}`,
);
safeToCallGHApi = true;
nit: consider renaming to something like "verifyCachedIdleStatus" given the behavior change I'm suggesting
const ghRunnerDirect = await getRunnerOrg(org, ec2runner.ghRunnerId, metrics);
if (ghRunnerDirect !== undefined) {
  ghRunner = ghRunnerDirect;
  console.warn(
why not debug?
At a high level, given that the current status of the investigation suggests this change may not have affected the sev (since GitHub itself seems to have been returning invalid data), I'm thinking we should hold off on this PR for now.
After we hear back from GH and get better clarity on the root cause, we may find a better path forward
We recently had a CI:SEV in our infra due to what we suspect to be outdated information in the GHA API.
During the discussion we evaluated the very small risk of having slightly outdated information (< 50 seconds) for the status of some runners, and how in rare edge cases this could potentially cause the termination of busy workers. This was not the cause of the issue we experienced, but the edge-case bug exists, and we concluded that this change could be another safeguard against similar issues in the future.
This change introduces the following behaviour change for scaleDown:
Before terminating a runner, the ones that appear free are double-checked by performing a GHA API request by runner id. This check is skipped if we're running low on GHA API quotas: if we can't perform the request to double-check, we assume the runner is busy.
A quick analysis of the numbers suggests that we're probably OK even if we use up to 75% of the quota for this check (which would be very unlikely to happen). We decided to play it safe and consider a 60% margin just in case.
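To make the described flow concrete, here is a self-contained sketch of the behavior change; the identifiers and helper signatures are illustrative, not the actual scaleDown code:

```ts
interface GhRunner {
  id: number;
  busy: boolean;
}

interface RateLimit {
  limit: number;
  remaining: number;
}

// Keep at least 40% of the hourly quota in reserve, i.e. spend at most ~60%
// of it on these per-runner double-checks.
const QUOTA_FLOOR_FRACTION = 0.4;

// Before terminating an apparently free runner, re-check its status with a
// direct GHA API lookup by runner id, but only when enough quota remains;
// otherwise assume it is busy and keep it.
async function shouldTerminate(
  cachedRunner: GhRunner, // status from the cached list-runners call (may be ~50s stale)
  getRateLimit: () => Promise<RateLimit>,
  getRunnerById: (id: number) => Promise<GhRunner | undefined>,
): Promise<boolean> {
  if (cachedRunner.busy) return false; // never terminate a runner reported busy

  const rate = await getRateLimit();
  if (rate.remaining <= rate.limit * QUOTA_FLOOR_FRACTION) {
    // Running low on GHA API quota: skip the double-check and assume busy.
    return false;
  }

  // Double-check with a fresh, per-runner lookup before terminating.
  const fresh = await getRunnerById(cachedRunner.id);
  return fresh !== undefined && !fresh.busy;
}
```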