Skip to content

[Queue Time Histogram] Add Job Queue Time Lambda #6435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 38 commits into from
Mar 19, 2025
Merged

Conversation

yangw-dev
Copy link
Contributor

@yangw-dev yangw-dev commented Mar 18, 2025

Description

Add aws lambda to generate in-queue job histgram, steps:

  1. add logic to generate snapshot of in-queue jobs [This PR]
  2. add logic to generate histogram
  3. suppport multi-threading for backfilling

the snapshot data we generate includes:
{ queue_s, repo, workflow_name , job_name, htm_url, machine_type, time, runner_labels}

the runner_labels includes the machine_type, and other categories such as linux, dynamic etc

Design Doc:doc

working result in s3 (Ran locally)

s3 link

Copy link

vercel bot commented Mar 18, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Updated (UTC)
torchci ⬜️ Ignored (Inspect) Visit Preview Mar 19, 2025 11:11pm

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 18, 2025
@yangw-dev yangw-dev changed the title add time [queue time] Add Job Queue Time Lambda Mar 18, 2025
@yangw-dev yangw-dev marked this pull request as ready for review March 18, 2025 04:11
Copy link
Contributor

@jeanschmidt jeanschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you able to also create some sort of breakdowns/aggregations?

Like Ephemeral vs NonEphemeal, meta-owned/non-meta-owned, pet vs dynamic, Linux/Mac/Windows, etc?

@yangw-dev yangw-dev requested a review from clee2000 March 18, 2025 20:54
@yangw-dev
Copy link
Contributor Author

yangw-dev commented Mar 18, 2025

Are you able to also create some sort of breakdowns/aggregations?

Like Ephemeral vs NonEphemeal, meta-owned/non-meta-owned, pet vs dynamic, Linux/Mac/Windows, etc?

added, I copied some of the File handling code from ci-pct.py

@yangw-dev
Copy link
Contributor Author

yangw-dev commented Mar 19, 2025

Are you able to also create some sort of breakdowns/aggregations?

Like Ephemeral vs NonEphemeal, meta-owned/non-meta-owned, pet vs dynamic, Linux/Mac/Windows, etc?

added, it will be stored in array field as runner_labels

arguments = parse_args()

# update environment variables for input parameters
os.environ["CLICKHOUSE_ENDPOINT"] = arguments.clickhouse_endpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a not so great engineering standard to hide the original value by overwriting it with cli arguments.

It makes more sense to default the CLI argparse from the environment os.environ. But you can do with a special Config class if you want to as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk, I can make it different!

runner_labels["other"].add(machine_type)


def create_runner_labels(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this piece of code be reused? So we can avoid having to update in multiple places?

Just today Nikita is planning to move macos-m2-15 from apple ownership to ours. So we'll need to update the metrics and aggregations...

If we create many places to update we're setting ourselves to make mistakes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lambda is a bit tricky to do it, i think I can do it in a BE pr, right now my focus is get this kick in

Copy link
Contributor

@jeanschmidt jeanschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that before we move forward with more details, we need to fix a few design details. Making sure this script is idempotent is critical for any processing pipeline. Please reach out to discuss if you want to :)

Copy link
Contributor

@jeanschmidt jeanschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving now, as we agreed to work on the python script in a next iteration so to avoid a too big of a PR

@yangw-dev
Copy link
Contributor Author

fixed some nits, moving to next pr

@yangw-dev yangw-dev merged commit 63bbd55 into main Mar 19, 2025
7 checks passed
@yangw-dev yangw-dev deleted the addqueuelambda branch March 19, 2025 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants