Skip to content

Git Job Error Handling #6517

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 10, 2025
Merged

Git Job Error Handling #6517

merged 8 commits into from
Apr 10, 2025

Conversation

yangw-dev
Copy link
Contributor

@yangw-dev yangw-dev commented Apr 8, 2025

Overview

Add logic to hand Git Job Handle:
the handling only triggers when a configuration for a repo exists

  1. find rows that represents the git job level failure
  2. based on config, detect all related rows, and add Failure metric
  3. skip failure job rows
    The ui renders it as normal failure handling

Demo

the demo pr: #6516
mimic GIt_Job Failure in with
model: "edsr", backend: "qnn_q8", mode: "inference", device: "samsung_galaxy_s22"

demo link with fake data in vercel

UI snapshot

image

Others [BE]

  • wrap group logics for lcommit and rcommit into a method
  • wrap toTableRow method into a method

Copy link

vercel bot commented Apr 8, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
torchci ✅ Ready (Inspect) Visit Preview Apr 9, 2025 10:49pm

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 8, 2025
@yangw-dev yangw-dev changed the base branch from main to adFailure April 8, 2025 22:47
@yangw-dev yangw-dev changed the base branch from adFailure to main April 8, 2025 22:57
@yangw-dev yangw-dev changed the base branch from main to adFailure April 8, 2025 22:59
@yangw-dev yangw-dev marked this pull request as ready for review April 8, 2025 23:53
@yangw-dev yangw-dev requested a review from guangy10 April 8, 2025 23:53
},
highlight: hasL && hasR,
};
// TODO (huydhn): Fix the passing of tensor_parallel_size to the benchmark
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I haven't removed this yet. Let me clean this up

export function computeGeomean(data: LLMsBenchmarkData[], metricName: string) {
const metricValues: { [key: string]: number[] } = {};
const returnedGeomean: LLMsBenchmarkData[] = [];
const processJobLevelFailureRows = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this processJobLevelFailureRows could be done in https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/oss_ci_benchmark_llms/query.sql by returning 2 new columns failure_type and failure_report? If it's doable without blowing up the complexity of the SQL query, it seems like an easier way to implement this change as SQL syntax is more concise

Copy link
Contributor Author

@yangw-dev yangw-dev Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, actually i modified the query to include failure_type. added
But this is still needed, this basically maps the GIT_JOB failure row into multiple rows

if (!val) {
return false;
}
if (jobLevelFailureConfig["content"].includes(val) && isJobLevelFailure) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, do you know that is the default value returning by filter here? I'm trying to skim through https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/filter but haven't found it yet. I'm trying to figure out the behavior when a new device is added into the list:

apple_iphone_15,
samsung_galaxy_s22,
samsung_galaxy_s24,
google_pixel_8_pro,

where jobLevelFailureConfig["content"].includes(val) would return false.

job_level_failure: {
key_name: "device",
content: [
"apple_iphone_15",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huydhn
you mean default value here?
We may need to maintain the config here

@yangw-dev yangw-dev merged commit d82e073 into adFailure Apr 10, 2025
5 of 6 checks passed
@yangw-dev yangw-dev deleted the fomralJobFailure branch April 10, 2025 17:44
@yangw-dev yangw-dev restored the fomralJobFailure branch April 10, 2025 17:45
yangw-dev added a commit that referenced this pull request Apr 10, 2025
@yangw-dev yangw-dev mentioned this pull request Apr 10, 2025
yangw-dev added a commit that referenced this pull request Apr 11, 2025
accidently merged it into the depended branch this is same as 

Fix bug for the failure handling
#6517
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants