
[WIP] Update benchmark data #643


Status: Draft · wants to merge 5 commits into main

Conversation

@Tcc0403 (Collaborator) commented Apr 2, 2025

Summary

Reran all benchmark scripts to get the latest data, so we have a reliable baseline for future optimization.

Note: orpo fails with compile=True (plotting with old data for now), and the qwen2vl_mrope script also failed.

A complete comparison figure will be uploaded in this PR later.
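
A minimal sketch of how rerunning all the benchmark scripts could be automated, assuming they live as standalone files under a benchmark/scripts/ directory (the path and glob pattern are assumptions, not the repo's confirmed layout):

```python
# rerun_benchmarks.py -- illustrative sketch, not part of this PR.
# Runs every benchmark script and records which ones fail (e.g. orpo
# with compile=True, qwen2vl_mrope) instead of aborting the whole run.
import subprocess
import sys
from pathlib import Path

SCRIPTS_DIR = Path("benchmark/scripts")  # assumed location of the scripts

failures = []
for script in sorted(SCRIPTS_DIR.glob("benchmark_*.py")):  # assumed naming
    print(f"Running {script.name} ...")
    result = subprocess.run([sys.executable, str(script)])
    if result.returncode != 0:
        failures.append(script.name)

if failures:
    print("Failed scripts:", ", ".join(failures))
```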

Fused Linear Chunked Loss

Alignment

  • CPO (figure: fused_linear_cpo_loss_speed)
  • DPO (figure: dpo_loss_speed)
  • KTO (figure: kto_loss_speed)
  • ORPO (figure: fused_linear_orpo_loss_speed)
  • SimPO (figure: fused_linear_simpo_loss_speed)

Distillation

  • JSD (figure: distill_jsd_loss_speed)

Others

  • Cross Entropy (figure: cross_entropy_speed)
  • Fused Linear Cross Entropy (figure: fused_linear_cross_entropy_speed)
  • JSD (figure: jsd_speed)
  • Fused Linear JSD (figure)
  • DyT (figure: dyt_speed)
  • Embedding (figure: embedding_speed)
  • GeGLU (figure: geglu_speed)
  • GroupNorm (figure: group_norm_speed)
  • KL Div (figure: kl_div_speed)
  • LayerNorm (figure: layer_norm_speed)
  • RMSNorm (figure: rms_norm_speed)
  • RoPE (figure: rope_speed)
  • Swiglu (figure)
  • TVD (figure: tvd_speed)

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@Tcc0403 (Collaborator, Author) commented Apr 7, 2025

@shivam15s @lancerts @yundai424
I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:

  1. Some data are quite outdated; do we need to keep data from old versions (< v0.5.0)?
  2. For future benchmarking, do we keep the latest data only (overwrite data from older versions)? Or do we want to keep track of them for performance comparison over time?
  3. This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?

@yundai424 (Collaborator) commented:

Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 Would like to hear your opinions.

@lancerts (Collaborator) commented Apr 7, 2025

> Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 Would like to hear your opinions.

Strong +1, which can also help detect performance regressions early.

@lancerts (Collaborator) commented Apr 7, 2025

> @shivam15s @lancerts @yundai424 I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:
>
>   1. Some data are quite outdated; do we need to keep data from old versions (< v0.5.0)?
>   2. For future benchmarking, do we keep the latest data only (overwrite data from older versions)? Or do we want to keep track of them for performance comparison over time?
>   3. This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?

  1. I don't think we need to keep the old data.
  2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release.
  3. I think we will still need the A100 data in the near future.
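
One way such a release-time regression guardrail could look, sketched under assumptions (the CSV file names, the kernel_name/speed_ms schema, and the 5% threshold are all hypothetical, not the repo's actual layout):

```python
# check_regression.py -- hypothetical guardrail sketch for a release CI step.
# Compares a fresh benchmark run against the committed baseline and fails
# if any kernel's median speed degrades beyond a tolerance.
import sys
import pandas as pd

TOLERANCE = 0.05  # fail on >5% slowdown; the threshold is an assumption

def median_speed(path: str) -> pd.Series:
    df = pd.read_csv(path)  # assumed columns: kernel_name, speed_ms
    return df.groupby("kernel_name")["speed_ms"].median()

baseline = median_speed("data/all_benchmark_data.csv")   # committed baseline
latest = median_speed("data/new_benchmark_data.csv")     # fresh run

# A regression is a kernel whose runtime grew by more than TOLERANCE.
slowdown = (latest - baseline) / baseline
regressions = slowdown[slowdown > TOLERANCE].dropna()

if not regressions.empty:
    print("Performance regressions detected:")
    print(regressions.to_string())
    sys.exit(1)
print("No regressions beyond tolerance.")
```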

@Tcc0403 (Collaborator, Author) commented Apr 8, 2025

@yundai424 @lancerts

> Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 Would like to hear your opinions.

Totally agree! An official benchmark result is definitely better.

>   1. I don't think we need to keep the old data.
>   2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release.
>   3. I think we will still need the A100 data in the near future.

Besides the benchmark that accompanies new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.

Is it possible to set up a scheduled CI job to periodically update the nightly benchmark?

If so, instead of the current all_benchmark_data, we can create two benchmark data files: one for version releases (full benchmark) and the other for nightly runs (simple benchmark). The release file keeps a complete benchmark result for the latest version, as the current one does. The nightly file can hold multiple recent results (10-20 commits, or weeks/months), but only with the most representative config, e.g., batch_size, seq_len, hidden_size, and vocab_size of llama. This way, we can set the x-axis to date and visualize the trend for readability. In the best case, we can plot it in the online/offline docs.
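
A rough sketch of what the proposed nightly flow could look like (the nightly CSV name, its date/kernel_name/speed_ms columns, and the 20-run window are illustrative assumptions, not anything decided in this thread):

```python
# nightly_benchmark.py -- illustrative sketch of the proposed nightly flow.
# Appends today's result for one representative config, keeps only the
# most recent window of entries per kernel, and plots speed over date.
from datetime import date
import pandas as pd
import matplotlib.pyplot as plt

NIGHTLY_CSV = "data/nightly_benchmark_data.csv"  # hypothetical file
WINDOW = 20  # keep roughly the last 20 runs, as suggested above

def append_and_trim(kernel: str, speed_ms: float) -> pd.DataFrame:
    df = pd.read_csv(NIGHTLY_CSV, parse_dates=["date"])
    new_row = pd.DataFrame(
        [{"date": pd.Timestamp(date.today()),
          "kernel_name": kernel,
          "speed_ms": speed_ms}]
    )
    df = pd.concat([df, new_row], ignore_index=True)
    # Keep only the most recent WINDOW entries for each kernel.
    df = df.sort_values("date").groupby("kernel_name").tail(WINDOW)
    df.to_csv(NIGHTLY_CSV, index=False)
    return df

def plot_trend(df: pd.DataFrame, kernel: str) -> None:
    # Date on the x-axis, as proposed, so regressions show up as a trend.
    sub = df[df["kernel_name"] == kernel]
    plt.plot(sub["date"], sub["speed_ms"], marker="o")
    plt.xlabel("date")
    plt.ylabel("speed (ms)")
    plt.title(f"{kernel} nightly speed")
    plt.savefig(f"{kernel}_nightly_speed.png")
```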

@yundai424 (Collaborator) commented Apr 8, 2025

> Besides the benchmark that accompanies new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.

Agree 🤔 Ideally something like https://hud.pytorch.org/benchmark/compilers, hosting the results on a separate server so we don't flood the git history with a bunch of benchmark numbers.
