
[WIP] Update benchmark data #643


Status: Draft · wants to merge 5 commits into main

Conversation

@Tcc0403 (Collaborator) commented Apr 2, 2025

Summary

Reran all benchmark scripts to get the latest data, so we have a reliable baseline for future optimization.

Note: orpo fails with compile=True (plotting with old data for now), and the qwen2vl_mrope script also failed.

A complete comparison figure will be uploaded in this PR later.
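
A minimal sketch of how rerunning all the benchmark scripts could be automated, assuming they live as standalone files under a benchmark/scripts/ directory (the path and glob pattern are assumptions, not the repo's confirmed layout):

```python
# rerun_benchmarks.py -- illustrative sketch, not part of this PR.
# Runs every benchmark script and records which ones fail (e.g. orpo
# with compile=True, qwen2vl_mrope) instead of aborting the whole run.
import subprocess
import sys
from pathlib import Path

SCRIPTS_DIR = Path("benchmark/scripts")  # assumed location of the scripts

failures = []
for script in sorted(SCRIPTS_DIR.glob("benchmark_*.py")):  # assumed naming
    print(f"Running {script.name} ...")
    result = subprocess.run([sys.executable, str(script)])
    if result.returncode != 0:
        failures.append(script.name)

if failures:
    print("Failed scripts:", ", ".join(failures))
```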

Fused Linear Chunked Loss

Alignment

  • CPO (figure: fused_linear_cpo_loss_speed)
  • DPO (figure: dpo_loss_speed)
  • KTO (figure: kto_loss_speed)
  • ORPO (figure: fused_linear_orpo_loss_speed)
  • SimPO (figure: fused_linear_simpo_loss_speed)

Distillation

  • JSD (figure: distill_jsd_loss_speed)

Others

  • Cross Entropy (figure: cross_entropy_speed)
  • Fused Linear Cross Entropy (figure: fused_linear_cross_entropy_speed)
  • JSD (figure: jsd_speed)
  • Fused Linear JSD (figure)
  • DyT (figure: dyt_speed)
  • Embedding (figure: embedding_speed)
  • GeGLU (figure: geglu_speed)
  • GroupNorm (figure: group_norm_speed)
  • KL Div (figure: kl_div_speed)
  • LayerNorm (figure: layer_norm_speed)
  • RMSNorm (figure: rms_norm_speed)
  • RoPE (figure: rope_speed)
  • Swiglu (figure)
  • TVD (figure: tvd_speed)

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@Tcc0403 (Collaborator, Author) commented Apr 7, 2025

@shivam15s @lancerts @yundai424
I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:

  1. Some data are quite outdated; do we need to keep data from old versions (< v0.5.0)?
  2. For future benchmarking, do we keep the latest data only (overwrite data from older versions)? Or do we want to keep track of them for performance comparison over time?
  3. This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?

@yundai424 (Collaborator) commented:

Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 Would like to hear your opinions.

@lancerts (Collaborator) commented Apr 7, 2025

> Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 Would like to hear your opinions.

Strong +1, which can also help detect performance regressions early.

@lancerts (Collaborator) commented Apr 7, 2025

> @shivam15s @lancerts @yundai424 I'm trying to refactor the benchmark visualizer and the utils for storing data, and there are a few questions I want to figure out first:
>
>   1. Some data are quite outdated; do we need to keep data from old versions (< v0.5.0)?
>   2. For future benchmarking, do we keep the latest data only (overwrite data from older versions)? Or do we want to keep track of them for performance comparison over time?
>   3. This PR only updates H100 data for now; do we need the latest A100 benchmark data as well?

  1. I don't think we need to keep the old data.
  2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release.
  3. I think we will still need the A100 data in the near future.
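
One way such a release-time regression guardrail could look, sketched under assumptions (the CSV file names, the kernel_name/speed_ms schema, and the 5% threshold are all hypothetical, not the repo's actual layout):

```python
# check_regression.py -- hypothetical guardrail sketch for a release CI step.
# Compares a fresh benchmark run against the committed baseline and fails
# if any kernel's median speed degrades beyond a tolerance.
import sys
import pandas as pd

TOLERANCE = 0.05  # fail on >5% slowdown; the threshold is an assumption

def median_speed(path: str) -> pd.Series:
    df = pd.read_csv(path)  # assumed columns: kernel_name, speed_ms
    return df.groupby("kernel_name")["speed_ms"].median()

baseline = median_speed("data/all_benchmark_data.csv")   # committed baseline
latest = median_speed("data/new_benchmark_data.csv")     # fresh run

# A regression is a kernel whose runtime grew by more than TOLERANCE.
slowdown = (latest - baseline) / baseline
regressions = slowdown[slowdown > TOLERANCE].dropna()

if not regressions.empty:
    print("Performance regressions detected:")
    print(regressions.to_string())
    sys.exit(1)
print("No regressions beyond tolerance.")
```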

@Tcc0403 (Collaborator, Author) commented Apr 8, 2025

@yundai424 @lancerts

> Perhaps we can do an official benchmark whenever a new version is released. Along with the PR that bumps the version in pyproject.toml, we can add the latest benchmark result -- this way we can let git history help us keep track of the performance 😄 Would like to hear your opinions.

Totally agree! An official benchmark result is definitely better.

>   1. I don't think we need to keep the old data.
>   2. Keeping the latest data should be enough, and we can have git help us track it. We should guardrail against performance regressions for each release.
>   3. I think we will still need the A100 data in the near future.

Besides the benchmark that accompanies new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.

Is it possible to set up a scheduled CI job to periodically update the nightly benchmark?

If so, instead of the current all_benchmark_data, we can create two benchmark data files: one for version releases (full benchmark) and the other for nightly runs (simple benchmark). The release file keeps a complete benchmark result for the latest version, as the current one does. The nightly file can hold multiple recent results (10-20 commits, or weeks/months), but only with the most representative config, e.g., batch_size, seq_len, hidden_size, and vocab_size of llama. This way, we can set the x-axis to date and visualize the trend for readability. In the best case, we can plot it in the online/offline docs.
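
A rough sketch of what the proposed nightly flow could look like (the nightly CSV name, its date/kernel_name/speed_ms columns, and the 20-run window are illustrative assumptions, not anything decided in this thread):

```python
# nightly_benchmark.py -- illustrative sketch of the proposed nightly flow.
# Appends today's result for one representative config, keeps only the
# most recent window of entries per kernel, and plots speed over date.
from datetime import date
import pandas as pd
import matplotlib.pyplot as plt

NIGHTLY_CSV = "data/nightly_benchmark_data.csv"  # hypothetical file
WINDOW = 20  # keep roughly the last 20 runs, as suggested above

def append_and_trim(kernel: str, speed_ms: float) -> pd.DataFrame:
    df = pd.read_csv(NIGHTLY_CSV, parse_dates=["date"])
    new_row = pd.DataFrame(
        [{"date": pd.Timestamp(date.today()),
          "kernel_name": kernel,
          "speed_ms": speed_ms}]
    )
    df = pd.concat([df, new_row], ignore_index=True)
    # Keep only the most recent WINDOW entries for each kernel.
    df = df.sort_values("date").groupby("kernel_name").tail(WINDOW)
    df.to_csv(NIGHTLY_CSV, index=False)
    return df

def plot_trend(df: pd.DataFrame, kernel: str) -> None:
    # Date on the x-axis, as proposed, so regressions show up as a trend.
    sub = df[df["kernel_name"] == kernel]
    plt.plot(sub["date"], sub["speed_ms"], marker="o")
    plt.xlabel("date")
    plt.ylabel("speed (ms)")
    plt.title(f"{kernel} nightly speed")
    plt.savefig(f"{kernel}_nightly_speed.png")
```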

@yundai424 (Collaborator) commented Apr 8, 2025

> Besides the benchmark that accompanies new releases, I think it would be great to have an additional nightly (or weekly) benchmark, so we can detect performance regressions earlier and handle them before the version bump.

Agree 🤔 Ideally something like https://hud.pytorch.org/benchmark/compilers, hosting the results on a separate server so we don't flood the git history with a bunch of benchmark numbers.
