Benchmark HF optimum-executorch #11450

Open: guangy10 wants to merge 3 commits into main from optimum_et_benchmark

Conversation

@guangy10 guangy10 (Contributor) commented Jun 6, 2025

Benchmark LLMs from optimum-executorch. With all the work recently happening in optimum-executorch, we are able to boost the out-of-the-box performance. Putting these models on the benchmark infra lets us gather perf numbers and understand the remaining perf gaps relative to the in-house models generated via export_llama.

We are able to do an apples-to-apples comparison for the CPU backend by introducing quantization, custom SDPA, and a custom KV cache to native Hugging Face models in optimum-executorch: hf_xnnpack_custom_spda_kv_cache_8da4w represents the recipe used by optimum-et, and et_xnnpack_custom_spda_kv_cache_8da4w is the counterpart for etLLM.
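For context, a minimal sketch of exporting a Hugging Face model through the optimum-executorch CLI. The model id and output directory below are illustrative assumptions, and the exact flags wiring up the 8da4w/custom-SDPA/KV-cache recipe are not shown here; consult the optimum-executorch docs for those:

# Install optimum-executorch first (see its README for the supported install path).
# Export an HF model to a .pte via the optimum-executorch CLI (XNNPACK recipe).
# The model id is an example, not necessarily the one benchmarked in this PR.
optimum-cli export executorch \
  --model "Qwen/Qwen3-0.6B" \
  --task text-generation \
  --recipe xnnpack \
  --output_dir ./qwen3_et
ls -lh ./qwen3_et/*.pte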

Here are the benchmark jobs in our infra:

Note: there may be failures when running optimum-et models on-device due to the lack of support for HF tokenizers in the llama runner. I will stop packing tokenizer.json into the .zip shortly so that the benchmark apps will treat optimum-et LLMs as non-GenAI models.
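For reference, dropping the tokenizer from an already-packaged artifact can be done with zip's delete mode; the archive name here is an illustrative assumption:

# Remove tokenizer.json from the benchmark bundle so the apps treat the
# model as a non-GenAI .pte (archive name is hypothetical).
zip -d model.zip tokenizer.json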


pytorch-bot bot commented Jun 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11450

@guangy10 guangy10 (Contributor, Author) commented Jun 7, 2025

@huydhn In the Apple workflow, even though I specified the Python version as "3.11", it still installs Python 3.13. Then, when trying to pip install executorch, it fails with no package found, because we only publish wheels for Python 3.10, 3.11, and 3.12. https://github.com/pytorch/executorch/actions/runs/15500604843/job/43647388676#step:9:13372

Okay, it turns out that I need to run the install with ${CONDA_RUN}.
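A sketch of the fix, under the assumption that the install step was previously invoking pip against the runner's system Python (the exact install line in the workflow may differ):

# Before (hypothetical): resolves whatever python3 the runner provides, e.g. 3.13
# python -m pip install executorch
# After: ${CONDA_RUN}, defined by the CI setup, runs the command inside the
# job's conda environment, which pins the requested Python (3.11 here), so
# matching executorch wheels exist.
${CONDA_RUN} python -m pip install executorch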

Review thread on these export_llama flags:

-X \
--xnnpack-extended-ops \
-qmode 8da4w -G 32 -E 8,0 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
Contributor:
are these for llama_3_2?

Contributor Author:
@kimishpatel Yeah, for llama_3_2.

Contributor Author:
@kimishpatel @jackzhxng Can you confirm whether this is the correct config we should use to export Qwen3 via the etLLM path? The perf numbers reported here don't make sense to me: #11450 (comment)

Contributor:
I don't know for Qwen3. Can you compare the file sizes of the two? Also use --xnnpack-extended-ops.
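For instance, a quick size comparison of the two artifacts (filenames here are illustrative assumptions based on the config names in this PR):

# Compare the etLLM-exported and optimum-et-exported artifacts;
# a large size gap usually indicates differing quantization configs.
ls -lh et_xnnpack_custom_spda_kv_cache_8da4w.pte hf_xnnpack_custom_spda_kv_cache_8da4w.pte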

Contributor:
Oh, never mind, you are using the option I mentioned.

Contributor:
This is not the command for Qwen, right? The Qwen command is the one below?

Contributor:
For this one I don't see an HF counterpart.

@guangy10 guangy10 (Contributor, Author) commented Jun 9, 2025

I'm seeing jobs hitting API limits in AWS Device Farm. We lifted the limit for public AWS devices; @huydhn, do we need to do the same, separately, for the new devices in private pools? https://github.com/pytorch/executorch/actions/runs/15504512047

@guangy10 guangy10 (Contributor, Author) commented Jun 9, 2025

Both benchmark jobs finished successfully, but upon checking the benchmark_results.json files, they are empty: https://github.com/pytorch/executorch/actions/runs/15540702059/job/43754028987 @kirklandsign any idea why?

And all these runs: https://github.com/pytorch/executorch/actions/runs/15543294199. I would expect those to fail due to the issues with tokenizer support in the llama runner.

@guangy10 guangy10 (Contributor, Author) commented Jun 10, 2025

Disabling passing the tokenizer to the Android app makes it work for Qwen3 from both etLLM and optimum-et, as shown here: https://github.com/pytorch/executorch/actions/runs/15546916104/job/43772530729.
There must be an issue with integrating the HF tokenizer into the Android benchmark app. cc: @kirklandsign @jackzhxng

The iOS app cannot run the optimum-et-generated PTE even after disabling the tokenizer; that is, running it as a regular PTE doesn't work as expected. https://github.com/pytorch/executorch/actions/runs/15540727931/job/43752973136. cc: @shoumikhin

@guangy10 guangy10 (Contributor, Author) commented:

@kimishpatel Here I can see the reported raw latency for Qwen3-0.6B from both etLLM and optimum-et: https://github.com/pytorch/executorch/actions/runs/15546916104/job/43772530729. The numbers don't make sense to me: they show the optimum-et-generated PTE is 5x faster on the same Samsung Galaxy S22 5G. I suspect it's because the etLLM model is not exported with the same config we're using for optimum-et.

Comment on lines +282 to +296
elif [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
  # Fetch the original Llama checkpoint artifacts from the Hugging Face Hub
  DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model" "params.json" "consolidated.00.pth")
  # Export via the in-house etLLM path: KV cache with custom SDPA, fp32 dtype,
  # XNNPACK delegate with extended ops, and 8da4w quantization
  # (group size 32, 8-bit embedding quantization)
  ${CONDA_RUN} python -m examples.models.llama.export_llama \
    --model llama3_2 \
    --checkpoint "${DOWNLOADED_PATH}/consolidated.00.pth" \
    --params "${DOWNLOADED_PATH}/params.json" \
    -kv \
    --use_sdpa_with_kv_cache \
    -d fp32 \
    -X \
    --xnnpack-extended-ops \
    -qmode 8da4w -G 32 -E 8,0 \
    --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    --output_name="${OUT_ET_MODEL_NAME}.pte"
  ls -lh "${OUT_ET_MODEL_NAME}.pte"
Contributor:

If possible, please refactor these in a later PR; that way there are fewer lines to review for a change like this.

@kimishpatel kimishpatel (Contributor) left a comment:

Looks good to me, but do the benchmark numbers get reported to a dashboard? It will be easier to track the numbers that way.

Also, let's validate the numbers before landing.

- return [filename hasSuffix:@".pte"] && [filename.lowercaseString containsString:@"llama"];
+ return [filename hasSuffix:@".pte"] && [filename.lowercaseString containsString:@"llm"];
Contributor Author:

This is likely the issue causing no TPS to be reported for the Qwen model, from either the etLLM-generated PTE or the optimum-et-generated one. Rescheduled a new run with this fix here: https://github.com/pytorch/executorch/actions/runs/15549765343

Labels: CLA Signed · release notes: none
3 participants