Benchmark HF optimum-executorch #11450
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11450.
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@huydhn Okay, it turns out that I need to run install with
```bash
-X \
--xnnpack-extended-ops \
-qmode 8da4w -G 32 -E 8,0 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
```
Are these for llama_3_2?
@kimishpatel Yeah, for llama_3_2.
@kimishpatel @jackzhxng Can you confirm whether this is the correct config we should use to export Qwen3 via the etLLM path? The perf numbers reported here don't make sense to me: #11450 (comment)
I don't know for Qwen3. Can you compare the file sizes for the two? Also use --xnnpack-extended-ops.
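A minimal sketch of such a file-size comparison, assuming both exported PTE files are available locally (the file names below are placeholders, not the artifacts produced by the CI jobs):

```bash
# Compare the sizes of the two exported models; a large gap usually means the
# two paths were not quantized with the same scheme. File names are placeholders.
ls -lh qwen3_0_6b_etllm.pte qwen3_0_6b_optimum_et.pte

# Byte-exact sizes for a quick ratio check (GNU stat; use `stat -f '%N %z'` on macOS).
stat -c '%n %s' qwen3_0_6b_etllm.pte qwen3_0_6b_optimum_et.pte
```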
Oh, never mind. You are already using the option I mentioned.
This is not the command for Qwen, right? That one is the one below?
For this one I don't see an HF counterpart.
I'm seeing jobs hitting API limits in AWS Device Farm. We lifted it for public AWS devices; @huydhn, do we need to do the same, separately, for the new devices in private pools? https://github.com/pytorch/executorch/actions/runs/15504512047
Both benchmark jobs finished successfully, but upon checking the benchmark_results.json files, they are empty: https://github.com/pytorch/executorch/actions/runs/15540702059/job/43754028987. @kirklandsign any idea why? And all these runs: https://github.com/pytorch/executorch/actions/runs/15543294199. I would expect those to fail due to the issues with the tokenizer support in the llama runner.
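A minimal sketch of how one could inspect the downloaded artifact locally to confirm it is empty; the path and the assumption that the file is a JSON list/object of records are mine, not the actual schema used by the benchmark app:

```bash
# Placeholder path to the artifact downloaded from the run above.
ARTIFACT=benchmark_results.json

ls -lh "${ARTIFACT}"

# Print how many top-level entries the JSON holds; zero entries means the app
# produced no metrics even though the job itself reported success.
python3 -c "
import json
with open('${ARTIFACT}') as f:
    data = json.load(f)
print(type(data).__name__, 'with', len(data), 'top-level entries')
"
```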
Disabling passing the tokenizer to the Android app makes it work for Qwen3 from both etLLM and optimum-et, as shown here: https://github.com/pytorch/executorch/actions/runs/15546916104/job/43772530729. The iOS app cannot run the optimum-et generated PTE even after disabling the tokenizer; that is, running it as a regular PTE doesn't work as expected: https://github.com/pytorch/executorch/actions/runs/15540727931/job/43752973136. cc: @shoumikhin
@kimishpatel Here I can see the reported raw latency for Qwen3-0.6B from both etLLM and optimum-et: https://github.com/pytorch/executorch/actions/runs/15546916104/job/43772530729. The numbers don't make sense to me: they show the optimum-et generated PTE being 5x faster on the same Samsung Galaxy S22 5G. I suspect it's because the etLLM model is not exported with the same config we're using for optimum-et.
```bash
elif [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
  DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model" "params.json" "consolidated.00.pth")
  ${CONDA_RUN} python -m examples.models.llama.export_llama \
    --model llama3_2 \
    --checkpoint "${DOWNLOADED_PATH}/consolidated.00.pth" \
    --params "${DOWNLOADED_PATH}/params.json" \
    -kv \
    --use_sdpa_with_kv_cache \
    -d fp32 \
    -X \
    --xnnpack-extended-ops \
    -qmode 8da4w -G 32 -E 8,0 \
    --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    --output_name="${OUT_ET_MODEL_NAME}.pte"
  ls -lh "${OUT_ET_MODEL_NAME}.pte"
```
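For comparison, a hedged sketch of the optimum-et counterpart (the hf_xnnpack_custom_spda_kv_cache_8da4w config). The `optimum-cli export executorch` entry point comes from optimum-executorch, but the exact flags for quantization, custom SDPA, and custom KV cache depend on the optimum-executorch version, so treat everything beyond --model/--recipe/--output_dir as an assumption rather than the command the CI job actually runs:

```bash
# Illustrative only: export the same HF model through optimum-executorch with
# the XNNPACK recipe. OUT_HF_MODEL_DIR is a placeholder output directory.
${CONDA_RUN} optimum-cli export executorch \
  --model "${HF_MODEL_REPO}" \
  --task text-generation \
  --recipe xnnpack \
  --output_dir "${OUT_HF_MODEL_DIR}"

ls -lh "${OUT_HF_MODEL_DIR}"
```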
If possible, please refactor these in a later PR. It would be fewer lines to review for a change like this.
Looks good to me, but do the benchmark numbers get reported to a dashboard? It will be easier to track the numbers that way.
Also, let's validate the numbers before landing.
return [filename hasSuffix:@".pte"] && [filename.lowercaseString containsString:@"llama"]; | ||
return [filename hasSuffix:@".pte"] && [filename.lowercaseString containsString:@"llm"]; |
This is likely the issue causing no TPS to be reported for the Qwen model, neither for the etLLM-generated one nor for the optimum-et-generated one. Rescheduled a new run with this fix here: https://github.com/pytorch/executorch/actions/runs/15549765343
Benchmark LLMs from optimum-executorch. With all the work recently happening in optimum-executorch, we are able to boost the out-of-the-box performance. This PR puts these models on the benchmark infra to gather perf numbers and understand the remaining perf gaps relative to the in-house models generated via export_llama.

We are able to do an apples-to-apples comparison for the CPU backend by introducing quantization, custom SDPA, and a custom KV cache to native Hugging Face models in optimum-executorch: hf_xnnpack_custom_spda_kv_cache_8da4w represents the recipe used by optimum-et, and et_xnnpack_custom_spda_kv_cache_8da4w is the counterpart for etLLM.

Here are the benchmark jobs in our infra:

Note there may be failures when running optimum-et models on-device due to the lack of support for HF tokenizers in the llama runner. I will remove packing tokenizer.json into the .zip shortly so that the benchmark apps will take optimum-et LLMs as non-GenAI models.
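For reproducing the on-demand runs above, a rough sketch of dispatching the Android benchmark workflow with both configs via the GitHub CLI; the workflow file name and input names (models, devices, benchmark_configs) are assumptions based on the existing benchmark infra and may not match the actual workflow inputs:

```bash
# Hypothetical manual dispatch; verify the workflow file and its inputs before using.
gh workflow run android-perf.yml \
  --repo pytorch/executorch \
  -f models="Qwen/Qwen3-0.6B" \
  -f devices="samsung_galaxy_s22" \
  -f benchmark_configs="et_xnnpack_custom_spda_kv_cache_8da4w,hf_xnnpack_custom_spda_kv_cache_8da4w"
```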