Description
🐛 Describe the bug
I've encountered a performance issue where ExecuTorch's inference speed is significantly slower than ONNX Runtime's, on both a Linux PC and an Android phone. I believe this is a critical issue that needs to be addressed, as it affects the efficiency of our model deployment.
Environment:
onnx==1.17.0
onnxruntime==1.20.0
executorch==0.3.0
torch==2.4.0+cu121
python=3.10.15
Linux pc hardware: NVIDIA A100 80GB, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Android phone hardware: Qualcomm Snapdragon 8+ Gen 1
Reproduction Steps:
The ViT is an InternVIT-300M model with a 7 × 3 × 448 × 448 input.
I export the ViT model with:
python -m examples.xnnpack.aot_compiler --model_name="internvit" --delegate --quantize
and run inference on the Linux PC with:
./cmake-out/backends/xnnpack/xnn_executor_runner --model_path=./internvit_xnnpack_q8.pte
and on Android with:
adb shell ./data/local/tmp/vit/xnn_executor_runner_android --model_path /data/local/tmp/vit/internvit_xnnpack_q8.pte
Expected Behavior:
I'm not sure exactly what inference times to expect from ONNX Runtime or ExecuTorch, but I assumed they would be within an acceptable margin of each other. I have already exported a llama2-2B model that runs at a reasonable speed on my Android phone (TTFT 0.5 s, 30 tokens/s), so I expected the ViT-300M inference speed to be somewhat similar.
Actual Behavior:
ONNX Runtime inference time on Linux PC: 12 s
ViT ExecuTorch inference time on Linux PC: 450 s
ViT ExecuTorch inference time on Android: 200 s
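For reference, the numbers above are simple wall-clock measurements. A minimal harness along these lines (where `run_once` is a hypothetical placeholder for a single inference call in either runtime) keeps the two measurements apples-to-apples:

```python
import time

def benchmark(run_once, warmup=2, iters=5):
    """Return average wall-clock seconds per call of run_once().

    warmup iterations are discarded so one-time costs (weight loading,
    kernel selection) do not skew the average.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - start) / iters
```

For example, `benchmark(lambda: session.run(None, inputs))` for an ONNX Runtime session, and the analogous single-inference call for the ExecuTorch runner.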
Questions:
Is there any known performance regression in ExecuTorch compared to ONNX Runtime?
Are there any optimization techniques or configurations that could improve the ViT's ExecuTorch performance?
I would appreciate any guidance on how to resolve this performance discrepancy. Thank you for your attention to this issue.
Versions
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31
Python version: 3.10.15 | packaged by conda-forge | (main, Sep 20 2024, 16:37:05) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-191-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.6.68
CUDA_MODULE_LOADING set to: LAZY