
vit executorch inference speed much slower than onnx #6961


🐛 Describe the bug

I've encountered a performance issue where ExecuTorch's inference speed is significantly slower than ONNX Runtime's, both on a Linux PC and on an Android phone. I believe this is a critical issue that needs to be addressed, as it affects the efficiency of our model deployment.

Environment:

onnx==1.17.0
onnxruntime==1.20.0
executorch==0.3.0
torch==2.4.0+cu121
python=3.10.15

Linux PC hardware: NVIDIA A100 80GB, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Android phone hardware: Qualcomm Snapdragon 8+ Gen 1

Reproduction Steps:

The ViT is an InternVIT-300M model with an input size of 7 × 3 × 448 × 448.

I export the ViT model with:

python -m examples.xnnpack.aot_compiler --model_name="internvit" --delegate --quantize
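
For context, here is roughly what I understand that export command to be doing under the hood (a minimal sketch based on executorch 0.3-era APIs; `load_internvit` is a placeholder for however the model gets loaded, and the `--quantize` PT2E step is omitted for brevity, so the exact calls inside `aot_compiler` may differ):

```python
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Assumption: load_internvit() is a hypothetical helper that returns the
# InternVIT-300M module; it stands in for however the model is actually built.
model = load_internvit().eval()
example_inputs = (torch.randn(7, 3, 448, 448),)

# Export to an ATen graph, lower to the Edge dialect, then delegate the
# supported subgraphs to the XNNPACK backend.
exported = export(model, example_inputs)
edge = to_edge(exported)
edge = edge.to_backend(XnnpackPartitioner())

# Serialize the final ExecuTorch program to a .pte file.
et_program = edge.to_executorch()
with open("internvit_xnnpack_q8.pte", "wb") as f:
    f.write(et_program.buffer)
```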

and run inference on the Linux PC with:

./cmake-out/backends/xnnpack/xnn_executor_runner --model_path=./internvit_xnnpack_q8.pte

and on Android with:

adb shell ./data/local/tmp/vit/xnn_executor_runner_android --model_path /data/local/tmp/vit/internvit_xnnpack_q8.pte
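
For reference, the ONNX baseline is timed with a plain onnxruntime session along these lines (a rough sketch; the `internvit.onnx` file name, the CPU execution provider, and the single-input signature are assumptions):

```python
import time
import numpy as np
import onnxruntime as ort

# Assumption: the model was exported separately to internvit.onnx via torch.onnx.export.
session = ort.InferenceSession("internvit.onnx", providers=["CPUExecutionProvider"])

x = np.random.randn(7, 3, 448, 448).astype(np.float32)
input_name = session.get_inputs()[0].name

# Warm up once, then time a single forward pass.
session.run(None, {input_name: x})
start = time.perf_counter()
session.run(None, {input_name: x})
print(f"onnxruntime CPU latency: {time.perf_counter() - start:.1f} s")
```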

Expected Behavior:

I'm not deeply familiar with typical inference times for either ONNX Runtime or ExecuTorch, but I expected them to be within an acceptable margin of each other. I have already exported a llama2-2B model that runs at a reasonable speed on my Android phone (TTFT ~0.5 s, ~30 tokens/s), so I expected the ViT-300M's inference speed to be somewhat similar.

Actual Behavior:

ONNX inference time on Linux PC: 12 s
ExecuTorch inference time on Linux PC: 450 s
ExecuTorch inference time on Android: 200 s

Questions:

Is there any known performance regression in ExecuTorch compared to ONNX Runtime?
Are there any optimization techniques or configurations that can improve the ViT's ExecuTorch performance, for example by confirming how much of the graph is actually delegated (see the sketch below)?
I would appreciate any guidance on how to resolve this performance discrepancy. Thank you for your attention to this issue.
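
If it helps triage, here is a rough sketch of the delegation check I can run, reusing the `edge` program manager from the export sketch above and counting delegate calls via plain torch.fx graph traversal:

```python
from collections import Counter

# Count how many nodes were lowered to a delegate vs. left to the portable ops,
# using the edge program produced after to_backend() in the export sketch above.
graph_module = edge.exported_program().graph_module
targets = Counter(
    str(node.target) for node in graph_module.graph.nodes if node.op == "call_function"
)

num_delegated = sum(n for t, n in targets.items() if "executorch_call_delegate" in t)
print(f"delegate calls: {num_delegated}")
print("remaining (non-delegated) ops:")
for target, count in targets.most_common(20):
    if "executorch_call_delegate" not in target:
        print(f"  {target}: {count}")
```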

Versions

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31

Python version: 3.10.15 | packaged by conda-forge | (main, Sep 20 2024, 16:37:05) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-191-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.6.68
CUDA_MODULE_LOADING set to: LAZY


cc @digantdesai @mcr229 @cbilgin
