[CUDA] stable diffusion benchmark allows IO binding for optimum #22834

tianleiwu · 2024-11-14T01:10:03Z

Description

Update stable diffusion benchmark:
(1) allow IO binding for optimum.
(2) do not use num_images_per_prompt across all engines for fair comparison.

Example of running the benchmark for optimum on stable diffusion 1.5:

git clone https://github.com/tianleiwu/optimum
cd optimum
git checkout tlwu/diffusers-io-binding
pip install -e .

pip install -U onnxruntime-gpu
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion
git checkout tlwu/benchmark_sd_optimum_io_binding
pip install -r requirements/cuda12/requirements.txt

optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 --task text-to-image ./sd_onnx_fp32

python optimize_pipeline.py -i ./sd_onnx_fp32 -o ./sd_onnx_fp16 --float16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16 --use_io_binding

Example output on H100_80GB_HBM3: 572 ms with IO binding; 588 ms without IO binding. IO binding saves 16 ms, or 2.7%.
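For reference, the gain quoted above can be reproduced from the two latencies. This is only a small arithmetic sketch: the helper name `io_binding_gain` is hypothetical and not part of benchmark.py, and the percentage is assumed to be measured relative to the run without IO binding.

```python
def io_binding_gain(without_ms, with_ms):
    """Return (milliseconds saved, percent saved relative to the run without IO binding)."""
    saved_ms = without_ms - with_ms
    percent = 100.0 * saved_ms / without_ms
    return saved_ms, percent


# Latencies reported in this PR for H100_80GB_HBM3.
saved_ms, percent = io_binding_gain(without_ms=588.0, with_ms=572.0)
print(f"IO binding saves {saved_ms:.0f} ms ({percent:.1f}%)")
# prints "IO binding saves 16 ms (2.7%)"
```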

Motivation and Context

Optimum is working on enabling I/O binding: huggingface/optimum#2056. This change helps test the impact of I/O binding on stable diffusion performance.

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/python/tools/transformers/models/stable_diffusion/benchmark.py

kunal-vaishnavi · 2024-11-14T05:55:04Z

We should upgrade the Optimum version here once those changes are merged.

onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/requirements/requirements.txt

Line 15 in 3cccde4

optimum==1.20.0

tianleiwu added 2 commits November 13, 2024 15:56

update sd benchmark to allow IO binding for optimum

ba6fd3f

Do not use num_images_per_prompt

34ee286

tianleiwu requested review from kunal-vaishnavi and jiafatom November 14, 2024 01:11

github-actions bot reviewed Nov 14, 2024

use 10 prompts for perf test by default

3cccde4

kunal-vaishnavi approved these changes Nov 14, 2024

tianleiwu merged commit 09c9843 into main Nov 14, 2024
93 checks passed

tianleiwu deleted the tlwu/benchmark_sd_optimum_io_binding branch November 14, 2024 08:09
