
Dump nodes with potential overflow in half conversion #23363

Merged
merged 7 commits into main from tlwu/dump_fp16_node_block_list on Jan 16, 2025

Conversation

@tianleiwu (Contributor) commented on Jan 14, 2025

Description

Add a tool to generate the node_block_list used in the [float16 conversion tool](https://github.com/microsoft/onnxruntime/blob/04030f64be10e020d3ac9aa5ba7d0f2917cbd14e/onnxruntime/python/tools/transformers/float16.py#L175).

We already have a feature to dump statistics (like min and max) of each node's inputs and outputs. However, using those statistics to build a list of nodes that need to be kept in float32 is time-consuming when the model is large.

This tool speeds up the process by directly outputting a list of nodes that have potential overflow in float-to-half conversion.

Usage: build onnxruntime from source with `--cmake_extra_defines onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1`, then set the following environment variables before running the float32 optimized ONNX model:

```
export ORT_DEBUG_NODE_IO_DUMP_HALF_CONVERSION_OVERFLOW=1
export ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD=50000

python benchmark.py -e optimum --height 1024 --width 1024 --steps 3 -b 1 -v Flux.1D -p flux1_dev_onnx/fp32_opt --skip_warmup
```

The threshold `ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD` shall be <= 65504 (the maximum finite value representable in float16). The default value is 50000 if the environment variable is not set. It is better to leave some margin below 65504 if the number of samples in the test is not large.
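Conceptually, a node is flagged when the magnitude of some dumped input or output value exceeds the threshold. A minimal Python sketch of that criterion (the node names and tensor data below are hypothetical, for illustration only; the actual check happens inside the C++ node dumping code):

```python
import numpy as np

# Mirrors ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD; float16 overflows above 65504.
HALF_OVERFLOW_THRESHOLD = 50000.0

def has_half_overflow(tensor: np.ndarray, threshold: float = HALF_OVERFLOW_THRESHOLD) -> bool:
    # A float32 value with magnitude above 65504 becomes inf in float16;
    # a lower threshold leaves margin for inputs not covered by the test samples.
    return bool(np.max(np.abs(tensor)) > threshold)

# Hypothetical per-node tensors, standing in for the dumped inputs/outputs.
dumped = {
    "/blocks.0/Softmax": np.array([0.1, 70000.0], dtype=np.float32),
    "/blocks.0/MatMul": np.array([2.0, 3.0], dtype=np.float32),
}
node_block_list = [name for name, t in dumped.items() if has_half_overflow(t)]
print(node_block_list)  # ['/blocks.0/Softmax']
```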

As a demo, we add a `--skip_warmup` option to benchmark.py for Flux, so that we can reduce the time spent dumping warm-up runs.

Example snippet of stdout (each inference session prints such a summary when the session ends):

```
Total counter in node dumping: 141
Found 2 nodes cannot be converted to half precision due to potential input/output overflow.
Operator frequencies for these nodes:
Softmax : 1
MatMul : 1
# -------
# Example python script for float16 conversion
# For details, search `node_block_list` in https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/float16.py
# -------
import onnx
from onnxruntime.transformers.onnx_model import OnnxModel
m = OnnxModel(onnx.load('flux1_dev_onnx/fp32_opt/vae_decoder/model.onnx'))
node_block_list = [
  '/decoder/mid_block/attentions.0/Softmax',
  '/decoder/mid_block/attentions.0/MatMul',
]
m.convert_float_to_float16(keep_io_types=False, node_block_list=node_block_list)
m.save_model_to_file('fp16/optimized.onnx', use_external_data_format=False)
```

You can then use the generated Python script to convert the corresponding model to float16.
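After conversion, it is worth sanity-checking the float16 model against the float32 one to confirm the block list prevented overflow. A minimal sketch (the input name and shape below are hypothetical; query the session for the model's actual inputs):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical input name and shape; use sess.get_inputs() to find the real ones.
feed_name = "latent_sample"
x = np.random.randn(1, 16, 128, 128).astype(np.float32)

fp32_sess = ort.InferenceSession("flux1_dev_onnx/fp32_opt/vae_decoder/model.onnx",
                                 providers=["CPUExecutionProvider"])
fp16_sess = ort.InferenceSession("fp16/optimized.onnx",
                                 providers=["CPUExecutionProvider"])

fp32_out = fp32_sess.run(None, {feed_name: x})[0]
# keep_io_types=False makes the converted model's inputs/outputs float16.
fp16_out = fp16_sess.run(None, {feed_name: x.astype(np.float16)})[0].astype(np.float32)

# Expect small precision differences, but no inf/nan from overflow.
print("max abs diff:", float(np.max(np.abs(fp32_out - fp16_out))))
print("all finite:", bool(np.isfinite(fp16_out).all()))
```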

Motivation and Context

This tool was used to generate the node_block_list for the float16 conversion of Stable Diffusion 3.x and Flux models in #22986.

In a Stable Diffusion or Flux pipeline there are multiple models, and there can be multiple session runs for each model. Without a proper tool, it is time-consuming to get the node_block_list for each model.

@tianleiwu marked this pull request as draft on January 14, 2025
@github-actions bot commented:

You can commit the suggested changes from lintrunner.

@tianleiwu marked this pull request as ready for review on January 15, 2025
@jiafatom (Contributor) commented:

Is the build failure related?

@tianleiwu requested a review from jiafatom on January 16, 2025
@tianleiwu merged commit 5735e1b into main on January 16, 2025 (98 checks passed)
@tianleiwu deleted the tlwu/dump_fp16_node_block_list branch on January 16, 2025
carzh pushed a commit that referenced this pull request on January 16, 2025
@fs-eire commented on January 17, 2025
fs-eire pushed a commit that referenced this pull request on January 18, 2025:

### Description

- Fix a type cast in #23363.
- Include some headers suggested by code scanning in that PR.

### Motivation and Context

The post-merge pipeline had a build error:
```
onnxruntime\core\framework\print_tensor_statistics_utils.h(92,55): error C2220: the following warning is treated as an error [D:\a\_work\1\b\Debug\onnxruntime_framework.vcxproj]
```
ashrit-ms pushed a commit that referenced this pull request on January 23, 2025
guschmue pushed two commits that referenced this pull request on March 6, 2025
ashrit-ms pushed two commits that referenced this pull request on March 17, 2025