Why are the operators in a static-quantized model still fp32? #24038
-
Hi there. I am a newbie to model quantization. After reading the onnxruntime quantization docs, I know there are two basic ways to represent a quantized model: the operator-oriented (QOperator) format, which uses quantized operators directly, and the tensor-oriented (QDQ) format, which inserts QuantizeLinear/DequantizeLinear pairs around the original operators.
Am I right? In my opinion, the operators of a quantized model should have integer types (e.g. Conv replaced by ConvInteger). But I tried the example in the onnxruntime repo and visualized the static-quantized MobileNet, and the operators are still fp32, as shown below:

[screenshot of the quantized MobileNet graph]
-
Yes, exactly. The idea of the QDQ approach is that it lets you express quantization of many different operators without needing to introduce a quantized variant of each one. To actually benefit from quantization at runtime, the runtime needs to fuse the QDQ patterns internally. If the runtime does not support a particular fusion, the model will still run correctly, just without the performance improvement.

ONNX Runtime has the ability to save the model graph after optimization. You can use that to see how ORT fuses a particular pattern of nodes. I found the current quantization documentation not very helpful for understanding the overall picture of how quantization works in ONNX; the discussion in issues such as onnx/onnx#2659 provided more clarity.
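To make the QDQ idea concrete, here is a minimal NumPy sketch (not ORT's actual implementation) of what a QuantizeLinear → DequantizeLinear pair computes for a per-tensor int8 scale/zero-point. The float operator sandwiched between a DequantizeLinear and a QuantizeLinear still sees float32 tensors, which is why the graph shows a regular fp32 Conv until the runtime fuses the pattern. The scale and zero-point values below are made-up examples.

```python
import numpy as np

def quantize_linear(x, scale, zero_point):
    # Round to nearest and saturate to the signed int8 range,
    # as the ONNX QuantizeLinear op does.
    q = np.rint(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_linear(q, scale, zero_point):
    # Map int8 values back to float32: (q - zero_point) * scale.
    return ((q.astype(np.int32) - zero_point) * scale).astype(np.float32)

x = np.array([-1.0, -0.1, 0.0, 0.25, 0.5], dtype=np.float32)
scale, zero_point = 0.0078125, 0  # example per-tensor parameters

q = quantize_linear(x, scale, zero_point)
dq = dequantize_linear(q, scale, zero_point)

print(q.dtype, dq.dtype)                     # int8 float32
print(np.max(np.abs(dq - x)) <= scale / 2)   # True: within half a quantization step
```

A runtime that recognizes the DQ → Conv → Q pattern can replace it with a single integer convolution; a runtime that does not simply executes the three nodes as written, in float.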
By default, ONNX Runtime applies graph optimizations when it loads a model, so it behaves the same as if you had saved the optimized model offline; loading may just take longer. It will only run un-optimized if you turn graph optimizations off.