Why are the operators in a static-quantized model still fp32? #24038
-
Hi there. I am a newbie to model quantization. After reading the onnxruntime quantization docs, I know there are two basic ways to represent a quantized model: the operator-oriented (QOperator) format, which uses quantized operators directly, and the tensor-oriented (QDQ) format, which inserts QuantizeLinear/DequantizeLinear pairs around the original operators.
Am I right? In my opinion, the operators of a quantized model should have integer types (e.g. Conv replaced by ConvInteger). But I tried the example in the onnxruntime repo and visualized the static-quantized MobileNet, and the operators are still fp32, as shown below:

[screenshot of the quantized MobileNet graph]
-
Yes, exactly. The idea of the QDQ approach is that it lets you express quantization of many different operators without needing to introduce a quantized variant of each one. To actually benefit from quantization at runtime, the runtime needs to fuse the QDQ patterns internally. If the runtime does not support a particular fusion, the model will still run correctly, just without the performance improvement.

ONNX Runtime has the ability to save the model graph after optimization. You can use that to see how ORT fuses a particular pattern of nodes. I found the current quantization documentation not very helpful for understanding the overall picture of how quantization works in ONNX; the discussion in issues such as onnx/onnx#2659 provided more clarity.
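To make the QDQ idea concrete, here is a minimal NumPy sketch (not ORT's actual implementation) of what a QuantizeLinear → DequantizeLinear pair computes for a per-tensor int8 scale/zero-point. The float operator sandwiched between a DequantizeLinear and a QuantizeLinear still sees float32 tensors, which is why the graph shows a regular fp32 Conv until the runtime fuses the pattern. The scale and zero-point values below are made-up examples.

```python
import numpy as np

def quantize_linear(x, scale, zero_point):
    # Round to nearest and saturate to the signed int8 range,
    # as the ONNX QuantizeLinear op does.
    q = np.rint(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_linear(q, scale, zero_point):
    # Map int8 values back to float32: (q - zero_point) * scale.
    return ((q.astype(np.int32) - zero_point) * scale).astype(np.float32)

x = np.array([-1.0, -0.1, 0.0, 0.25, 0.5], dtype=np.float32)
scale, zero_point = 0.0078125, 0  # example per-tensor parameters

q = quantize_linear(x, scale, zero_point)
dq = dequantize_linear(q, scale, zero_point)

print(q.dtype, dq.dtype)                     # int8 float32
print(np.max(np.abs(dq - x)) <= scale / 2)   # True: within half a quantization step
```

A runtime that recognizes the DQ → Conv → Q pattern can replace it with a single integer convolution; a runtime that does not simply executes the three nodes as written, in float.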
By default, ONNX Runtime applies graph optimizations when it loads a model, so it behaves the same as if you had saved the optimized model offline; loading may just take longer. It will only run un-optimized if you turn graph optimizations off.