Thanks for your interest!
I should first mention that this PyTorch implementation of I-BERT only searches for the integer parameters (i.e., performs quantization-aware training) that minimize the accuracy degradation compared to the full-precision counterpart.
As far as I know, PyTorch does not support integer operations (except through its own quantization library, whose functionality is very limited), so this PyTorch implementation does not by itself achieve any latency reduction on real hardware.
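To illustrate the point, quantization-aware training typically simulates integer arithmetic in floating point ("fake quantization"): values are snapped to an integer grid and rescaled, but the forward pass still runs in fp32. The sketch below shows a generic symmetric fake-quantization function; it is only a minimal illustration, not I-BERT's exact quantization scheme:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Symmetric uniform quantization simulated in floating point:
    # round to an integer grid, clamp to the representable range,
    # then rescale back to the original dynamic range.
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale  # still an fp32 tensor, hence no hardware speedup
```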
In order to deploy I-BERT on a GPU or CPU and obtain a speedup, you additionally need to export the integer parameters obtained from this implementation, along with the model architecture, to a framework that supports deployment on integer processing units; TVM and TensorRT are two such examples.
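For instance, a common route to TensorRT is to first export the trained model to ONNX. The following is a hypothetical sketch of that step; `TinyEncoder` is a stand-in placeholder, not the actual I-BERT model or its export path:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    # Placeholder for a quantization-aware-trained transformer encoder.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(30522, 64)  # BERT-sized vocabulary
        self.proj = nn.Linear(64, 2)

    def forward(self, input_ids):
        return self.proj(self.embed(input_ids).mean(dim=1))

model = TinyEncoder().eval()
dummy = torch.randint(0, 30522, (1, 128))  # batch of token ids
# Export to ONNX; TensorRT can then build an INT8 engine from this file.
torch.onnx.export(model, dummy, "model.onnx", opset_version=13,
                  input_names=["input_ids"], output_names=["logits"])
```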
Excellent work!
Can it run inference on the CPU?
And how much faster is it than the baseline?