Thanks for your interest!
I should first mention that this PyTorch implementation of I-BERT only searches for the integer parameters (i.e., performs quantization-aware training) that minimize the accuracy degradation compared to the full-precision counterpart.
As far as I know, PyTorch does not support integer operations (except through its own quantization library, whose functionality is very limited), so this PyTorch implementation does not by itself achieve any latency reduction on real hardware.
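To illustrate the point, quantization-aware training typically simulates integer arithmetic in floating point ("fake quantization"): values are snapped to an integer grid and rescaled, but the forward pass still runs in fp32. The sketch below shows a generic symmetric fake-quantization function; it is only a minimal illustration, not I-BERT's exact quantization scheme:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Symmetric uniform quantization simulated in floating point:
    # round to an integer grid, clamp to the representable range,
    # then rescale back to the original dynamic range.
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale  # still an fp32 tensor, hence no hardware speedup
```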
In order to deploy I-BERT on a GPU or CPU and obtain a speedup, you additionally need to export the integer parameters obtained from this implementation, along with the model architecture, to a framework that supports deployment on integer processing units; TVM and TensorRT are two such examples.
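For instance, a common route to TensorRT is to first export the trained model to ONNX. The following is a hypothetical sketch of that step; `TinyEncoder` is a stand-in placeholder, not the actual I-BERT model or its export path:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    # Placeholder for a quantization-aware-trained transformer encoder.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(30522, 64)  # BERT-sized vocabulary
        self.proj = nn.Linear(64, 2)

    def forward(self, input_ids):
        return self.proj(self.embed(input_ids).mean(dim=1))

model = TinyEncoder().eval()
dummy = torch.randint(0, 30522, (1, 128))  # batch of token ids
# Export to ONNX; TensorRT can then build an INT8 engine from this file.
torch.onnx.export(model, dummy, "model.onnx", opset_version=13,
                  input_names=["input_ids"], output_names=["logits"])
```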
Excellent work!
Can it run inference on the CPU?
And how much faster is it than the baseline?