Replies: 1 comment
-
Thanks for your suggestions. In our experience, GPUs are not well suited to LUT-based kernels because of their limited on-chip memory per core. Placing a LUT in shared memory can lead to slow random access due to bank conflicts. However, using the CPU, GPU, and NPU in concert is still a viable option, with the GPU/NPU running a dequantization-based method and the CPU running T-MAC. We are exploring the possibility.
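To make the bank-conflict point concrete, here is a rough toy model in Python, not a measurement of any real kernel. It assumes shared memory is split into 32 banks of 4-byte words and that a warp has 32 threads (the layout documented for recent NVIDIA GPUs); the names `conflict_degree` and `average_degree` and the 1024-word table size are made up for illustration. It counts how many serialized passes a single warp-wide read needs for a given set of table indices.

```python
import random

NUM_BANKS = 32   # assumption: 32 banks, one 4-byte word per bank per cycle
WARP_SIZE = 32   # assumption: 32 threads issue the read together


def conflict_degree(word_indices):
    """Return the number of serialized passes for one warp-wide read.

    Threads that read the same word are served by a single broadcast,
    so only distinct words that land in the same bank are counted.
    """
    words_per_bank = {}
    for word in set(word_indices):
        bank = word % NUM_BANKS
        words_per_bank[bank] = words_per_bank.get(bank, 0) + 1
    return max(words_per_bank.values())


def average_degree(lut_words, trials=2000, seed=0):
    """Average conflict degree when each thread picks a random LUT entry."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        indices = [rng.randrange(lut_words) for _ in range(WARP_SIZE)]
        total += conflict_degree(indices)
    return total / trials


# Consecutive words: every thread hits a different bank, so one pass.
sequential = conflict_degree(list(range(WARP_SIZE)))

# Stride of 32 words: every thread hits bank 0, so fully serialized.
strided = conflict_degree([i * NUM_BANKS for i in range(WARP_SIZE)])

# Data-dependent lookups into a hypothetical 1024-word table.
random_avg = average_degree(lut_words=1024)

print(f"sequential: {sequential} pass(es)")
print(f"strided:    {strided} pass(es)")
print(f"random:     {random_avg:.2f} passes on average")
```

In this model, regular access patterns can be arranged to be conflict-free, whereas data-dependent LUT indices need more than one pass on average, which is the slowdown referred to above. Real hardware behavior also depends on element width, table layout, and the specific architecture.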
-
It would be great if the enhanced kernel supported both CPU- and GPU-accelerated inference and, in addition, distributed computing. It would instantly become one of the faster and more useful inference engine algorithms!