It's slow. #152
Hello @myan-o, are you referring to token prediction or token evaluation? Unfortunately, Distributed Llama does not currently support token evaluation.
I'm referring to token prediction. The token generation is slow. Could you provide any advice on how to improve its performance?
What mode are you using? Can you provide more details about the performance differences?
I've added the information to the issue. llama.cpp is about 4x faster; the difference is immediately noticeable.
I tried running an 8B model on a single Snapdragon 8 Gen 3. It was about the same speed as a 14B model on llama.cpp. Is this expected behavior?
The build uses the ARM extended instructions dotprod and i8mm, plus OpenBLAS.
Even without the extended instructions, there was a 4x speed difference.
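For context on the build flags mentioned above, here is a rough sketch of how one might enable dotprod and i8mm in both projects. The exact mechanism depends on the compiler and on each project's current build system: llama.cpp does build with CMake, but forcing ISA extensions via `-march` and passing `CFLAGS` through to Distributed Llama's `make dllama` target are assumptions, not verified commands.

```sh
# llama.cpp: CMake usually auto-detects ARM features; they can also be
# forced through compiler flags (assumed target: ARMv8.2+ with dotprod/i8mm).
cmake -B build \
  -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+i8mm" \
  -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+i8mm"
cmake --build build --config Release

# Distributed Llama: built with make; whether CFLAGS is honored this way
# is an assumption and may need adjusting per the project's README.
CFLAGS="-march=armv8.2-a+dotprod+i8mm" make dllama
```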
Distributed Llama model:
https://huggingface.co/b4rtaz/Llama-3-8B-Q40-Instruct-Distributed-Llama
llama.cpp model:
https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF/blob/main/DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf
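Given the two models linked above, a sketch like the following could make the 4x claim reproducible on one machine. `llama-bench` ships with llama.cpp and its `-m`/`-t` flags are real; the `dllama` flags and the model/tokenizer file names are assumptions recalled from the Distributed Llama README and the linked Hugging Face repo, so double-check them against the current documentation.

```sh
# llama.cpp: llama-bench reports prompt-processing and token-generation
# speed (tokens/s) for the given model and thread count.
./build/bin/llama-bench -m DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf -t 8

# Distributed Llama: single-node inference run; file names and flags are
# assumptions based on the project's README and the repo linked above.
./dllama inference \
  --model dllama_model_llama3_8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --nthreads 8 \
  --steps 64 \
  --prompt "Hello"
```

Note that the two linked models are different fine-tunes (Llama 3 Instruct vs. DeepSeek-R1-Distill), but both are 8B Llama-architecture models at Q4_0/Q40 quantization, so the tokens-per-second comparison is still meaningful.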