
It's slow. #152

Open
myan-o opened this issue Jan 26, 2025 · 4 comments

myan-o commented Jan 26, 2025

I tried running an 8B model on a single Snapdragon Gen3. It was about the same speed as a 14B model on llama.cpp. Is this expected behavior?

It was built with the ARM extended instructions dotprod and i8mm, and with OpenBLAS.

Even without using extended instructions, there was a 4x speed difference.
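
(For reference, a quick way to confirm that the CPU actually exposes those extensions at runtime, assuming an aarch64 Linux/Android environment such as Termux, is to look for the "asimddp" (dotprod) and "i8mm" flags in /proc/cpuinfo. The Python sketch below is only an illustration and is not part of Distributed Llama or llama.cpp:)

```python
# Illustrative check: on aarch64 Linux/Android, /proc/cpuinfo lists
# "asimddp" for the dot-product extension and "i8mm" for the Int8
# matrix-multiply extension in the "Features" line.
wanted = {"asimddp", "i8mm"}

flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("Features"):
            flags.update(line.split(":", 1)[1].split())

for feature in sorted(wanted):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```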

Distributed-Llama

https://huggingface.co/b4rtaz/Llama-3-8B-Q40-Instruct-Distributed-Llama

time ./dllama inference --steps 64 --prompt "Hello world" --model dllama_model_lama3_instruct_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --nthreads 6
⏩ Loaded 6175568 kB
🔶 G  873 ms I  862 ms T   10 ms S      0 kB R      0 kB Hello
🔶 G  794 ms I  785 ms T    8 ms S      0 kB R      0 kB  world
🔶 G  809 ms I  793 ms T   14 ms S      0 kB R      0 kB !
🔶 G  808 ms I  792 ms T   13 ms S      0 kB R      0 kB  �
🔶 G  851 ms I  837 ms T   12 ms S      0 kB R      0 kB 
🔶 G  793 ms I  782 ms T    9 ms S      0 kB R      0 kB 

🔶 G  812 ms I  799 ms T   11 ms S      0 kB R      0 kB This
🔶 G  800 ms I  786 ms T   12 ms S      0 kB R      0 kB  is
🔶 G  849 ms I  832 ms T   15 ms S      0 kB R      0 kB  my
🔶 G  800 ms I  787 ms T   11 ms S      0 kB R      0 kB  first
🔶 G  802 ms I  787 ms T   13 ms S      0 kB R      0 kB  blog
🔶 G  803 ms I  791 ms T   10 ms S      0 kB R      0 kB  post
🔶 G  839 ms I  822 ms T   15 ms S      0 kB R      0 kB .

real    0m23.729s
user    1m8.151s
sys     0m10.009s
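
(For context, the G column above appears to be the total per-token generation time; averaging the values works out to roughly 800 ms per token, i.e. about 1.2 tokens/s on 6 threads. A minimal Python sketch for computing that from a pasted log, assuming the log format shown above; this is only an illustration, not part of either project:)

```python
import re
import sys

# Collect the "G <n> ms" fields from a Distributed Llama log read on stdin
# and report the average per-token latency and the resulting tokens/s.
times_ms = [int(m.group(1)) for m in re.finditer(r"G\s+(\d+)\s+ms", sys.stdin.read())]
if times_ms:
    avg = sum(times_ms) / len(times_ms)
    print(f"{len(times_ms)} tokens, avg {avg:.0f} ms/token, {1000 / avg:.2f} tokens/s")
```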

llama.cpp

https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF/blob/main/DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf

time ./llama-run ~/gguf/DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf "Hello world"
<think>

</think>

Hello! How can I assist you today? 😊

real    0m6.087s
user    0m13.465s
sys     0m1.835s
b4rtaz (Owner) commented Jan 28, 2025

Hello @myan-o,

Are you referring to token prediction or token evaluation? Unfortunately, Distributed Llama does not currently support token evaluation.

myan-o (Author) commented Jan 28, 2025

I’m referring to token prediction. The token generation is slow. Could you provide any advice on how to improve its performance?

b4rtaz (Owner) commented Jan 28, 2025

What mode are you using? Can you provide more details about the performance differences?

myan-o (Author) commented Jan 28, 2025

I've added the information to the issue. llama.cpp is about 4x faster; the difference is large enough to notice immediately in use.
