
It's slow. #152

Open
myan-o opened this issue Jan 26, 2025 · 4 comments

myan-o commented Jan 26, 2025

I tried running an 8B model on a single Snapdragon Gen3. It was about the same speed as a 14B model on llama.cpp. Is this expected behavior?

It was built with the ARM extended instructions dotprod and i8mm, and with OpenBLAS.

Even without using extended instructions, there was a 4x speed difference.
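
(For reference, a quick way to confirm that the CPU actually exposes those extensions at runtime, assuming an aarch64 Linux/Android environment such as Termux, is to look for the "asimddp" (dotprod) and "i8mm" flags in /proc/cpuinfo. The Python sketch below is only an illustration and is not part of Distributed Llama or llama.cpp:)

```python
# Illustrative check: on aarch64 Linux/Android, /proc/cpuinfo lists
# "asimddp" for the dot-product extension and "i8mm" for the Int8
# matrix-multiply extension in the "Features" line.
wanted = {"asimddp", "i8mm"}

flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("Features"):
            flags.update(line.split(":", 1)[1].split())

for feature in sorted(wanted):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```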

Distributed-Llama

https://huggingface.co/b4rtaz/Llama-3-8B-Q40-Instruct-Distributed-Llama

time ./dllama inference --steps 64 --prompt "Hello world" --model dllama_model_lama3_instruct_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --nthreads 6
⏩ Loaded 6175568 kB
🔶 G  873 ms I  862 ms T   10 ms S      0 kB R      0 kB Hello
🔶 G  794 ms I  785 ms T    8 ms S      0 kB R      0 kB  world
🔶 G  809 ms I  793 ms T   14 ms S      0 kB R      0 kB !
🔶 G  808 ms I  792 ms T   13 ms S      0 kB R      0 kB  �
🔶 G  851 ms I  837 ms T   12 ms S      0 kB R      0 kB 
🔶 G  793 ms I  782 ms T    9 ms S      0 kB R      0 kB 

🔶 G  812 ms I  799 ms T   11 ms S      0 kB R      0 kB This
🔶 G  800 ms I  786 ms T   12 ms S      0 kB R      0 kB  is
🔶 G  849 ms I  832 ms T   15 ms S      0 kB R      0 kB  my
🔶 G  800 ms I  787 ms T   11 ms S      0 kB R      0 kB  first
🔶 G  802 ms I  787 ms T   13 ms S      0 kB R      0 kB  blog
🔶 G  803 ms I  791 ms T   10 ms S      0 kB R      0 kB  post
🔶 G  839 ms I  822 ms T   15 ms S      0 kB R      0 kB .

real    0m23.729s
user    1m8.151s
sys     0m10.009s
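
(For context, the G column above appears to be the total per-token generation time; averaging the values works out to roughly 800 ms per token, i.e. about 1.2 tokens/s on 6 threads. A minimal Python sketch for computing that from a pasted log, assuming the log format shown above; this is only an illustration, not part of either project:)

```python
import re
import sys

# Collect the "G <n> ms" fields from a Distributed Llama log read on stdin
# and report the average per-token latency and the resulting tokens/s.
times_ms = [int(m.group(1)) for m in re.finditer(r"G\s+(\d+)\s+ms", sys.stdin.read())]
if times_ms:
    avg = sum(times_ms) / len(times_ms)
    print(f"{len(times_ms)} tokens, avg {avg:.0f} ms/token, {1000 / avg:.2f} tokens/s")
```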

llama.cpp

https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF/blob/main/DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf

time ./llama-run ~/gguf/DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf "Hello world"
<think>

</think>

Hello! How can I assist you today? 😊

real    0m6.087s
user    0m13.465s
sys     0m1.835s
b4rtaz (Owner) commented Jan 28, 2025

Hello @myan-o,

Are you referring to token prediction or token evaluation? Unfortunately, Distributed Llama does not currently support token evaluation.

myan-o (Author) commented Jan 28, 2025

I’m referring to token prediction. The token generation is slow. Could you provide any advice on how to improve its performance?

b4rtaz (Owner) commented Jan 28, 2025

What mode are you using? Can you provide more details about the performance differences?

myan-o (Author) commented Jan 28, 2025

I've added the information to the issue. llama.cpp is about 4x faster; the difference is large enough to notice immediately in use.
