
Slow performance compared to llama.cpp origin #40

Open
idreamerhx opened this issue Sep 6, 2024 · 7 comments
Labels: question (Further information is requested)

Comments

idreamerhx commented Sep 6, 2024

I have tried two platforms: an i5-12490F with 64 GB DDR5-6400, and an EPYC 7302 (16 cores, 3.0 GHz) with 128 GB DDR4-3200 (measured memory read bandwidth 118 GB/s).

Below are logs from the 7302: first T-MAC, then the latest llama.cpp, for models with a similar configuration (Llama-3 8B Instruct, 4-bit / Q4_0).

I'll post logs for the 12490F, and also a 14700, later. On the 12490F, T-MAC is about 3 times slower for prefill.

So why is that? What's missing, and how can I speed up tokens/s, especially prefill, with T-MAC?

#30

Thanks.

Running command in /opt/T-MAC/3rdparty/llama.cpp/build:
/opt/T-MAC/3rdparty/llama.cpp/build/bin/main -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -n 128 -t 4 -p Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. -ngl 0 -c 2048
Check logs/2024-09-06-08-36-28.log for inference output
(base) root@b96728f2aeed:/opt/T-MAC# cat logs/2024-09-06-08-36-28.log
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2854 (70c312d)
main: built with clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) for x86_64-unknown-linux-gnu
main: seed = 1725611788
[08:36:28] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

llm_load_print_meta: model size = 5.41 GiB (5.79 BPW)
llm_load_print_meta: general.name = Llama-3-8b-instruct-EfficientQAT-w4g128
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token = 128256 '[PAD]'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 5749.03 MiB
..................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 128, n_keep = 0

<|begin_of_text|>Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.imersimersimersimersushihandsimersimersushiimersaqimersimersimersindeimersindeimerspinsimersellersimersayasimersindeushiimersushiushiimersayasaqimersindeimersindeushiimershandsushiayasimersushiimersindeushiimersindeimersimersorgeimersimersingoimersushiaqimersimersingoimersimersersimersorgeimersimersaqimersimersellersushiimersimersindeimershandsushihandsimersimersimersimersayasellersantimersushiimersimersimersimersimersimersimersimersimersayasushiimersimersimersuggyindeindeimerspinsaqimersimersimersimershandsimersimersorgeiringimersushiimersimersimersaqimersingohandsimershands
llama_print_timings: load time = 1000.39 ms
llama_print_timings: sample time = 10.08 ms / 128 runs ( 0.08 ms per token, 12704.71 tokens per second)
llama_print_timings: prompt eval time = 2035.98 ms / 18 tokens ( 113.11 ms per token, 8.84 tokens per second)
llama_print_timings: eval time = 18743.23 ms / 127 runs ( 147.58 ms per token, 6.78 tokens per second)
llama_print_timings: total time = 20853.05 ms / 145 tokens
Log end

####################################################################
llama.cpp
####################################################

./bin/llama-cli -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -n 128 -t 4 -p 'Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.' -ngl 0 -c 2048

llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 (n_threads_batch = 4) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 128, n_keep = 0

Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. It is one of the largest and most successful technology companies in the world. Microsoft was founded in 1975 by Bill Gates and Paul Allen, and it is known for its Windows operating system, Microsoft Office software suite, and Xbox gaming console. The company is also known for its cloud computing platform, Azure, and its artificial intelligence (AI) and machine learning (ML) technologies.

Microsoft is a multinational corporation with operations in over 100 countries. It has a market capitalization of over $2 trillion and employs over 140,000 people worldwide. The company is headquartered in Redmond, Washington, and has major offices in New York
llama_print_timings: load time = 684.18 ms
llama_print_timings: sample time = 10.31 ms / 128 runs ( 0.08 ms per token, 12411.52 tokens per second)
llama_print_timings: prompt eval time = 1259.19 ms / 17 tokens ( 74.07 ms per token, 13.50 tokens per second)
llama_print_timings: eval time = 18773.20 ms / 127 runs ( 147.82 ms per token, 6.76 tokens per second)
llama_print_timings: total time = 20089.71 ms / 144 tokens
Log end

kaleid-liner (Collaborator) commented

> llama_print_timings: prompt eval time = 2035.98 ms / 18 tokens ( 113.11 ms per token, 8.84 tokens per second)
> llama_print_timings: eval time = 18743.23 ms / 127 runs ( 147.58 ms per token, 6.78 tokens per second)

It's very weird that the prefill and decoding phases have nearly the same throughput. #30 also demonstrates that even on a memory-bottlenecked machine, prefill still achieves a significant speedup. Can you verify it using llama-bench instead? BTW, the tokens generated by T-MAC look wrong. Was the model converted correctly?
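
For reference, a llama-bench invocation along these lines would exercise both phases in one run (the model paths are placeholders; -p 512 reports prefill as pp512 and -n 128 reports decode as tg128):

# placeholder paths -- run once against the T-MAC model and once against the Q4_0 model
./bin/llama-bench -m /path/to/ggml-model.in.gguf -t 4 -p 512 -n 128
./bin/llama-bench -m /path/to/Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -t 4 -p 512 -n 128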

> llama_print_timings: eval time = 18743.23 ms / 127 runs ( 147.58 ms per token, 6.78 tokens per second)
> llama_print_timings: eval time = 18773.20 ms / 127 runs ( 147.82 ms per token, 6.76 tokens per second)

The decoding performance seems very close. Can you share the version of your llama-cli? The latest llama.cpp has had some major updates that optimize multi-threading performance. We are working on updating our llama.cpp to the latest version to apply these updates (#32 (comment)).

> llm_load_print_meta: model size = 5.41 GiB (5.79 BPW)

The model size of EfficientQAT is 5.41 GiB, while the model size of Q4_0 is 4.66 GiB (according to https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF-v2/tree/main). This is mainly because the large token embedding is quantized in llama.cpp's Q4_0, while it is left unquantized in EfficientQAT (or GPTQ). This could be further optimized. For now, I recommend using models like Phi-3.5 or Llama-2 for a fair comparison.
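
A rough back-of-envelope supports this (assuming the token embedding is kept in fp16 in the EfficientQAT GGUF and Q4_0 averages roughly 4.5 bits/weight):

# Llama-3 token embedding: 128256 vocab x 4096 dim
echo "scale=2; 128256*4096*2/1024^3" | bc       # ~0.98 GiB in fp16
echo "scale=2; 128256*4096*4.5/8/1024^3" | bc   # ~0.27 GiB at ~4.5 bpw

The ~0.7 GiB difference roughly matches the 5.41 GiB vs 4.66 GiB gap.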

kaleid-liner added the question (Further information is requested) label on Sep 6, 2024

idreamerhx commented Sep 6, 2024

Thanks for your reply.

/opt/llama.cpp/build# ./bin/llama-cli --version
version: 3664 (82e3b03c)

"llm_load_print_meta: model size = 5.41 GiB (5.79 BPW) ..."
I'll try Llama-2 for both T-MAC and llama.cpp, and use the older llama.cpp (b2854) to get a more accurate comparison.
I'll also benchmark an equivalent single kernel for both.

llama-bench results to follow...


idreamerhx commented Sep 6, 2024

/opt/llama.cpp/build# ./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -t 1

| model         | size     | params | backend | threads | test  | t/s         |
| ------------- | -------- | ------ | ------- | ------- | ----- | ----------- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU     | 1       | pp512 | 3.53 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU     | 1       | tg128 | 1.83 ± 0.00 |

build: 82e3b03c (3664)

/opt/llama.cpp/build# ./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -t 4

| model         | size     | params | backend | threads | test  | t/s          |
| ------------- | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU     | 4       | pp512 | 13.98 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU     | 4       | tg128 | 6.76 ± 0.03  |

build: 82e3b03c (3664)

./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -t 1; sleep 10; ./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -t 4
[11:32:18] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

| model       | size     | params | backend | threads | test   | t/s         |
| ----------- | -------- | ------ | ------- | ------- | ------ | ----------- |
| llama 8B IN | 5.41 GiB | 8.03 B | CPU     | 1       | pp 512 | 2.27 ± 0.00 |
| llama 8B IN | 5.41 GiB | 8.03 B | CPU     | 1       | tg 128 | 1.93 ± 0.00 |

build: 70c312d (2854)
[12:00:33] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

| model       | size     | params | backend | threads | test   | t/s         |
| ----------- | -------- | ------ | ------- | ------- | ------ | ----------- |
| llama 8B IN | 5.41 GiB | 8.03 B | CPU     | 4       | pp 512 | 8.91 ± 0.00 |
| llama 8B IN | 5.41 GiB | 8.03 B | CPU     | 4       | tg 128 | 6.74 ± 0.01 |

build: 70c312d (2854)

I'm downloading some other models: w2g128 and QuantFactory/Llama-2-7b.

idreamerhx (Author) commented

./bin/llama-bench -m Llama-2-7b-chat-hf.Q4_0.gguf -t 4

| model         | size     | params | backend | threads | test  | t/s          |
| ------------- | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 4       | pp512 | 23.63 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 4       | tg128 | 13.85 ± 0.04 |

build: 82e3b03c (3664)

./3rdparty/llama.cpp/build/bin/llama-bench -m Llama-2-7b-EfficientQAT-w4g128/ggml-model.in.gguf -t 4
[07:12:55] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

| model       | size     | params | backend | threads | test   | t/s          |
| ----------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama 7B IN | 3.69 GiB | 6.74 B | CPU     | 4       | pp 512 | 18.78 ± 0.04 |
| llama 7B IN | 3.69 GiB | 6.74 B | CPU     | 4       | tg 128 | 15.27 ± 0.01 |

build: 70c312d (2854)

idreamerhx (Author) commented

./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf -t 4

| model                  | size     | params | backend | threads | test  | t/s          |
| ---------------------- | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | CPU     | 4       | pp512 | 23.03 ± 0.00 |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | CPU     | 4       | tg128 | 16.45 ± 0.00 |

build: 82e3b03c (3664)

./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w2g128/ggml-model.in.gguf -t 4
[06:39:01] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init

| model       | size     | params | backend | threads | test   | t/s          |
| ----------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama 8B IN | 3.99 GiB | 8.03 B | CPU     | 4       | pp 512 | 33.06 ± 0.04 |
| llama 8B IN | 3.99 GiB | 8.03 B | CPU     | 4       | tg 128 | 19.32 ± 0.01 |

build: 70c312d (2854)

idreamerhx (Author) commented

Hi, I found T-MAC faster for 2-bit. However, all of the T-MAC models generate garbled output.
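
A quick sanity check (just a sketch, assuming the perplexity tool from the bundled llama.cpp build and a wikitext test file are available) would be to compare the perplexity of the converted model against the Q4_0 one; a model that only emits garbage should show a dramatically higher value:

# placeholder paths
./3rdparty/llama.cpp/build/bin/perplexity -m Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -f wiki.test.raw -t 4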


idreamerhx commented Sep 10, 2024

llama.cpp built at the same commit as T-MAC's bundled llama.cpp (70c312d), for comparison:

(base) root@4b6ac2cf95f0:/opt/llama.cpp-70c312d/build# ./bin/llama-bench -m Llama-2-7b-chat-hf.Q2_K.gguf -t 4

| model                  | size     | params | backend | threads | test   | t/s          |
| ---------------------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CPU     | 4       | pp 512 | 23.16 ± 0.13 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | CPU     | 4       | tg 128 | 17.90 ± 0.01 |

build: unknown (0)

(base) root@4b6ac2cf95f0:/opt/llama.cpp-70c312d/build# ./bin/llama-bench -m Llama-2-7b-chat-hf.Q4_0.gguf -t 4

| model         | size     | params | backend | threads | test   | t/s          |
| ------------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 4       | pp 512 | 19.60 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 4       | tg 128 | 13.62 ± 0.00 |

build: unknown (0)
