Slow performance compared to original llama.cpp #40
It's very weird that the prefill and decoding phases have nearly the same throughput. #30 also demonstrates that even on a memory-bottlenecked machine, prefill still achieves a significant speedup. Can you verify it using
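(The comment above is cut off. As a sketch of the kind of check being asked for, prefill and decode throughput can be read directly off llama.cpp's llama_print_timings lines; the numbers below are copied from the T-MAC log further down in this thread:)

```python
# Sketch: derive prefill vs. decode throughput from llama_print_timings.
prompt_ms, prompt_tokens = 2035.98, 18    # "prompt eval time" line (prefill)
eval_ms, eval_tokens     = 18743.23, 127  # "eval time" line (decode)

prefill_tps = prompt_tokens / (prompt_ms / 1000.0)
decode_tps  = eval_tokens / (eval_ms / 1000.0)
print(f"prefill: {prefill_tps:.2f} t/s, decode: {decode_tps:.2f} t/s")
# prefill: 8.84 t/s, decode: 6.78 t/s -- nearly the same, which is the anomaly
```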
The decoding performance seems to be very close. Can you share the version of your
The model size of EfficientQAT is 5.41 GiB, while the model size of Q4_0 is 4.66 GiB (according to https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF-v2/tree/main). This is mainly because the large token embedding is quantized in llama.cpp's Q4_0, but not in EfficientQAT (or GPTQ). It could be further optimized. For now, I recommend using models like Phi-3.5 or Llama-2 for a fair comparison.
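(To make the size gap concrete, here is a rough accounting. This is a sketch, not exact GGUF bookkeeping: the fp16 embedding and Q4_0's effective ~4.5 BPW are assumptions, not figures from the comment.)

```python
# Rough size accounting for the gap between the two GGUF files.
vocab, dim = 128256, 4096                  # Llama-3 vocabulary and hidden size
embed_params = vocab * dim                 # ~0.53B parameters in token_embd

fp16_gib = embed_params * 2 / 2**30        # ~0.98 GiB if kept unquantized (fp16)
q4_gib   = embed_params * 4.5 / 8 / 2**30  # ~0.28 GiB at ~4.5 BPW (Q4_0)
print(f"embedding gap ≈ {fp16_gib - q4_gib:.2f} GiB")
# ≈ 0.70 GiB, i.e. most of the observed 5.41 − 4.66 = 0.75 GiB difference
```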
Thanks for your reply.

/opt/llama.cpp/build# ./bin/llama-cli --version
"llm_load_print_meta: model size = 5.41 GiB (5.79 BPW) ..."

llama-bench results follow below.
/opt/llama.cpp/build# ./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -t 1
build: 82e3b03c (3664)

/opt/llama.cpp/build# ./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -t 4
build: 82e3b03c (3664)

./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -t 1; sleep 10; ./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -t 4
build: 70c312d (2854)
build: 70c312d (2854)

I'm downloading some other models (w2g128 and QuantFactory/Llama-2-7b).
./bin/llama-bench -m Llama-2-7b-chat-hf.Q4_0.gguf -t 4
build: 82e3b03c (3664)

./3rdparty/llama.cpp/build/bin/llama-bench -m Llama-2-7b-EfficientQAT-w4g128/ggml-model.in.gguf -t 4
build: 70c312d (2854)
./bin/llama-bench -m Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf -t 4
build: 82e3b03c (3664)

./bin/llama-bench -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w2g128/ggml-model.in.gguf -t 4
build: 70c312d (2854)
Hi, I found T-MAC faster for 2-bit. However, all the T-MAC models generate garbled output.
The same llama.cpp commit that T-MAC is based on (70c312d), run standalone as a baseline:

(base) root@4b6ac2cf95f0:/opt/llama.cpp-70c312d/build# ./bin/llama-bench -m Llama-2-7b-chat-hf.Q2_K.gguf -t 4
build: unknown (0)

(base) root@4b6ac2cf95f0:/opt/llama.cpp-70c312d/build# ./bin/llama-bench -m Llama-2-7b-chat-hf.Q4_0.gguf -t 4
build: unknown (0)
I have tried on two platforms: an i5-12490F with 64 GB DDR5-6400, and an EPYC 7302 (16 cores, 3.0 GHz) with 128 GB DDR4-3200 (measured memory read bandwidth 118 GB/s).

Below are logs from the 7302: first T-MAC, then the latest llama.cpp, for models with a similar configuration (Llama-3-8B-Instruct, 4-bit / Q4_0).

I'll post logs for the 12490F, and also for a 14700, later. On the 12490F, T-MAC prefill is about 3 times slower.

So, why? What's missing? And how can I speed up tokens/s with T-MAC, especially prefill?
#30
Thx.
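(For context on the 118 GB/s figure above: decode is roughly memory-bound, so a back-of-envelope ceiling follows from bandwidth divided by model size. A sketch, assuming every weight is read once per generated token:)

```python
# Back-of-envelope decode ceiling on the EPYC 7302 (assumes decode is
# memory-bound and reads all weights once per generated token).
bandwidth_gb_s = 118.0   # measured read bandwidth from the report
model_gib = 5.41         # EfficientQAT w4g128 model size

model_gb = model_gib * 2**30 / 1e9
print(f"decode ceiling ≈ {bandwidth_gb_s / model_gb:.1f} tokens/s")
# ≈ 20 t/s at full bandwidth; the observed ~6.8 t/s with -t 4 suggests
# four threads do not saturate the socket, so both runtimes land at parity
```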
Running command in /opt/T-MAC/3rdparty/llama.cpp/build:
/opt/T-MAC/3rdparty/llama.cpp/build/bin/main -m /opt/T-MAC/Llama-3-8b-instruct-EfficientQAT-w4g128/ggml-model.in.gguf -n 128 -t 4 -p Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. -ngl 0 -c 2048
Check logs/2024-09-06-08-36-28.log for inference output
(base) root@b96728f2aeed:/opt/T-MAC# cat logs/2024-09-06-08-36-28.log
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2854 (70c312d)
main: built with clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) for x86_64-unknown-linux-gnu
main: seed = 1725611788
[08:36:28] /opt/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init
llm_load_print_meta: model size = 5.41 GiB (5.79 BPW)
llm_load_print_meta: general.name = Llama-3-8b-instruct-EfficientQAT-w4g128
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token = 128256 '[PAD]'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 5749.03 MiB
..................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 4 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 128, n_keep = 0
<|begin_of_text|>Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.imersimersimersimersushihandsimersimersushiimersaqimersimersimersindeimersindeimerspinsimersellersimersayasimersindeushiimersushiushiimersayasaqimersindeimersindeushiimershandsushiayasimersushiimersindeushiimersindeimersimersorgeimersimersingoimersushiaqimersimersingoimersimersersimersorgeimersimersaqimersimersellersushiimersimersindeimershandsushihandsimersimersimersimersayasellersantimersushiimersimersimersimersimersimersimersimersimersayasushiimersimersimersuggyindeindeimerspinsaqimersimersimersimershandsimersimersorgeiringimersushiimersimersimersaqimersingohandsimershands
llama_print_timings: load time = 1000.39 ms
llama_print_timings: sample time = 10.08 ms / 128 runs ( 0.08 ms per token, 12704.71 tokens per second)
llama_print_timings: prompt eval time = 2035.98 ms / 18 tokens ( 113.11 ms per token, 8.84 tokens per second)
llama_print_timings: eval time = 18743.23 ms / 127 runs ( 147.58 ms per token, 6.78 tokens per second)
llama_print_timings: total time = 20853.05 ms / 145 tokens
Log end
####################################################################
llama.cpp
####################################################################
./bin/llama-cli -m Meta-Llama-3-8B-Instruct-v2.Q4_0.gguf -n 128 -t 4 -p 'Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.' -ngl 0 -c 2048
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 4 (n_threads_batch = 4) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 128, n_keep = 0
Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. It is one of the largest and most successful technology companies in the world. Microsoft was founded in 1975 by Bill Gates and Paul Allen, and it is known for its Windows operating system, Microsoft Office software suite, and Xbox gaming console. The company is also known for its cloud computing platform, Azure, and its artificial intelligence (AI) and machine learning (ML) technologies.
Microsoft is a multinational corporation with operations in over 100 countries. It has a market capitalization of over $2 trillion and employs over 140,000 people worldwide. The company is headquartered in Redmond, Washington, and has major offices in New York
llama_print_timings: load time = 684.18 ms
llama_print_timings: sample time = 10.31 ms / 128 runs ( 0.08 ms per token, 12411.52 tokens per second)
llama_print_timings: prompt eval time = 1259.19 ms / 17 tokens ( 74.07 ms per token, 13.50 tokens per second)
llama_print_timings: eval time = 18773.20 ms / 127 runs ( 147.82 ms per token, 6.76 tokens per second)
llama_print_timings: total time = 20089.71 ms / 144 tokens
Log end
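(Reading the two logs side by side: a quick summary of the ratios, with the tokens/s figures copied from the llama_print_timings lines above.)

```python
# Throughput from the two logs above (tokens/s from llama_print_timings).
tmac     = {"prefill": 8.84,  "decode": 6.78}   # T-MAC, EfficientQAT w4g128
llamacpp = {"prefill": 13.50, "decode": 6.76}   # llama.cpp, Q4_0

for phase in ("prefill", "decode"):
    print(f"{phase}: T-MAC/llama.cpp = {tmac[phase] / llamacpp[phase]:.2f}x")
# prefill: 0.65x (T-MAC slower), decode: 1.00x (parity) -- the gap is prefill
```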