Using T-MAC is slower than original llama.cpp #79
Comments
Hi @xdd130, what's your testing OS and hardware config?
Hi @BodhiHu
@xdd130 Since T-MAC uses a different set of instructions (tbl/shuf) than multiply-based methods (mul/madd/...), the performance gap can vary from CPU to CPU. AVX512 on Zen4 may be one of the reasons. BTW, the convert script keeps the embedding/output weights in FP16, while Q4_K uses smaller types. You can try re-running with the embedding/output weights quantized as well.
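For illustration, upstream llama.cpp's quantize tool can produce a GGUF whose embedding/output tensors match what a q4_k_m model typically uses. This is a sketch with placeholder file names and paths, not the exact command from this thread:

```bash
# Quantize to Q4_K_M, pinning the token-embedding and output tensors
# to the types llama.cpp's q4_k_m models typically carry (q4_K / q6_K)
./build/bin/llama-quantize \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    qwen2.5-3b-instruct-f16.gguf \
    qwen2.5-3b-instruct-q4_k_m.gguf \
    Q4_K_M
```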
@xdd130 I ran the Qwen model you mentioned on my Intel i7-12700 with the following results. Note that this model has a very large vocabulary. The default convert script only converts the main weights into INT4 and leaves the embedding/output weights as-is, while llama.cpp's q4_k_m uses a Q4_K embedding and a Q6_K output tensor. This is why there is a big gap in model size and why T-MAC appears slower. You can quantize the embedding/output weights to comparable types for a fair comparison.
llama.cpp seems to have gained some prefill optimizations recently. We will look into the prefill gap.
Test platform: AMD Ryzen 5 7600X
T-MAC test steps:
Test model: Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4
Compilation instructions:
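A typical x86 T-MAC setup, following the project README, looks roughly like the sketch below; the `run_pipeline.py` arguments are assumptions and the exact flags used in this test may differ:

```bash
# Clone T-MAC with its submodules (it vendors a patched llama.cpp)
git clone --recursive https://github.com/microsoft/T-MAC.git
cd T-MAC

# Install T-MAC and its TVM-based kernel compiler
pip install -e . -v
source build/t-mac-envs.sh

# One-shot pipeline: convert the GPTQ model to GGUF, tune/compile the
# LUT kernels, and build llama.cpp against them
# (-o points at the downloaded Hugging Face model directory;
#  check `python tools/run_pipeline.py -h` for the current options)
python tools/run_pipeline.py -o /path/to/Qwen2.5-3B-Instruct-GPTQ-Int4
```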
Test instructions:
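The benchmark would be run with the `llama-bench` built by the pipeline; the binary path, output file name, token counts, and thread count below are assumptions:

```bash
# llama-bench from the llama.cpp tree that T-MAC builds
# (-p = prompt/prefill tokens, -n = generated tokens, -t = threads)
./3rdparty/llama.cpp/build/bin/llama-bench \
    -m /path/to/qwen2.5-3b-instruct-int4.gguf -p 512 -n 128 -t 6
```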
Result:
Original llama.cpp:
Test model: Qwen/Qwen2.5-3B-Instruct-GGUF/qwen2.5-3b-instruct-q4_k_m.gguf
Compilation instructions:
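For reference, a standard CPU build of upstream llama.cpp is sketched below; the actual configuration used in this test may have differed:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Release build; llama.cpp's CMake auto-detects AVX2/AVX512 on the host
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```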
Test instructions:
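The matching benchmark invocation would be something like this (token counts and thread count are assumptions, chosen to mirror the T-MAC run above):

```bash
./build/bin/llama-bench \
    -m qwen2.5-3b-instruct-q4_k_m.gguf -p 512 -n 128 -t 6
```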
Result:
From these test results, T-MAC seems to have no performance advantage on this machine. What could be the reason?