Using T-MAC is slower than original llama.cpp #79
Comments
Hi @xdd130, what's your testing OS and hardware config?
Hi @BodhiHu
@xdd130 Since T-MAC uses a different set of instructions (tbl/shuf) than multiply-based methods (mul/madd/...), the performance gap can vary from CPU to CPU. AVX512 on Zen4 may be one of the reasons. BTW, the convert script keeps the embedding/output weights in FP16, while Q4_K uses smaller types. You can try re-running with the embedding/output weights quantized as well.
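For illustration, upstream llama.cpp's quantize tool can produce a GGUF whose embedding/output tensors match what a q4_k_m model typically uses. This is a sketch with placeholder file names and paths, not the exact command from this thread:

```bash
# Quantize to Q4_K_M, pinning the token-embedding and output tensors
# to the types llama.cpp's q4_k_m models typically carry (q4_K / q6_K)
./build/bin/llama-quantize \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    qwen2.5-3b-instruct-f16.gguf \
    qwen2.5-3b-instruct-q4_k_m.gguf \
    Q4_K_M
```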
@xdd130 I ran the Qwen model you mentioned on my Intel i7-12700 with the following results. Note that this model has a very large vocabulary. The default convert script only converts the main weights into INT4 and leaves the embedding/output weights as-is, while llama.cpp's q4_k_m uses a Q4_K embedding and a Q6_K output tensor. This is why there is a big gap in model size and why T-MAC appears slower. You can quantize the embedding/output weights to comparable types for a fair comparison.
llama.cpp seems to have gained some prefill optimizations recently. We will look into the prefill gap.
Test platform: AMD Ryzen 5 7600X
T-MAC test steps:
Test model: Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4
Compilation instructions:
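A typical x86 T-MAC setup, following the project README, looks roughly like the sketch below; the `run_pipeline.py` arguments are assumptions and the exact flags used in this test may differ:

```bash
# Clone T-MAC with its submodules (it vendors a patched llama.cpp)
git clone --recursive https://github.com/microsoft/T-MAC.git
cd T-MAC

# Install T-MAC and its TVM-based kernel compiler
pip install -e . -v
source build/t-mac-envs.sh

# One-shot pipeline: convert the GPTQ model to GGUF, tune/compile the
# LUT kernels, and build llama.cpp against them
# (-o points at the downloaded Hugging Face model directory;
#  check `python tools/run_pipeline.py -h` for the current options)
python tools/run_pipeline.py -o /path/to/Qwen2.5-3B-Instruct-GPTQ-Int4
```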
Test instructions:
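The benchmark would be run with the `llama-bench` built by the pipeline; the binary path, output file name, token counts, and thread count below are assumptions:

```bash
# llama-bench from the llama.cpp tree that T-MAC builds
# (-p = prompt/prefill tokens, -n = generated tokens, -t = threads)
./3rdparty/llama.cpp/build/bin/llama-bench \
    -m /path/to/qwen2.5-3b-instruct-int4.gguf -p 512 -n 128 -t 6
```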
Result:
Original llama.cpp:
Test model: Qwen/Qwen2.5-3B-Instruct-GGUF/qwen2.5-3b-instruct-q4_k_m.gguf
Compilation instructions:
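For reference, a standard CPU build of upstream llama.cpp is sketched below; the actual configuration used in this test may have differed:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Release build; llama.cpp's CMake auto-detects AVX2/AVX512 on the host
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```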
Test instructions:
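The matching benchmark invocation would be something like this (token counts and thread count are assumptions, chosen to mirror the T-MAC run above):

```bash
./build/bin/llama-bench \
    -m qwen2.5-3b-instruct-q4_k_m.gguf -p 512 -n 128 -t 6
```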
Result:
From these test results, T-MAC seems to have no performance advantage on this machine. What could be the reason?