
How to Fully Utilize the Optimized Performance of T-MAC? #30

Open
ma-hang opened this issue Aug 28, 2024 · 2 comments
Labels: question (Further information is requested)

Comments


ma-hang commented Aug 28, 2024

I followed the documentation to run the llama-2-7b model (4-bit quantized) and also ran it on llama.cpp for comparison. Except for nt=1, where there was a slight performance improvement, performance with nt=4/8 was actually worse than with llama.cpp. The command and parameters used were: python tools/run_pipeline.py -o $HOME/tmactest/T-MAC/model2 -m llama-2-7b-4bit -nt 1. It's also worth mentioning that while there was a significant performance improvement during the prefill phase, there was no such improvement during the decode phase.

Output sample:
python tools/run_pipeline.py -o $HOME/tmactest/T-MAC/model2 -m llama-2-7b-4bit -nt 4
Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. Home to some of the m>
Microsoft Office 365 (MSO365) is the world’s most popular office suite, used by more than 180 million users. Microsoft Office 365 is a cloud-based su>
Microsoft Office 365 is a cloud-based suite of productivity applications that includes Microsoft Office, Exchange, SharePoint, and Skype for Business>
llama_print_timings: load time = 600.02 ms
llama_print_timings: sample time = 3.13 ms / 128 runs ( 0.02 ms per token, 40920.72 tokens per second)
llama_print_timings: prompt eval time = 1213.28 ms / 24 tokens ( 50.55 ms per token, 19.78 tokens per second)
llama_print_timings: eval time = 19007.30 ms / 127 runs ( 149.66 ms per token, 6.68 tokens per second)
llama_print_timings: total time = 20237.61 ms / 151 tokens
Log end

kaleid-liner (Collaborator) commented Aug 28, 2024

Are you using a device that is included in our profiling? If not, could you share the specifics of your platform? Based on our observations on some older-generation devices (particularly AVX2 CPUs), there are several potential causes:

  1. Restricted memory bandwidth: If the memory bandwidth of the platform being tested is extremely low (for instance, 10~30 GB/s), inference will be completely memory-bandwidth bound. This scenario can occur on older PCs equipped with 1~2 channels of DDR4/DDR5 memory.

  2. On Intel CPUs prior to Ice Lake, the pshuf instruction has twice the CPI, i.e., half the throughput (see here), which can hurt T-MAC's performance.

However, modern edge devices come with increasingly high memory bandwidth: up to 74 GB/s for phones equipped with the Snapdragon 8 Gen 3, 135 GB/s for laptops with the Snapdragon X Elite, and even 800 GB/s for the M2 Ultra. Moreover, even on the older-generation devices mentioned above, T-MAC should still offer a significant speedup for 2-bit models.
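
For reference, here is a minimal back-of-the-envelope sketch (not part of T-MAC; the ~4.5 bits-per-weight figure is an assumption that folds in quantization scales/metadata) of the decode-throughput ceiling imposed by memory bandwidth when every token has to stream the full weight set from DRAM:

```python
# Rough roofline estimate for decode throughput when memory-bandwidth bound.
# Assumption (not from T-MAC itself): each decoded token streams essentially
# all quantized weights from DRAM once, so tokens/s ~= bandwidth / model bytes.

def decode_tokens_per_second(bandwidth_gbps: float,
                             n_params: float = 7e9,
                             bits_per_weight: float = 4.5) -> float:
    """Upper bound on decode speed for a weight-streaming-bound workload.

    bits_per_weight includes quantization metadata (scales/zeros), so a
    "4-bit" 7B model is modeled here as ~4.5 bits per weight on average.
    """
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / model_bytes

# Example: a dual-channel DDR4 desktop (~25 GB/s) vs. a Snapdragon X Elite laptop (~135 GB/s)
for bw in (25, 135):
    print(f"{bw:>3} GB/s -> ~{decode_tokens_per_second(bw):.1f} tokens/s ceiling")
```

If both T-MAC and llama.cpp already decode near this ceiling, adding threads (nt=4/8) cannot help, because compute is not the limiting factor.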

kaleid-liner (Collaborator) commented Aug 28, 2024

It's also worth mentioning that while there was a significant performance improvement during the prefill phase, there was no such improvement during the decode phase

This is a clear sign that decoding is bottlenecked by memory bandwidth.
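
As a rough sanity check (again an assumption-laden estimate, using the same hypothetical ~4.5 bits per weight as in the sketch above rather than anything measured from T-MAC), the reported eval speed of 6.68 tokens per second implies an effective bandwidth of roughly 26 GB/s, squarely in the low-bandwidth scenario described earlier:

```python
# Sanity check against the numbers reported above (hypothetical arithmetic,
# not part of T-MAC): if decode reaches 6.68 tokens/s and each token has to
# stream roughly the whole 4-bit llama-2-7b weight set (~3.9 GB including
# quantization metadata), the implied effective memory bandwidth is:
model_bytes = 7e9 * 4.5 / 8           # ~3.9 GB, assumed average bits/weight
tokens_per_s = 6.68                   # eval speed from llama_print_timings
implied_bandwidth = tokens_per_s * model_bytes / 1e9
print(f"implied bandwidth ~= {implied_bandwidth:.0f} GB/s")  # ~26 GB/s
# At ~26 GB/s the DRAM channel is already saturated during decode, which is
# why more threads (nt=4/8) do not improve eval time.
```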
