
How to Fully Utilize the Optimized Performance of T-MAC? #30

Open
ma-hang opened this issue Aug 28, 2024 · 2 comments
Labels: question (Further information is requested)

Comments


ma-hang commented Aug 28, 2024

I followed the documentation to run the llama-2-7b model (4-bit quantized) and also ran it on llama.cpp for comparison. Except for nt=1, where there was a slight performance improvement, performance with nt=4/8 was actually worse than with llama.cpp. The command and parameters used were: python tools/run_pipeline.py -o $HOME/tmactest/T-MAC/model2 -m llama-2-7b-4bit -nt 1. It's also worth mentioning that while there was a significant performance improvement during the prefill phase, there was no such improvement during the decode phase.

Output sample:
python tools/run_pipeline.py -o $HOME/tmactest/T-MAC/model2 -m llama-2-7b-4bit -nt 4
Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. Home to some of the m>
Microsoft Office 365 (MSO365) is the world’s most popular office suite, used by more than 180 million users. Microsoft Office 365 is a cloud-based su>
Microsoft Office 365 is a cloud-based suite of productivity applications that includes Microsoft Office, Exchange, SharePoint, and Skype for Business>
llama_print_timings: load time = 600.02 ms
llama_print_timings: sample time = 3.13 ms / 128 runs ( 0.02 ms per token, 40920.72 tokens per second)
llama_print_timings: prompt eval time = 1213.28 ms / 24 tokens ( 50.55 ms per token, 19.78 tokens per second)
llama_print_timings: eval time = 19007.30 ms / 127 runs ( 149.66 ms per token, 6.68 tokens per second)
llama_print_timings: total time = 20237.61 ms / 151 tokens
Log end

kaleid-liner (Collaborator) commented Aug 28, 2024

Are you using a device that is included in our profiling? If not, could you share the specifics of your platform? Based on our observations on some older-generation devices (particularly AVX2 CPUs), there are several potential causes:

  1. Restricted memory bandwidth: If the memory bandwidth of the platform being tested is extremely low (for instance, 10~30 GB/s), inference will be completely memory-bandwidth bound. This scenario can occur on older PCs equipped with 1~2 channels of DDR4/DDR5 memory.

  2. On Intel CPUs prior to Ice Lake, the pshuf instruction has twice the CPI, i.e., half the throughput (see here), which can hurt T-MAC's performance.

However, modern edge devices come with increasingly high memory bandwidth: up to 74 GB/s for phones equipped with the Snapdragon 8 Gen 3, 135 GB/s for laptops with the Snapdragon X Elite, and even 800 GB/s for the M2 Ultra. Moreover, even on the older-generation devices mentioned above, T-MAC should still offer a significant speedup for 2-bit models.
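
For reference, here is a minimal back-of-the-envelope sketch (not part of T-MAC; the ~4.5 bits-per-weight figure is an assumption that folds in quantization scales/metadata) of the decode-throughput ceiling imposed by memory bandwidth when every token has to stream the full weight set from DRAM:

```python
# Rough roofline estimate for decode throughput when memory-bandwidth bound.
# Assumption (not from T-MAC itself): each decoded token streams essentially
# all quantized weights from DRAM once, so tokens/s ~= bandwidth / model bytes.

def decode_tokens_per_second(bandwidth_gbps: float,
                             n_params: float = 7e9,
                             bits_per_weight: float = 4.5) -> float:
    """Upper bound on decode speed for a weight-streaming-bound workload.

    bits_per_weight includes quantization metadata (scales/zeros), so a
    "4-bit" 7B model is modeled here as ~4.5 bits per weight on average.
    """
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / model_bytes

# Example: a dual-channel DDR4 desktop (~25 GB/s) vs. a Snapdragon X Elite laptop (~135 GB/s)
for bw in (25, 135):
    print(f"{bw:>3} GB/s -> ~{decode_tokens_per_second(bw):.1f} tokens/s ceiling")
```

If both T-MAC and llama.cpp already decode near this ceiling, adding threads (nt=4/8) cannot help, because compute is not the limiting factor.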

kaleid-liner (Collaborator) commented Aug 28, 2024

It's also worth mentioning that while there was a significant performance improvement during the prefill phase, there was no such improvement during the decode phase

This is a clear sign that decoding is bottlenecked by memory bandwidth.
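
As a rough sanity check (again an assumption-laden estimate, using the same hypothetical ~4.5 bits per weight as in the sketch above rather than anything measured from T-MAC), the reported eval speed of 6.68 tokens per second implies an effective bandwidth of roughly 26 GB/s, squarely in the low-bandwidth scenario described earlier:

```python
# Sanity check against the numbers reported above (hypothetical arithmetic,
# not part of T-MAC): if decode reaches 6.68 tokens/s and each token has to
# stream roughly the whole 4-bit llama-2-7b weight set (~3.9 GB including
# quantization metadata), the implied effective memory bandwidth is:
model_bytes = 7e9 * 4.5 / 8           # ~3.9 GB, assumed average bits/weight
tokens_per_s = 6.68                   # eval speed from llama_print_timings
implied_bandwidth = tokens_per_s * model_bytes / 1e9
print(f"implied bandwidth ~= {implied_bandwidth:.0f} GB/s")  # ~26 GB/s
# At ~26 GB/s the DRAM channel is already saturated during decode, which is
# why more threads (nt=4/8) do not improve eval time.
```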
