I followed the documentation to run the llama2-7b model (4-bit quantized) and also ran it on llama.cpp for comparison. I noticed that, except for nt=1, where there was a slight performance improvement, the performance with nt=4/8 was actually worse than with llama.cpp. The command and parameters used were: python tools/run_pipeline.py -o $HOME/tmactest/T-MAC/model2 -m llama-2-7b-4bit -nt 1. It's also worth mentioning that while there was a significant performance improvement during the prefill phase, there was no such improvement during the decode phase.
Output sample:
python tools/run_pipeline.py -o $HOME/tmactest/T-MAC/model2 -m llama-2-7b-4bit -nt 4
Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. Home to some of the m>Microsoft Office 365 (MSO365) is the world’s most popular office suite, used by more than 180 million users. Microsoft Office 365 is a cloud-based su>Microsoft Office 365 is a cloud-based suite of productivity applications that includes Microsoft Office, Exchange, SharePoint, and Skype for Business>
llama_print_timings: load time = 600.02 ms
llama_print_timings: sample time = 3.13 ms / 128 runs ( 0.02 ms per token, 40920.72 tokens per second)
llama_print_timings: prompt eval time = 1213.28 ms / 24 tokens ( 50.55 ms per token, 19.78 tokens per second)
llama_print_timings: eval time = 19007.30 ms / 127 runs ( 149.66 ms per token, 6.68 tokens per second)
llama_print_timings: total time = 20237.61 ms / 151 tokens
Log end
Are you using devices that are included in our profiling? If not, could you share the specifics of your platform? Based on our observations on some old generation devices (particularly AVX2 CPUs), there could be several potential causes:
Restricted memory bandwidth: If the memory bandwidth of the platform being tested is extremely low (for instance, 10~30 GB/s), the inference will be completely memory bottlenecked. This scenario can occur on older PCs equipped with 1~2 channel DDR4/DDR5 memory.
Slower shuffle instruction: on Intel CPUs prior to Ice Lake, the pshuf instruction has twice the CPI, i.e. half the throughput (see here), which could harm the performance of T-MAC.
However, modern edge devices come with ever higher memory bandwidth: up to 74 GB/s for mobile phones equipped with Snapdragon 8 Gen 3, 135 GB/s for laptops equipped with Snapdragon X Elite, and even 800 GB/s for M2 Ultra. Moreover, even on the older-generation devices mentioned above, T-MAC should still offer a significant speedup for 2-bit.
It's also worth mentioning that while there was a significant performance improvement during the prefill phase, there was no such improvement during the decode phase
This is a clear sign that decoding is bottlenecked by memory bandwidth.
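The memory-bandwidth argument can be checked with a rough back-of-the-envelope estimate. The sketch below is not part of T-MAC or llama.cpp; the function names are hypothetical, and it assumes that every decoded token streams all weights from DRAM once, that the model has a nominal 7e9 parameters, and that quantization metadata (scales, zero points) and the KV cache can be ignored. Under those assumptions it shows what bandwidth the reported decode speed implies, and what decode ceiling a given bandwidth allows.

```python
# Rough estimate for memory-bound decoding (illustrative sketch only).
# Assumption: each generated token reads every weight from DRAM exactly once,
# so tokens/s <= memory bandwidth / weight size.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in bytes, ignoring scales and zero points."""
    return n_params * bits_per_weight / 8


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, n_params: float,
                                bits_per_weight: float) -> float:
    """Upper bound on decode speed if weight traffic is the only limit."""
    return bandwidth_gb_s * 1e9 / weight_bytes(n_params, bits_per_weight)


def implied_bandwidth_gb_s(tokens_per_s: float, n_params: float,
                           bits_per_weight: float) -> float:
    """Effective bandwidth (GB/s) needed to sustain an observed decode speed."""
    return tokens_per_s * weight_bytes(n_params, bits_per_weight) / 1e9


if __name__ == "__main__":
    N_PARAMS = 7e9   # nominal parameter count for a 7B model (assumption)
    OBSERVED = 6.68  # tokens per second from the eval line in the log above

    bw = implied_bandwidth_gb_s(OBSERVED, N_PARAMS, 4)
    print(f"implied effective bandwidth at 4-bit: {bw:.1f} GB/s")

    # Bandwidth figures quoted in the reply above.
    for gb_s in (10, 30, 74, 135, 800):
        for bits in (4, 2):
            ceiling = decode_ceiling_tokens_per_s(gb_s, N_PARAMS, bits)
            print(f"{gb_s:>4} GB/s, {bits}-bit: <= {ceiling:.1f} tokens/s")
```

With these assumptions, the reported 6.68 tokens per second corresponds to roughly 23 GB/s of effective weight traffic, which falls inside the 10~30 GB/s range described above. The same estimate shows why 2-bit can still help on such machines: halving the weight size doubles the bandwidth-limited ceiling.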