Why is there no difference in the E2E performance of T-MAC and llama.cpp on arm machine? #61

Open
ppp-max opened this issue Oct 11, 2024 · 3 comments

Comments

@ppp-max

ppp-max commented Oct 11, 2024

I used an ARM machine to test end-to-end performance, but the results do not match those reported in the paper. The measured numbers for llama.cpp and T-MAC are nearly the same. I've posted the measured data below.
Image
Image
The machine's clock frequency is 2.5 GHz, and its bandwidth is 680 G/s per core.

@kaleid-liner
Collaborator

Is 680 G/s the memory bandwidth? That seems invalid. You also didn't post the llama.cpp data. It would be more helpful if you provided the model architecture, whether it is 4-bit or 2-bit, and the device name.

@ppp-max
Author

ppp-max commented Oct 14, 2024

Sorry, I pasted the wrong data. Here is the llama.cpp data, measured with the bitnet_b1_58-3B model and 4 threads.
Image
Image
I then tested Llama-2-7b-EfficientQAT-w2g128-GPTQ and Llama-2-7b-EfficientQAT-w4g128-GPTQ, which gave the same result: there is no difference in E2E performance between T-MAC and llama.cpp.
I also recomputed the bandwidth of this machine, which is 340 G/s. Sorry about that.
I look forward to your reply. Thanks.

@QingtaoLi1
Contributor

QingtaoLi1 commented Nov 19, 2024

@ppp-max Your speed is quite low, while the memory bandwidth is strangely high. Could you confirm whether 340 is in Gbit/s or GB/s? The speed you report is close to what we get on our Raspberry Pi, whose memory bandwidth is only about 48 GB/s. Also, do you see an obvious speed gap between T-MAC and llama.cpp when using a single thread? If so, we would suspect that the 4-thread case is memory-bound, as in the roofline model shown on our main page.
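
For a rough sanity check of the memory-bound hypothesis, here is a minimal back-of-the-envelope sketch. It is not taken from T-MAC or llama.cpp. It assumes that decoding one token streams every weight from memory exactly once and ignores the KV cache, activations, and any cache reuse, so it only gives an optimistic upper bound. The function names, the parameter count of 3e9 for a "3B" model, and the 2 bits per weight are illustrative assumptions, not measured values.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a bandwidth quoted in Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8.0


def memory_bound_tokens_per_s(bandwidth_gb_per_s: float,
                              n_params: float,
                              bits_per_weight: float) -> float:
    """Upper bound on decode speed if each token reads all weights once.

    Assumption: weight traffic dominates; KV cache and activations ignored.
    """
    model_bytes = n_params * bits_per_weight / 8.0
    return bandwidth_gb_per_s * 1e9 / model_bytes


def roofline(peak_ops_per_s: float,
             bandwidth_bytes_per_s: float,
             ops_per_byte: float) -> float:
    """Classic roofline: attainable throughput is capped by compute or memory."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


if __name__ == "__main__":
    # Hypothetical inputs: ~3e9 parameters at 2 bits per weight.
    for label, bw in [("340 read as GB/s", 340.0),
                      ("340 read as Gbit/s", gbit_to_gbyte(340.0))]:
        bound = memory_bound_tokens_per_s(bw, n_params=3e9, bits_per_weight=2)
        print(f"{label}: at most {bound:.1f} tokens/s")
```

If the measured tokens/s sits far below the bound for the bandwidth you believe the machine has, then either the bandwidth figure is in different units than assumed or the run is limited by something other than memory. Comparing the 1-thread and 4-thread numbers against this bound is one way to tell the two cases apart.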
