
8gen3 T-MAC cpu performance issue #32

Open
AndreaChiChengdu opened this issue Aug 29, 2024 · 9 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments


AndreaChiChengdu commented Aug 29, 2024

Hi there, I am using an 8 Gen 3 device (Xiaomi 14 Pro, 68 GB/s memory bandwidth) and following the Android Cross Compilation Guidance, Option 1: Use Prebuilt Kernels, to test llama-2-7b-4bit token generation performance.
It looks like the T-MAC CPU performance is worse than the NPU's. Where can I optimize?
Thanks

P.S.
1. The phone battery is above 80% and high performance mode is enabled. The phone's Geekbench/Ludashi benchmark scores are within the normal 8 Gen 3 range.
2. Command: python tools/run_pipeline.py -o ~/andreaji/condatmac/T-MAC/3rdparty/llama.cpp/Llama-2-7b-EfficientQAT-w4g128-GPTQ -m llama-2-7b-4bit -d android -ndk $NDK_HOME -u
3. My only change in run_pipeline.py is extending the prompt from 24 tokens to 256 tokens.

| Framework | Model | NUM_THREADS | Throughput (tokens/sec) |
| --- | --- | --- | --- |
| T-MAC (CPU) | llama-2-7b (W4) | 2 | 4.46 at -n 128 (my data) |
| T-MAC (CPU) | llama-2-7b (W4) | 4 | 6.61~8.2 at -n 128 (my data) |
| NPE (NPU) | llama-2-7b (W4) | - | 11.3 on Qualcomm AI Hub (close to the X Elite's 10.3) |
[Screenshot from 2024-08-29 16-07-35]

@AndreaChiChengdu changed the title from "8gen3 T-MAC cpu performance gap with README (compare to npu)" to "8gen3 T-MAC cpu performance issue" on Aug 29, 2024

kaleid-liner commented Aug 29, 2024

This is the data we profiled on OnePlus 12 (Snapdragon 8 GEN 3) with high performance mode.

| Model (threads) | T-MAC (tokens/sec) | llama.cpp (tokens/sec) | NPU, claimed (tokens/sec) |
| --- | --- | --- | --- |
| llama-2-7b-2bit (NT=1) | 8.05 | 3.16 | |
| llama-2-7b-2bit (NT=2) | 10.00 | 3.76 | |
| llama-2-7b-2bit (NT=3) | 13.76 | 5.43 | |
| llama-2-7b-2bit (NT=4) | 16.62 | 6.95 | |
| llama-2-7b-4bit (NT=1) | 4.43 | 3.44 | 11.3 |
| llama-2-7b-4bit (NT=2) | 5.82 | 4.67 | 11.3 |
| llama-2-7b-4bit (NT=3) | 8.20 | 6.66 | 11.3 |
| llama-2-7b-4bit (NT=4) | 10.19 | 8.24 | 11.3 |

The 8 GEN 3 is more complex than the X Elite due to its big.LITTLE architecture. The CPU frequency and achieved memory bandwidth differ between the big core and the little cores, and most importantly, the CPI (clocks per instruction) of LUT or FMA instructions varies between big cores and little cores.

Meanwhile, the task scheduling of the llama.cpp threadpool is suboptimal. It assigns the same amount of computation to each core, so it fails to fully utilize the big core under multi-threading. We are currently conducting some low-level profiling and will resolve this issue.
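For readers unfamiliar with the scheduling issue, here is a rough sketch (illustrative names only, not the actual llama.cpp threadpool code) of the difference between the current even row split and a hypothetical capability-aware split:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RowRange { int64_t begin, end; };

// Even split: each of n_threads cores gets nrows / n_threads rows,
// so the big core finishes early and busy-waits for the medium/little cores.
RowRange even_split(int64_t nrows, int ith, int n_threads) {
    int64_t per_thread = (nrows + n_threads - 1) / n_threads;
    int64_t begin = ith * per_thread;
    int64_t end   = std::min(nrows, begin + per_thread);
    return {begin, end};
}

// Capability-aware split (hypothetical): weight each thread by its measured
// throughput, e.g. {2.0, 1.0, 1.0, 1.0} if the big core is ~2x faster.
RowRange weighted_split(int64_t nrows, int ith, const std::vector<double>& weights) {
    double total = 0.0, before = 0.0;
    for (double w : weights) total += w;
    for (int i = 0; i < ith; ++i) before += weights[i];
    int64_t begin = static_cast<int64_t>(nrows * before / total);
    int64_t end   = static_cast<int64_t>(nrows * (before + weights[ith]) / total);
    return {begin, end};
}
```

With weights like {2.0, 1.0, 1.0, 1.0}, the big core would receive roughly twice as many rows and all threads would finish at about the same time.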

@kaleid-liner kaleid-liner added the enhancement New feature or request label Aug 29, 2024
@AndreaChiChengdu
Author


Thank you very much, the data is very useful to me.
I found that this repo is based on b2854, and the latest version of llama.cpp uses OpenMP to accelerate multi-threaded parallelism, which looks useful.
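As a hedged illustration only (not the actual llama.cpp OpenMP integration), dynamic work distribution is one way a faster core can keep pulling work instead of idling on a fixed equal share:

```cpp
#include <omp.h>
#include <cstdint>

// Illustrative sketch: with a dynamic schedule, the big core keeps grabbing new
// 16-row chunks while the medium cores are still busy, instead of being pinned
// to a fixed, equal slice of rows.
void mul_mat_rows(int64_t nrows) {
    #pragma omp parallel for schedule(dynamic, 16)
    for (int64_t row = 0; row < nrows; ++row) {
        // process_row(row);  // placeholder for the per-row mpGEMM work
    }
}
```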

@kaleid-liner
Collaborator

the latest version of llama.cpp uses OpenMP to accelerate multi-threaded parallelism

Thanks for the info. We are working on merging the latest llama.cpp.

@kaleid-liner
Collaborator

@AndreaChiChengdu I've added the updated 2-bit T-MAC data to the table above (the previous numbers suffered from profiling issues, including overheating and interference from tvmrpc_release.apk). All other results have been successfully reproduced and are as expected. The speedup of 2-bit T-MAC over 4-bit T-MAC is now as anticipated (i.e., a 2x speedup). The remaining issue is thread scheduling on Android. I'll address this by merging the latest llama.cpp's OpenMP support.

@kaleid-liner kaleid-liner added the good first issue Good for newcomers label Sep 2, 2024
@kaleid-liner
Collaborator

To provide more details, here is some low-level profiling on an Android 8 GEN 3 device. We output the elapsed time of each T-MAC mpGEMM kernel on each thread; the unit is microseconds (us):

ith elapsed 0: 95
ith elapsed 3: 160
ith elapsed 2: 161
ith elapsed 1: 160
ith elapsed 0: 84
ith elapsed 2: 161
ith elapsed 3: 162
ith elapsed 1: 162
ith elapsed 0: 207
ith elapsed 3: 430
ith elapsed 1: 431
ith elapsed 2: 431

We can clearly observe that the main thread (on the big core) needs only ~1/2 the latency of the medium cores. The ith=0 thread then busy-waits for the other cores to complete. Hopefully this issue will be resolved in the latest llama.cpp.
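For reference, a minimal sketch of how per-thread numbers like the ones above could be collected (run_mpgemm_kernel is a placeholder name, not the actual T-MAC symbol):

```cpp
#include <chrono>
#include <cstdio>

// Wrap the mpGEMM kernel call with a steady_clock timer and print the thread
// index (ith) together with the elapsed microseconds.
void timed_kernel(int ith /*, kernel args */) {
    auto t0 = std::chrono::steady_clock::now();
    // run_mpgemm_kernel(...);  // the T-MAC LUT-based mpGEMM call would go here
    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    printf("ith elapsed %d: %lld\n", ith, static_cast<long long>(us));
}
```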


AndreaChiChengdu commented Sep 3, 2024

To provide more details, here is some low-level profiling on an Android 8 GEN 3 device. We output the elapsed time of each T-MAC mpGEMM kernel on each thread; the unit is microseconds (us):

ith elapsed 0: 95
ith elapsed 3: 160
ith elapsed 2: 161
ith elapsed 1: 160
ith elapsed 0: 84
ith elapsed 2: 161
ith elapsed 3: 162
ith elapsed 1: 162
ith elapsed 0: 207
ith elapsed 3: 430
ith elapsed 1: 431
ith elapsed 2: 431

We can clearly observe that the main thread (on the big core) needs only ~1/2 the latency of the medium cores. The ith=0 thread then busy-waits for the other cores to complete. Hopefully this issue will be resolved in the latest llama.cpp.

@kaleid-liner Yes, I found the same problem. In addition, the paper mentions 3-bit, but the current engineering practice is mainly 2-bit and 4-bit. What is the situation with 3-bit, and which Q3 variant of GGUF is the comparison baseline?
Thanks!

@kaleid-liner
Collaborator

@AndreaChiChengdu The lack of 3-bit in common practice, from our insights, is due to the technical difficulty of packing 3-bit values into bytes and decoding them efficiently. Most model developers also assume 3-bit is not good for inference, so they won't try 3-bit at all. However, T-MAC has solved this problem with bit-wise lookup and can achieve linear speedup for 3-bit. EfficientQAT already provides 3-bit models, and the tradeoff between accuracy and model size is pretty good.

What is the situation with 3-bit, and which Q3 variant of GGUF is the comparison baseline?

The baseline is llama.cpp Q3_K, which is the fastest 3-bit implementation (at least in the version we use, from about 2 months ago).
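As a side note on the bit-wise lookup mentioned above, here is a minimal scalar sketch (not the real SIMD LUT kernel) of why the cost scales roughly linearly with bit width: a 3-bit weight is split into three 1-bit planes, each plane reuses the same 1-bit lookup path, and the per-plane partial sums are recombined with weights 1, 2, 4:

```cpp
#include <cstdint>
#include <vector>

// Split w-bit weights (stored one per byte here, for clarity) into w one-bit planes.
std::vector<std::vector<uint8_t>> to_bit_planes(const std::vector<uint8_t>& weights, int bits = 3) {
    std::vector<std::vector<uint8_t>> planes(bits, std::vector<uint8_t>(weights.size()));
    for (size_t i = 0; i < weights.size(); ++i)
        for (int b = 0; b < bits; ++b)
            planes[b][i] = (weights[i] >> b) & 1;  // plane b holds bit b of every weight
    return planes;
}

// Recombine the per-plane partial sums: sum over b of (2^b * partial[b]).
int32_t combine_planes(const std::vector<int32_t>& partial) {
    int32_t acc = 0;
    for (size_t b = 0; b < partial.size(); ++b)
        acc += partial[b] << b;
    return acc;
}
```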

@AndreaChiChengdu
Author

@AndreaChiChengdu The lack of 3-bit in common practice, from our insights, is due to the technical difficulty of packing 3-bit values into bytes and decoding them efficiently. Most model developers also assume 3-bit is not good for inference, so they won't try 3-bit at all. However, T-MAC has solved this problem with bit-wise lookup and can achieve linear speedup for 3-bit. EfficientQAT already provides 3-bit models, and the tradeoff between accuracy and model size is pretty good.

What is the situation with 3-bit, and which Q3 variant of GGUF is the comparison baseline?

The baseline is llama.cpp Q3_K, which is the fastest 3-bit implementation (at least in the version we use, from about 2 months ago).

[Screenshot from 2024-09-03 16-04-53]
Thanks, but in this project right now, it looks like -m does not support llama-2-7b-3bit.

@kaleid-liner
Collaborator

@AndreaChiChengdu Yes, because our integration supports most models through the GPTQ format, which currently doesn't provide a 3-bit format. We just need a standardized 3-bit packing format. Maybe I can try the EfficientQAT 3-bit format.
