
8gen3 T-MAC cpu performance issue #32

Open
AndreaChiChengdu opened this issue Aug 29, 2024 · 9 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments


AndreaChiChengdu commented Aug 29, 2024

Hi there, I am using an 8 Gen 3 device (Xiaomi 14 Pro, 68 GB/s memory bandwidth) and following the Android Cross Compilation Guidance, Option 1: Use Prebuilt Kernels, to test llama-2-7b-4bit token generation performance.
It looks like the T-MAC CPU performance is worse than the NPU's. Where can I optimize?
Thanks

P.S.
1. The phone battery is above 80% and high performance mode is enabled. The phone's Geekbench/Ludashi benchmark scores are within the normal 8 Gen 3 range.
2. Command: python tools/run_pipeline.py -o ~/andreaji/condatmac/T-MAC/3rdparty/llama.cpp/Llama-2-7b-EfficientQAT-w4g128-GPTQ -m llama-2-7b-4bit -d android -ndk $NDK_HOME -u
3. My only change in run_pipeline.py is extending the prompt from 24 tokens to 256 tokens.

| Framework | Model | NUM_THREADS | Throughput (tokens/sec) |
| --- | --- | --- | --- |
| T-MAC (CPU) | llama-2-7b (W4) | 2 | 4.46 at -n 128 (my data) |
| T-MAC (CPU) | llama-2-7b (W4) | 4 | 6.61~8.2 at -n 128 (my data) |
| NPE (NPU) | llama-2-7b (W4) | - | 11.3 on Qualcomm AI Hub (close to the X Elite's 10.3) |
[Screenshot from 2024-08-29 16-07-35]

@AndreaChiChengdu changed the title from "8gen3 T-MAC cpu performance gap with README (compare to npu)" to "8gen3 T-MAC cpu performance issue" on Aug 29, 2024

kaleid-liner commented Aug 29, 2024

This is the data we profiled on OnePlus 12 (Snapdragon 8 GEN 3) with high performance mode.

| Model (threads) | T-MAC (tokens/sec) | llama.cpp (tokens/sec) | NPU, claimed (tokens/sec) |
| --- | --- | --- | --- |
| llama-2-7b-2bit (NT=1) | 8.05 | 3.16 | |
| llama-2-7b-2bit (NT=2) | 10.00 | 3.76 | |
| llama-2-7b-2bit (NT=3) | 13.76 | 5.43 | |
| llama-2-7b-2bit (NT=4) | 16.62 | 6.95 | |
| llama-2-7b-4bit (NT=1) | 4.43 | 3.44 | 11.3 |
| llama-2-7b-4bit (NT=2) | 5.82 | 4.67 | 11.3 |
| llama-2-7b-4bit (NT=3) | 8.20 | 6.66 | 11.3 |
| llama-2-7b-4bit (NT=4) | 10.19 | 8.24 | 11.3 |

The 8 GEN 3 is more complex than the X Elite due to its big.LITTLE architecture. The CPU frequency and achieved memory bandwidth differ between the big core and the little cores, and most importantly, the CPI (clocks per instruction) of LUT or FMA instructions varies between big cores and little cores.

Meanwhile, the task scheduling of the llama.cpp threadpool is suboptimal. It assigns the same amount of computation to each core, so it fails to fully utilize the big core under multi-threading. We are currently conducting some low-level profiling and will resolve this issue.
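For readers unfamiliar with the scheduling issue, here is a rough sketch (illustrative names only, not the actual llama.cpp threadpool code) of the difference between the current even row split and a hypothetical capability-aware split:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RowRange { int64_t begin, end; };

// Even split: each of n_threads cores gets nrows / n_threads rows,
// so the big core finishes early and busy-waits for the medium/little cores.
RowRange even_split(int64_t nrows, int ith, int n_threads) {
    int64_t per_thread = (nrows + n_threads - 1) / n_threads;
    int64_t begin = ith * per_thread;
    int64_t end   = std::min(nrows, begin + per_thread);
    return {begin, end};
}

// Capability-aware split (hypothetical): weight each thread by its measured
// throughput, e.g. {2.0, 1.0, 1.0, 1.0} if the big core is ~2x faster.
RowRange weighted_split(int64_t nrows, int ith, const std::vector<double>& weights) {
    double total = 0.0, before = 0.0;
    for (double w : weights) total += w;
    for (int i = 0; i < ith; ++i) before += weights[i];
    int64_t begin = static_cast<int64_t>(nrows * before / total);
    int64_t end   = static_cast<int64_t>(nrows * (before + weights[ith]) / total);
    return {begin, end};
}
```

With weights like {2.0, 1.0, 1.0, 1.0}, the big core would receive roughly twice as many rows and all threads would finish at about the same time.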

@kaleid-liner kaleid-liner added the enhancement New feature or request label Aug 29, 2024
@AndreaChiChengdu
Author


Thank you very much, the data is very useful to me.
I found that this repo is based on b2854, and the latest version of llama.cpp uses OpenMP to accelerate multi-threaded parallelism, which looks useful.
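As a hedged illustration only (not the actual llama.cpp OpenMP integration), dynamic work distribution is one way a faster core can keep pulling work instead of idling on a fixed equal share:

```cpp
#include <omp.h>
#include <cstdint>

// Illustrative sketch: with a dynamic schedule, the big core keeps grabbing new
// 16-row chunks while the medium cores are still busy, instead of being pinned
// to a fixed, equal slice of rows.
void mul_mat_rows(int64_t nrows) {
    #pragma omp parallel for schedule(dynamic, 16)
    for (int64_t row = 0; row < nrows; ++row) {
        // process_row(row);  // placeholder for the per-row mpGEMM work
    }
}
```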

@kaleid-liner
Collaborator

the latest version of llama.cpp uses OpenMP to accelerate multi-threaded parallelism

Thanks for the info. We are working on merging the latest llama.cpp.

@kaleid-liner
Collaborator

@AndreaChiChengdu I've added the updated 2-bit T-MAC data to the table above (the previous numbers suffered from profiling issues, including overheating and interference from tvmrpc_release.apk). All other results have been successfully reproduced and are as expected. The speedup of 2-bit T-MAC over 4-bit T-MAC is now as anticipated (i.e., a 2x speedup). The remaining issue is thread scheduling on Android. I'll address this by merging the latest llama.cpp's OpenMP support.

@kaleid-liner kaleid-liner added the good first issue Good for newcomers label Sep 2, 2024
@kaleid-liner
Collaborator

To provide more details, here is some low-level profiling on an Android 8 GEN 3 device. We output the elapsed time of each T-MAC mpGEMM kernel on each thread; the unit is microseconds (us):

ith elapsed 0: 95
ith elapsed 3: 160
ith elapsed 2: 161
ith elapsed 1: 160
ith elapsed 0: 84
ith elapsed 2: 161
ith elapsed 3: 162
ith elapsed 1: 162
ith elapsed 0: 207
ith elapsed 3: 430
ith elapsed 1: 431
ith elapsed 2: 431

We can clearly observe that the main thread (on the big core) needs only ~1/2 the latency of the medium cores. The ith=0 thread then busy-waits for the other cores to complete. Hopefully this issue will be resolved in the latest llama.cpp.
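For reference, a minimal sketch of how per-thread numbers like the ones above could be collected (run_mpgemm_kernel is a placeholder name, not the actual T-MAC symbol):

```cpp
#include <chrono>
#include <cstdio>

// Wrap the mpGEMM kernel call with a steady_clock timer and print the thread
// index (ith) together with the elapsed microseconds.
void timed_kernel(int ith /*, kernel args */) {
    auto t0 = std::chrono::steady_clock::now();
    // run_mpgemm_kernel(...);  // the T-MAC LUT-based mpGEMM call would go here
    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    printf("ith elapsed %d: %lld\n", ith, static_cast<long long>(us));
}
```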


AndreaChiChengdu commented Sep 3, 2024

To provide more details, here is some low-level profiling on an Android 8 GEN 3 device. We output the elapsed time of each T-MAC mpGEMM kernel on each thread; the unit is microseconds (us):

ith elapsed 0: 95
ith elapsed 3: 160
ith elapsed 2: 161
ith elapsed 1: 160
ith elapsed 0: 84
ith elapsed 2: 161
ith elapsed 3: 162
ith elapsed 1: 162
ith elapsed 0: 207
ith elapsed 3: 430
ith elapsed 1: 431
ith elapsed 2: 431

We can clearly observe that the main thread (on the big core) needs only ~1/2 the latency of the medium cores. The ith=0 thread then busy-waits for the other cores to complete. Hopefully this issue will be resolved in the latest llama.cpp.

@kaleid-liner Yes, I found the same problem. In addition, the paper mentions 3-bit, but the current engineering practice is mainly 2-bit and 4-bit. What is the situation with 3-bit, and which Q3 variant of GGUF is the comparison baseline?
Thanks!

@kaleid-liner
Collaborator

@AndreaChiChengdu The lack of 3-bit in common practice, from our insights, is due to the technical difficulty of packing 3-bit values into bytes and decoding them efficiently. Most model developers also assume 3-bit is not good for inference, so they won't try 3-bit at all. However, T-MAC has solved this problem with bit-wise lookup and can achieve linear speedup for 3-bit. EfficientQAT already provides 3-bit models, and the tradeoff between accuracy and model size is pretty good.

What is the situation with 3-bit, and which Q3 variant of GGUF is the comparison baseline?

The baseline is llama.cpp Q3_K, which is the fastest 3-bit implementation (at least in the version we use, from about 2 months ago).
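As a side note on the bit-wise lookup mentioned above, here is a minimal scalar sketch (not the real SIMD LUT kernel) of why the cost scales roughly linearly with bit width: a 3-bit weight is split into three 1-bit planes, each plane reuses the same 1-bit lookup path, and the per-plane partial sums are recombined with weights 1, 2, 4:

```cpp
#include <cstdint>
#include <vector>

// Split w-bit weights (stored one per byte here, for clarity) into w one-bit planes.
std::vector<std::vector<uint8_t>> to_bit_planes(const std::vector<uint8_t>& weights, int bits = 3) {
    std::vector<std::vector<uint8_t>> planes(bits, std::vector<uint8_t>(weights.size()));
    for (size_t i = 0; i < weights.size(); ++i)
        for (int b = 0; b < bits; ++b)
            planes[b][i] = (weights[i] >> b) & 1;  // plane b holds bit b of every weight
    return planes;
}

// Recombine the per-plane partial sums: sum over b of (2^b * partial[b]).
int32_t combine_planes(const std::vector<int32_t>& partial) {
    int32_t acc = 0;
    for (size_t b = 0; b < partial.size(); ++b)
        acc += partial[b] << b;
    return acc;
}
```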

@AndreaChiChengdu
Author

@AndreaChiChengdu The lack of 3-bit in common practice, from our insights, is due to the technical difficulty of packing 3-bit values into bytes and decoding them efficiently. Most model developers also assume 3-bit is not good for inference, so they won't try 3-bit at all. However, T-MAC has solved this problem with bit-wise lookup and can achieve linear speedup for 3-bit. EfficientQAT already provides 3-bit models, and the tradeoff between accuracy and model size is pretty good.

What is the situation with 3-bit, and which Q3 variant of GGUF is the comparison baseline?

The baseline is llama.cpp Q3_K, which is the fastest 3-bit implementation (at least in the version we use, from about 2 months ago).

[Screenshot from 2024-09-03 16-04-53]
Thanks, but in this project right now, it looks like -m does not support llama-2-7b-3bit.

@kaleid-liner
Collaborator

@AndreaChiChengdu Yes, because our integration supports most models through the GPTQ format, which currently doesn't provide a 3-bit format. We just need a standardized 3-bit packing format. Maybe I can try the EfficientQAT 3-bit format.
