8 Gen 3 T-MAC CPU performance issue #32
This is the data we profiled on a OnePlus 12 (Snapdragon 8 Gen 3) in high-performance mode.
The 8 Gen 3 is more complex than the X Elite due to its big.LITTLE architecture. The CPU frequency and achieved memory bandwidth differ between the big core and the little cores, and most importantly, the CPI (clocks per instruction) of LUT or FMA instructions varies between big cores and little cores. Meanwhile, the task scheduling of the llama.cpp threadpool is suboptimal: it assigns the same amount of computation to each core, so it fails to fully utilize the big core under multi-threading (see the sketch below). We are currently conducting some low-level profiling and will resolve this issue.
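To make the scheduling issue concrete, here is a minimal C++ sketch (not llama.cpp's actual threadpool code; all names and weights are illustrative) contrasting an equal static split with a hypothetical core-aware split:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Equal static split: every thread gets the same number of rows,
// regardless of how fast the core it runs on is.
void split_equal(int64_t n_rows, int n_threads, int ith,
                 int64_t* row_begin, int64_t* row_end) {
    int64_t per_thread = (n_rows + n_threads - 1) / n_threads;
    *row_begin = std::min<int64_t>(n_rows, (int64_t)ith * per_thread);
    *row_end   = std::min<int64_t>(n_rows, *row_begin + per_thread);
}

// Hypothetical core-aware split: weight each thread's share by the relative
// throughput of its core (weights are illustrative, not measured values).
void split_weighted(int64_t n_rows, const std::vector<double>& perf, int ith,
                    int64_t* row_begin, int64_t* row_end) {
    double total = 0.0, before = 0.0;
    for (double p : perf) total += p;
    for (int i = 0; i < ith; ++i) before += perf[i];
    *row_begin = (int64_t)(n_rows * (before / total));
    *row_end   = (int64_t)(n_rows * ((before + perf[ith]) / total));
}
```

For example, with `perf = {2.0, 1.0, 1.0, 1.0}` the big-core thread (`ith = 0`) receives 2/5 of the rows instead of 1/4, roughly matching its higher throughput.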
Thank you very much, the data is very useful to me.
Thanks for the info. We are working on merging the latest llama.cpp.
@AndreaChiChengdu I've added the updated 2-bit T-MAC data to the table above (due to some profiling issues last time, including overheating and interference from tvmrpc_release.apk). All other results have been successfully reproduced and are as expected. The speedup of 2-bit T-MAC over 4-bit T-MAC is now as anticipated (i.e., a 2x speedup). The remaining issue is thread scheduling on Android. I'll address this by merging the latest llama.cpp, which uses OpenMP.
To provide more details, here is some low-level profiling on an Android 8 Gen 3 device. We output the elapsed time of each T-MAC mpGEMM kernel on each thread, in µs:
We can clearly observe that the main thread (on the big core) only needs ~1/2 the latency of the medium cores. The thread with ith=0 then busy-waits for the other cores to complete. Hopefully this issue will be resolved in the latest llama.cpp (see the sketch below for one way dynamic scheduling avoids this).
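One common remedy, sketched below under stated assumptions (the chunk size and the kernel call are illustrative, not from llama.cpp), is dynamic chunked scheduling with a shared atomic counter, so faster cores simply claim more chunks instead of busy-waiting:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

constexpr int64_t CHUNK_ROWS = 16;      // tile granularity (illustrative)
std::atomic<int64_t> next_chunk{0};     // shared by all worker threads

void mpgemm_worker(int64_t n_rows) {
    for (;;) {
        // Each thread claims the next unprocessed chunk; a big core that
        // finishes quickly just comes back for more instead of idling.
        int64_t begin = next_chunk.fetch_add(CHUNK_ROWS,
                                             std::memory_order_relaxed);
        if (begin >= n_rows) break;     // all chunks claimed
        int64_t end = std::min(begin + CHUNK_ROWS, n_rows);
        (void)end;  // compute_rows(begin, end) would be the hypothetical kernel call
    }
}
```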
@kaleid-liner Yes, I found the same problem. In addition, the paper mentions 3-bit, but current engineering practice mainly uses 2-bit and 4-bit. What is the status of 3-bit, and which Q3 variant of GGUF is the comparison baseline?
@AndreaChiChengdu The lack of 3-bit in common practice, from our insights, is due to the technical difficulty of packing 3-bit values into bytes and decoding them efficiently. Most model developers also assume 3-bit is not good for inference, so they won't try 3-bit at all. However, T-MAC solves this problem with bit-wise lookup and can achieve linear speedup for 3-bit (see the sketch below). EfficientQAT already provides 3-bit models, and its tradeoff between accuracy and model size is quite good.
The baseline is llama.cpp Q3_K, which is the fastest 3-bit implementation (at least as of the version we use, from about 2 months ago).
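To illustrate why bit-wise lookup makes the cost linear in bit width, here is a toy C++ sketch (not T-MAC's actual LUT kernel; T-MAC replaces the inner 1-bit dot products with table lookups): an n-bit weight is the sum of its bit planes, so an n-bit mpGEMM decomposes into n one-bit passes with power-of-two scaling.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// y = W x for 3-bit unsigned weights (values 0..7), computed plane by plane.
std::vector<int32_t> gemv_3bit(const std::vector<uint8_t>& W,
                               const std::vector<int8_t>& x,
                               size_t rows, size_t cols) {
    std::vector<int32_t> y(rows, 0);
    for (int plane = 0; plane < 3; ++plane) {         // one pass per bit
        for (size_t r = 0; r < rows; ++r) {
            int32_t acc = 0;
            for (size_t c = 0; c < cols; ++c) {
                int bit = (W[r * cols + c] >> plane) & 1;
                acc += bit * x[c];                    // 1-bit dot product
            }                                         // (a table lookup in T-MAC)
            y[r] += acc << plane;                     // scale by 2^plane
        }
    }
    return y;
}
```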
@AndreaChiChengdu Yes, because our integration supports most models through the GPTQ format, which currently doesn't provide a 3-bit format. We just need a standardized 3-bit packing format (one possible layout is sketched below). Maybe I can try the EQAT 3-bit format.
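For illustration only, here is one possible 3-bit packing (not a standardized format): 8 weights fit exactly into 3 bytes since 8 × 3 = 24 bits, and the values straddle byte boundaries, which is exactly the decoding awkwardness mentioned above.

```cpp
#include <cstdint>

// Pack 8 3-bit weights (each 0..7) into 3 bytes.
void pack8_3bit(const uint8_t w[8], uint8_t out[3]) {
    uint32_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits |= (uint32_t)(w[i] & 0x7) << (3 * i);   // 24 bits total
    out[0] = bits & 0xFF;
    out[1] = (bits >> 8) & 0xFF;
    out[2] = (bits >> 16) & 0xFF;
}

// Unpack the i-th (0..7) 3-bit weight.
uint8_t unpack_3bit(const uint8_t in[3], int i) {
    uint32_t bits = in[0] | (in[1] << 8) | ((uint32_t)in[2] << 16);
    return (bits >> (3 * i)) & 0x7;                  // crosses byte boundaries
}
```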
Hi there, I am using an 8 Gen 3 (Xiaomi 14 Pro, 68 GB/s memory bandwidth) and following the Android Cross Compilation Guidance, Option 1: Use Prebuilt Kernels, to test llama-2-7b-4bit token generation performance.
It looks like the T-MAC CPU performance is worse than the NPU's. Where can I optimize?
Thanks.
P.S.
1. The phone battery is above 80% with high-performance mode enabled. The phone's Geekbench/Ludashi benchmark scores are within the normal 8 Gen 3 range.
2.cmd: python tools/run_pipeline.py -o ~/andreaji/condatmac/T-MAC/3rdparty/llama.cpp/Llama-2-7b-EfficientQAT-w4g128-GPTQ -m llama-2-7b-4bit -d android -ndk $NDK_HOME -u
3. My change in run_pipeline.py is increasing the prompt from 24 tokens to 256 tokens.
| Framework | Model | NUM_THREADS | Throughput (tokens/sec) |
| --- | --- | --- | --- |
| T-MAC (CPU) | llama-2-7b (W4) | 2 | 4.46 (my data, at `-n 128`) |
| T-MAC (CPU) | llama-2-7b (W4) | 4 | 6.61~8.2 (my data, at `-n 128`) |
| NPE (NPU) | llama-2-7b (W4) | - | 11.3 (Qualcomm AI Hub; near the X Elite's 10.3) |