What's Changed
⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New compile() API that allows torch to improve tps by ~4-8%. Flash attention may need to be disabled for some kernels.
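The speedup above comes from PyTorch's graph-compilation machinery. The exact GPTQModel compile() signature isn't shown in these notes, so as a general illustration of the underlying mechanism, here is a minimal torch.compile sketch (the "eager" debug backend is used only so the example runs anywhere; real deployments would use the default "inductor" backend or the aot_ts backend added in #1139):

```python
import torch

def mul_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for a model's forward pass.
    return x * y + y

# torch.compile traces the function and hands it to a backend;
# "eager" is a built-in no-op backend useful for illustration.
compiled = torch.compile(mul_add, backend="eager")

x = torch.ones(4)
y = torch.full((4,), 2.0)
out = compiled(x, y)  # same result as mul_add(x, y)
```

Because compilation happens lazily on the first call, the first invocation pays a one-time tracing cost; subsequent calls reuse the compiled graph.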
🐛 Fix HF Transformers bug that downcast the fast tokenizer class on save.
🐛 Fix inaccurate bpw calculations.
🐛 Fix ROCm compile with setup.py.
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() codes by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- suppress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] fix incorrectly saved the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4