What's Changed
⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New compile() API that allows torch to improve tps by ~4-8%. Flash attention may need to be disabled for some kernels.
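The speedup above comes from PyTorch's graph-compilation machinery. The exact GPTQModel compile() signature isn't shown in these notes, so as a general illustration of the underlying mechanism, here is a minimal torch.compile sketch (the "eager" debug backend is used only so the example runs anywhere; real deployments would use the default "inductor" backend or the aot_ts backend added in #1139):

```python
import torch

def mul_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for a model's forward pass.
    return x * y + y

# torch.compile traces the function and hands it to a backend;
# "eager" is a built-in no-op backend useful for illustration.
compiled = torch.compile(mul_add, backend="eager")

x = torch.ones(4)
y = torch.full((4,), 2.0)
out = compiled(x, y)  # same result as mul_add(x, y)
```

Because compilation happens lazily on the first call, the first invocation pays a one-time tracing cost; subsequent calls reuse the compiled graph.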
🐛 Fix HF Transformers bug that downcast the fast tokenizer class on save.
🐛 Fix inaccurate bpw calculations.
🐛 Fix ROCm compile with setup.py.
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() codes by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- suppress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] fix incorrectly saved the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4