
Releases: ModelCloud/GPTQModel

GPTQModel v1.7.4

26 Jan 07:02
b623b96

What's Changed

⚡ Faster weight packing for post-quantization model save.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New compile() API that lets torch.compile improve tps by ~4-8% (sketch below). flash_attention may need to be disabled for some kernels.
🐛 Fix HF Transformers bug that downcast the fast tokenizer class on save.
🐛 Fix inaccurate bpw calculations.
🐛 Fix ROCm compile with setup.py.
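
A minimal sketch of the new compile() API. The no-argument call and the model.tokenizer / model.device attributes are assumptions based on this release's notes, not a definitive signature; if generation fails on a given kernel, retry with flash_attention disabled as noted above.

```python
from gptqmodel import GPTQModel

# Model id is hypothetical; substitute any GPTQ-quantized checkpoint.
model = GPTQModel.load("ModelCloud/example-gptq-4bit")
model.compile()  # new in v1.7.4: torch.compile pass, ~4-8% higher tps

inp = model.tokenizer("Hello, world", return_tensors="pt").to(model.device)
out = model.generate(**inp)
print(model.tokenizer.decode(out[0]))
```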

Full Changelog: v1.7.3...v1.7.4

GPTQModel v1.7.3

21 Jan 00:14
5c1a7e8

What's Changed

⚡ Telechat2 (China Telecom) model support
⚡ PhiMoE model support
🐛 Fix lm_head weights being duplicated in post-quantize save() for models with tied embeddings.

Full Changelog: v1.7.2...v1.7.3

GPTQModel v1.7.2

19 Jan 03:52
d762379

What's Changed

⚡ Effective BPW (bits per weight) will now be logged during load() (worked example below).
⚡ Reduce loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduce memory usage in MLX conversion.
🐛 Fix Marlin kernel auto-select not checking CUDA compute version.
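
For context on the logged number, here is a back-of-envelope effective-bpw estimate. The overhead model (one fp16 scale plus one packed zero-point per quantization group) is a common GPTQ layout assumption, not the library's exact accounting, which also depends on any unquantized layers.

```python
# Rough effective bits-per-weight for a GPTQ-quantized layer.
# Assumes one fp16 scale (16 bits) and one packed zero-point (`bits` wide)
# per group of `group_size` weights.
def effective_bpw(bits: int = 4, group_size: int = 128) -> float:
    overhead = (16 + bits) / group_size  # scale + zero-point amortized per weight
    return bits + overhead

print(effective_bpw())        # 4.15625 for 4-bit, group_size=128
print(effective_bpw(8, 32))   # 8.75 for 8-bit, group_size=32
```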

Full Changelog: v1.7.0...v1.7.2

GPTQModel v1.7.0

17 Jan 01:34
d247fd0

What's Changed

⚡ backend.MLX added for runtime conversion and execution of GPTQ models via Apple's MLX framework on Apple Silicon (M1+). Export of GPTQ models to MLX is also now possible (sketch below). We have added MLX-exported models to huggingface.co/ModelCloud.
⚡ lm_head quantization is now fully supported by GPTQModel without an external package dependency.
🐛 Fixed setup.py not correctly detecting incompatible setuptools/wheel packages.
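
A sketch combining the two features above. The lm_head flag on QuantizeConfig and the BACKEND.MLX constant are assumptions inferred from these notes; check the project README for the exact spelling.

```python
from gptqmodel import GPTQModel, QuantizeConfig, BACKEND

# lm_head=True reflects the newly native lm_head quantization (assumption).
cfg = QuantizeConfig(bits=4, group_size=128, lm_head=True)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", cfg)
model.quantize(["calibration sample one.", "calibration sample two."])
model.save("llama-3.2-1b-gptq-4bit")

# Runtime conversion + execution on Apple Silicon (M1+) via the MLX backend.
mlx_model = GPTQModel.load("llama-3.2-1b-gptq-4bit", backend=BACKEND.MLX)
```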

Full Changelog: v1.6.1...v1.7.0

GPTQModel v1.6.1

09 Jan 03:40
0c6452b

What's Changed

🎉 New OpenAI-API-compatible endpoint via model.serve(host, port) (sketch below).
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed sym=False loading regression.
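
A sketch of the new endpoint; model.serve(host, port) is named in the note above, while the model id and the assumption that the call blocks while serving are illustrative.

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/example-gptq-4bit")  # hypothetical id

# Serves an OpenAI-compatible HTTP API on the given host/port.
model.serve(host="127.0.0.1", port=8000)
```

Assuming standard OpenAI routes, an existing OpenAI SDK client pointed at http://127.0.0.1:8000/v1 should then work unchanged.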

Full Changelog: v1.6.0...v1.6.1

GPTQModel v1.6.0

06 Jan 08:00
c5c2677

What's Changed

⚡ 25% faster quantization. 35% reduction in vram usage vs v1.5. 👀
🎉 AMD ROCm (6.2+) support added and validated on 7900XT+ GPUs.
💫 Auto-tokenizer loader via the load() API (sketch below). For most models you no longer need to manually initialize a tokenizer for inference or quantization.
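
A sketch of the auto-tokenizer flow; the model.tokenizer attribute name is an assumption based on this note.

```python
from gptqmodel import GPTQModel

# load() now also initializes the tokenizer, so a separate
# AutoTokenizer.from_pretrained() call is no longer needed for most models.
model = GPTQModel.load("ModelCloud/example-gptq-4bit")  # hypothetical id

inp = model.tokenizer("GPTQModel is", return_tensors="pt").to(model.device)
print(model.tokenizer.decode(model.generate(**inp)[0]))
```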

Full Changelog: v1.5.1...v1.6.0

GPTQModel v1.5.1

01 Jan 08:39
4f18747

What's Changed

🎉 2025!

⚡ Added QuantizeConfig.device to clearly define which device is used for quantization (default: auto; sketch below). Non-quantized models are always loaded on cpu by default, and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage.
💫 Improve QuantLinear selection from optimum.
🐛 Fix attn_implementation_autoset compat in latest transformers.
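
A sketch of the new device control; the accepted values ("auto", "cpu", "cuda:0", ...) are an assumption beyond the default named above.

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Pin quantization work to one GPU. Per the note above, the unquantized
# model stays on cpu and each layer is moved to this device only while
# it is being quantized, minimizing peak vram.
cfg = QuantizeConfig(bits=4, group_size=128, device="cuda:0")

model = GPTQModel.load("facebook/opt-125m", cfg)
model.quantize(["calibration sample one.", "calibration sample two."])
model.save("opt-125m-gptq-4bit")
```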

Full Changelog: v1.5.0...v1.5.1

GPTQModel v1.5.0

24 Dec 02:01
4197cd8

What's Changed

⚡ Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less-than-optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only the text layers are quantized (sketch below).
🐛 Fixed Qwen 2-VL model quantization vram usage and post-quant copying of relevant config files.
🐛 Fixed install/compilation in environments with an incorrect TORCH_CUDA_ARCH_LIST set (Nvidia Docker images).
🐛 Warn about broken torch[cuda] installs on Windows.
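
A rough sketch of image-aware calibration for a VL model. The record shape below (paired image path + prompt) is a hypothetical illustration, not the library's documented schema; the point is that calibration activations now reflect real multi-modal inputs while only the text layers are quantized.

```python
from gptqmodel import GPTQModel, QuantizeConfig

cfg = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load("Qwen/Qwen2-VL-2B-Instruct", cfg)

# Hypothetical calibration record shape: image + matching text prompt.
calibration = [
    {"image": "images/cat.jpg", "text": "Describe the animal in the photo."},
    {"image": "images/chart.png", "text": "Summarize this chart."},
]
model.quantize(calibration)
model.save("qwen2-vl-2b-gptq-4bit")
```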

Full Changelog: v1.4.5...v1.5.0

GPTQModel v1.4.5

19 Dec 12:16
9012892

What's Changed

⚡ Windows 11 support added/validated with DynamicCuda and Torch kernels.
⚡ Ovis 1.6 VL model support with image data calibration.
⚡ Reduced quantization vram usage.
🐛 Fixed dynamic-controlled layer loading logic (config sketch below).
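
For reference, a sketch of the dynamic per-layer control that this fix touches. The regex-keyed override format with "+:"/"-:" prefixes is an assumption based on the feature name; consult the docs for the real schema.

```python
from gptqmodel import QuantizeConfig

# Hypothetical dynamic rules: module-name regex -> per-module overrides.
# "+:" applies overrides to matching modules; "-:" skips them entirely.
cfg = QuantizeConfig(
    bits=4,
    group_size=128,
    dynamic={
        r"+:.*\.mlp\..*": {"bits": 8, "group_size": 64},  # wider bits for mlp
        r"-:.*\.attn\..*": {},                            # leave attention unquantized
    },
)
```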

Full Changelog: v1.4.4...v1.4.5

GPTQModel v1.4.4 Patch

17 Dec 14:48
92266fa

What's Changed

⚡ Reduced memory usage during quantization
⚡ Fix device_map={"":"auto"} compat (sketch below).
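
A one-line sketch of the fixed call; the model id is hypothetical.

```python
from gptqmodel import GPTQModel

# device_map={"": "auto"} (the HF accelerate blanket form) now loads correctly.
model = GPTQModel.load("ModelCloud/example-gptq-4bit", device_map={"": "auto"})
```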

Full Changelog: v1.4.2...v1.4.4