Releases: ModelCloud/GPTQModel
GPTQModel v1.7.4
What's Changed
⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New compile() api that lets torch improve tps by ~4-8%; a usage sketch follows these notes. flash_attention may need to be disabled for some kernels.
🐛 Fix HF Transformers bug that downcast the fast tokenizer class on save.
🐛 Fix inaccurate bpw calculations.
🐛 Fix ROCm compile with setup.py.
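A minimal sketch of the new compile() api, assuming a zero-argument call and a hypothetical quantized model id; check the repo docs for tunable options:

```python
from gptqmodel import GPTQModel

# Load a quantized model, then enable the new compile() api named above.
# The zero-argument call and the model id are assumptions for illustration.
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")
model.compile()  # torch-compiled forward pass: ~4-8% tps gain per these notes
# If generation errors on your kernel, try loading with flash_attention disabled.
```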
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() codes by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- suppress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] fix incorrectly saved the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4
GPTQModel v1.7.3
What's Changed
⚡ Telechat2 (China Telecom) model support
⚡ PhiMoE model support
🐛 Fix lm_head weights being duplicated in post-quantize save() for models with tied embeddings.
- Add util.tensor_parameters() by @ZX-ModelCloud in #1107
- add require_dtype by @LRL-ModelCloud in #1109
- [MODEL] Add Telechat2 (China Telecom) by @1096125073 in #1106
- [FIX] Filter weight-sharing tensors when save by @ZX-ModelCloud in #1112
- Add telechat test by @LRL-ModelCloud in #1111
- [FIX] fix convert_gptq_to_mlx_weights by @LRL-ModelCloud in #1113
- add test_parameter_count.py by @ZX-ModelCloud in #1115
- Add gpqa eval task by @CL-ModelCloud in #1117
- [FIX] Call tied_weights() after load_checkpoint_in_model() by @ZX-ModelCloud in #1119
- add phimoe support by @CSY-ModelCloud in #1118
New Contributors
- @1096125073 made their first contribution in #1106
Full Changelog: v1.7.2...v1.7.3
GPTQModel v1.7.2
What's Changed
⚡ Effective BPW (bits per weight) will now be logged during load(); a back-of-envelope sketch follows these notes.
⚡ Reduce loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduce memory usage in MLX conversion.
🐛 Fix Marlin kernel auto-select not checking CUDA compute version.
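For intuition, a rough estimate of the logged bpw value under a standard GPTQ layout (one fp16 scale and one packed zero-point per group); the exact accounting in GPTQModel may differ:

```python
# Rough bpw estimate for a GPTQ layout: each group of `group_size` weights
# shares one fp16 scale and one `bits`-wide zero-point. Illustrative only;
# GPTQModel's exact accounting may include other metadata.
def estimate_bpw(bits: int = 4, group_size: int = 128, scale_bits: int = 16) -> float:
    per_group_overhead = scale_bits + bits  # one scale + one zero-point per group
    return bits + per_group_overhead / group_size

print(f"{estimate_bpw(4, 128):.3f}")  # ~4.156 bpw for 4-bit, group_size=128
```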
- remove catching module error by @CSY-ModelCloud in #1088
- [FIX] monkey patch GPTQShuffle.convert_idx to use fixed convert_idx by @LRL-ModelCloud in #1090
- [FIX] monkey patch only once by @LRL-ModelCloud in #1091
- check CC >= 8 for marlin, fixed #1092 by @CSY-ModelCloud in #1093
- check compute capability for marlin in validate_device() by @CSY-ModelCloud in #1095
- torch get device with index of CUDA_VISIBLE_DEVICES, not value of it by @CSY-ModelCloud in #1096
- fix local model path & marlin test by @CSY-ModelCloud in #1097
- mod bits info by @CL-ModelCloud in #1100
- Reduce memory usage in mlx conversion by @Qubitium in #1099
- cleanup mlx code by @Qubitium in #1101
Full Changelog: v1.7.0...v1.7.2
GPTQModel v1.7.0
What's Changed
⚡ backend.MLX added for runtime conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+).
⚡ Export of gptq models to mlx is also now possible. We have added mlx exported models to huggingface.co/ModelCloud.
⚡ lm_head quantization now fully supported by GPTQModel without external pkg dependency. A combined sketch of the MLX backend and lm_head option follows these notes.
🐛 Fixed setup.py not correctly detecting incompatible setuptools/wheel pkgs.
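A combined sketch under assumed argument names (the `lm_head` and `backend` kwargs are inferred from these notes; the model id is hypothetical):

```python
from gptqmodel import GPTQModel, QuantizeConfig, BACKEND

# 1) Quantize with the lm_head option named above (kwarg name assumed).
quant_config = QuantizeConfig(bits=4, group_size=128, lm_head=True)
# ... model.quantize(calibration_data) and model.save(out_dir) as usual ...

# 2) Run an existing GPTQ model through the MLX backend on Apple Silicon (M1+).
mlx_model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # hypothetical model id
    backend=BACKEND.MLX,  # runtime conversion to MLX
)
```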
- [CI] run tests with linux tag by @CSY-ModelCloud in #1067
- Add backend.MLX by @LRL-ModelCloud in #1061
- add mlx generate test by @CL-ModelCloud in #1069
- [CI] upload source in build step by @CSY-ModelCloud in #1070
- code review by @CL-ModelCloud in #1072
- [CI] install mlx by @CSY-ModelCloud in #1071
- Add option to quantize lm_head by @ZX-ModelCloud in #1037
- fix test_packing by @LRL-ModelCloud in #1073
- [CI] add mlx test by @CSY-ModelCloud in #1074
- [CI] fix ci release env name by @CSY-ModelCloud in #1078
- update mlx test by @CSY-ModelCloud in #1079
- convert to mlx support desc_act true by @LRL-ModelCloud in #1082
- [CI] add extra-index-url for pip install by @CSY-ModelCloud in #1083
- catch module error for setup.py by @CSY-ModelCloud in #1084
Full Changelog: v1.6.1...v1.7.0
GPTQModel v1.6.1
What's Changed
🎉 New OpenAI api compatible end-point via model.serve(host, port); a usage sketch follows these notes.
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed sym=False loading regression.
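A minimal sketch of the new end-point, assuming the call blocks and serves OpenAI-style routes (route layout not specified in these notes); the model id is hypothetical:

```python
from gptqmodel import GPTQModel

# Start the OpenAI api compatible end-point named above.
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")
model.serve(host="127.0.0.1", port=8000)  # any OpenAI client pointed at this host:port should work
```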
- code opt by @CL-ModelCloud in #1038
- fix marlin validate rocm & do validate() if backend not AUTO by @CSY-ModelCloud in #1040
- add global rocm check by @CSY-ModelCloud in #1043
- [FIX] pass sym to make_quant by @LRL-ModelCloud in #1046
- enable flash attn for loading quantized by @CSY-ModelCloud in #1045
- add flash_attn2 test by @CSY-ModelCloud in #1047
- enable flash_attention only when device is cuda by @CSY-ModelCloud in #1050
- move flash attn test to correct folder by @CSY-ModelCloud in #1052
- Expose openai server api by @CL-ModelCloud in #1048
- update openai server by @CL-ModelCloud in #1058
- don't download whl for xpu env by @CSY-ModelCloud in #1059
- remove build tag for normal release by @CSY-ModelCloud in #1063
- disable flash attn 2 for internlm by @CSY-ModelCloud in #1065
Full Changelog: v1.6.0...v1.6.1
GPTQModel v1.6.0
What's Changed
⚡ 25% faster quantization. 35% reduction in vram usage vs v1.5. 👀
🎉 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU.
💫 Auto-tokenizer loader via load() api. For most models you no longer need to manually init a tokenizer for inference or quantization; a sketch follows these notes.
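A sketch of the auto-tokenizer loader; the `tokenizer` attribute name and the model id are assumptions based on these notes:

```python
from gptqmodel import GPTQModel

# load() now wires up a tokenizer automatically; attribute name is assumed.
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")
ids = model.tokenizer("The capital of France is", return_tensors="pt").input_ids.to(model.device)
print(model.tokenizer.decode(model.generate(ids)[0]))
```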
- note about batch_size to speed up quant by @Qubitium in #992
- Add ROCm support by @CSY-ModelCloud in #993
- Add bits test by @ZX-ModelCloud in #995
- note about rocm support by @Qubitium in #998
- [FIX] wrong variable name by @ZX-ModelCloud in #997
- update rocm version tag by @CSY-ModelCloud in #999
- Auto-tokenizer will be called within load() by @LRL-ModelCloud in #996
- update transformers by @Qubitium in #1001
- [FIX] torch qlinear forward by @ZX-ModelCloud in #1002
- cleanup marlin info by @Qubitium in #1004
- Use custom forward hook by @LRL-ModelCloud in #1003
- fix hooked linear init by @LRL-ModelCloud in #1011
- add HookedConv1D by @LRL-ModelCloud in #1012
- record fwd time by @LRL-ModelCloud in #1013
- add PYTORCH_CUDA_ALLOC_CONF for global & do ruff by @CSY-ModelCloud in #1015
- [FIX] quantize_config could not read from config.json by @ZX-ModelCloud in #1022
- Fix quant time by @LRL-ModelCloud in #1025
- fix forward hook by @LRL-ModelCloud in #1027
- Fix hooked conv2d by @LRL-ModelCloud in #1030
- clean cache by @CL-ModelCloud in #1032
Full Changelog: v1.5.1...v1.6.0
GPTQModel v1.5.1
What's Changed
🎉 2025!
⚡ Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by default, and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage; a sketch follows these notes.
💫 Improve QuantLinear selection from optimum.
🐛 Fix attn_implementation_autoset compat in latest transformers.
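A sketch of pinning the quantization device; "cuda:0" is an illustrative choice and the calibration step is elided:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Layers are moved to this device one at a time during quantization,
# while the unquantized model itself stays on cpu (per the note above).
quant_config = QuantizeConfig(bits=4, group_size=128, device="cuda:0")
model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quantize_config=quant_config)
# model.quantize(calibration_dataset) then model.save(out_dir) as usual
```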
- Add QuantizeConfig.device and use. by @Qubitium in #950
- fix hf_select_quant_linear by @LRL-ModelCloud in #966
- update vllm gptq_marlin code by @ZX-ModelCloud in #967
- fix cuda:0 not a enum device by @CSY-ModelCloud in #968
- fix marlin info for non-cuda device by @Qubitium in #972
- fix backend str bug by @CL-ModelCloud in #973
- hf select quant_linear with pack by @LRL-ModelCloud in #969
- remove auto select BACKEND.IPEX by @CSY-ModelCloud in #975
- fix autoround received a device_map by @CSY-ModelCloud in #976
- use enum instead of magic number by @CSY-ModelCloud in #979
- use new ci docker images by @CSY-ModelCloud in #980
- fix flash attention was auto loaded on cpu for pretrained model by @CSY-ModelCloud in #981
- fix old transformer doesn't have _attn_implementation_autoset by @CSY-ModelCloud in #982
- fix gptbigcode test temporarily by @CSY-ModelCloud in #983
- fix version parsing by @CSY-ModelCloud in #985
Full Changelog: v1.5.0...v1.5.1
GPTQModel v1.5.0
What's Changed
⚡ Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized. A quantization sketch follows these notes.
🐛 Fixed Qwen 2-VL model quantization vram usage and post-quant file copy of relevant config files.
🐛 Fixed install/compilation in envs with a wrong TORCH_CUDA_ARCH_LIST set (Nvidia docker images).
🐛 Warn about bad torch[cuda] install on Windows
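A sketch of the multi-modal path; the calibration record schema here is an assumption, so check the repo's Qwen 2-VL example for the real format:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Only text layers are quantized; images ride along as calibration data.
calibration = [
    {"image": "sample.jpg", "text": "Describe this image."},  # assumed record schema
]
model = GPTQModel.load("Qwen/Qwen2-VL-2B-Instruct", quantize_config=QuantizeConfig(bits=4, group_size=128))
model.quantize(calibration)
model.save("qwen2-vl-gptqmodel-4bit")
```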
- Fix backend not ipex by @CSY-ModelCloud in #930
- Fix broken ipex check by @Qubitium in #933
- Fix dynamic_cuda validation by @CSY-ModelCloud in #936
- Fix bdist_wheel does not exist on old setuptools by @CSY-ModelCloud in #939
- Add cuda warning on windows by @CSY-ModelCloud in #942
- Add torch inference benchmark by @CL-ModelCloud in #940
- Add modality to BaseModel by @ZX-ModelCloud in #937
- [FIX] qwen_vl_utils should be locally import by @ZX-ModelCloud in #946
- Filter torch cuda arch < 6.0 by @CSY-ModelCloud in #955
- [FIX] wrong filepath was used when model_id_or_path was hugging model id by @ZX-ModelCloud in #956
- Fix import error was not caught by @CSY-ModelCloud in #961
Full Changelog: v1.4.5...v1.5.0
GPTQModel v1.4.5
What's Changed
⚡ Windows 11 support added/validated with DynamicCuda and Torch kernels.
⚡ Ovis 1.6 VL model support with image data calibration.
⚡ Reduced quantization vram usage.
🐛 Fixed dynamic controlled layer loading logic; a sketch of a dynamic config follows these notes.
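For context, a sketch of what a dynamic per-layer config looks like; the regex key syntax shown is an assumption drawn from the feature's intent, so check the repo docs for the exact matching rules:

```python
from gptqmodel import QuantizeConfig

# `dynamic` overrides the base config for layers whose names match a regex.
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    dynamic={
        r".*\.mlp\..*": {"bits": 8, "group_size": 64},  # heavier precision for mlp layers (illustrative)
    },
)
```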
- Refactor by @Qubitium in #895
- Add platform check by @LRL-ModelCloud in #899
- Exclude marlin & exllama on windows by @CSY-ModelCloud in #898
- Remove unnecessary backslash in the expression & typehint by @CSY-ModelCloud in #903
- Add DEVICE.ALL by @LRL-ModelCloud in #901
- [FIX] the error of loading quantized model with dynamic by @ZX-ModelCloud in #907
- [FIX] gpt2 quantize error by @ZX-ModelCloud in #912
- Simplify checking generated str for vllm test & fix transformers version for cohere2 by @CSY-ModelCloud in #914
- [MODEL] add OVIS support by @ZX-ModelCloud in #685
- Fix IDE warning marlin not in all by @CSY-ModelCloud in #920
Full Changelog: v1.4.4...v1.4.5
GPTQModel v1.4.4 Patch
What's Changed
⚡ Reduced memory usage during quantization
⚡ Fix device_map={"":"auto"} compat; a loading sketch follows these notes.
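A sketch of the restored behavior; passing device_map through load() and the model id are assumptions:

```python
from gptqmodel import GPTQModel

# HF-style device_map whose empty-string key maps the whole model.
model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # hypothetical model id
    device_map={"": "auto"},
)
```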
- Speed up unit tests by @Qubitium in #885
- [FIX] hf select quant linear parse device map by @ZX-ModelCloud in #887
- Avoid cloning on gpu by @Qubitium in #886
- Expose hf_quantize() by @ZX-ModelCloud in #888
- Update integration hf code by @ZX-ModelCloud in #891
- Add back fasterquant() for compat by @Qubitium in #892
Full Changelog: v1.4.2...v1.4.4