Releases: ModelCloud/GPTQModel
GPTQModel v1.7.4
What's Changed
⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New compile() api that lets torch improve tps by ~4-8%; a usage sketch follows these notes. flash_attention may need to be disabled for some kernels.
🐛 Fix HF Transformers bug that downcast the fast tokenizer class on save.
🐛 Fix inaccurate bpw calculations.
🐛 Fix ROCm compile with setup.py.
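A minimal sketch of the new compile() api, assuming a zero-argument call and a hypothetical quantized model id; check the repo docs for tunable options:

```python
from gptqmodel import GPTQModel

# Load a quantized model, then enable the new compile() api named above.
# The zero-argument call and the model id are assumptions for illustration.
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")
model.compile()  # torch-compiled forward pass: ~4-8% tps gain per these notes
# If generation errors on your kernel, try loading with flash_attention disabled.
```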
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() codes by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- suppress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] fix incorrectly saved the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4
GPTQModel v1.7.3
What's Changed
⚡ Telechat2 (China Telecom) model support
⚡ PhiMoE model support
🐛 Fix lm_head weights being duplicated in post-quantize save() for models with tied embeddings.
- Add util.tensor_parameters() by @ZX-ModelCloud in #1107
- add require_dtype by @LRL-ModelCloud in #1109
- [MODEL] Add Telechat2 (China Telecom) by @1096125073 in #1106
- [FIX] Filter weight-sharing tensors when save by @ZX-ModelCloud in #1112
- Add telechat test by @LRL-ModelCloud in #1111
- [FIX] fix convert_gptq_to_mlx_weights by @LRL-ModelCloud in #1113
- add test_parameter_count.py by @ZX-ModelCloud in #1115
- Add gpqa eval task by @CL-ModelCloud in #1117
- [FIX] Call tied_weights() after load_checkpoint_in_model() by @ZX-ModelCloud in #1119
- add phimoe support by @CSY-ModelCloud in #1118
New Contributors
- @1096125073 made their first contribution in #1106
Full Changelog: v1.7.2...v1.7.3
GPTQModel v1.7.2
What's Changed
⚡ Effective BPW (bits per weight) will now be logged during load(); a back-of-envelope sketch follows these notes.
⚡ Reduce loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduce memory usage in MLX conversion.
🐛 Fix Marlin kernel auto-select not checking CUDA compute version.
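For intuition, a rough estimate of the logged bpw value under a standard GPTQ layout (one fp16 scale and one packed zero-point per group); the exact accounting in GPTQModel may differ:

```python
# Rough bpw estimate for a GPTQ layout: each group of `group_size` weights
# shares one fp16 scale and one `bits`-wide zero-point. Illustrative only;
# GPTQModel's exact accounting may include other metadata.
def estimate_bpw(bits: int = 4, group_size: int = 128, scale_bits: int = 16) -> float:
    per_group_overhead = scale_bits + bits  # one scale + one zero-point per group
    return bits + per_group_overhead / group_size

print(f"{estimate_bpw(4, 128):.3f}")  # ~4.156 bpw for 4-bit, group_size=128
```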
- remove catching module error by @CSY-ModelCloud in #1088
- [FIX] monkey patch GPTQShuffle.convert_idx to use fixed convert_idx by @LRL-ModelCloud in #1090
- [FIX] monkey patch only once by @LRL-ModelCloud in #1091
- check CC >= 8 for marlin, fixed #1092 by @CSY-ModelCloud in #1093
- check compute capability for marlin in validate_device() by @CSY-ModelCloud in #1095
- torch get device with index of CUDA_VISIBLE_DEVICES, not value of it by @CSY-ModelCloud in #1096
- fix local model path & marlin test by @CSY-ModelCloud in #1097
- mod bits info by @CL-ModelCloud in #1100
- Reduce memory usage in mlx conversion by @Qubitium in #1099
- cleanup mlx code by @Qubitium in #1101
Full Changelog: v1.7.0...v1.7.2
GPTQModel v1.7.0
What's Changed
⚡ backend.MLX added for runtime conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+).
⚡ Export of gptq models to mlx is also now possible. We have added mlx exported models to huggingface.co/ModelCloud.
⚡ lm_head quantization now fully supported by GPTQModel without external pkg dependency. A combined sketch of the MLX backend and lm_head option follows these notes.
🐛 Fixed setup.py not correctly detecting incompatible setuptools/wheel pkgs.
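A combined sketch under assumed argument names (the `lm_head` and `backend` kwargs are inferred from these notes; the model id is hypothetical):

```python
from gptqmodel import GPTQModel, QuantizeConfig, BACKEND

# 1) Quantize with the lm_head option named above (kwarg name assumed).
quant_config = QuantizeConfig(bits=4, group_size=128, lm_head=True)
# ... model.quantize(calibration_data) and model.save(out_dir) as usual ...

# 2) Run an existing GPTQ model through the MLX backend on Apple Silicon (M1+).
mlx_model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # hypothetical model id
    backend=BACKEND.MLX,  # runtime conversion to MLX
)
```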
- [CI] run tests with linux tag by @CSY-ModelCloud in #1067
- Add backend.MLX by @LRL-ModelCloud in #1061
- add mlx generate test by @CL-ModelCloud in #1069
- [CI] upload source in build step by @CSY-ModelCloud in #1070
- code review by @CL-ModelCloud in #1072
- [CI] install mlx by @CSY-ModelCloud in #1071
- Add option to quantize lm_head by @ZX-ModelCloud in #1037
- fix test_packing by @LRL-ModelCloud in #1073
- [CI] add mlx test by @CSY-ModelCloud in #1074
- [CI] fix ci release env name by @CSY-ModelCloud in #1078
- update mlx test by @CSY-ModelCloud in #1079
- convert to mlx support desc_act true by @LRL-ModelCloud in #1082
- [CI] add extra-index-url for pip install by @CSY-ModelCloud in #1083
- catch module error for setup.py by @CSY-ModelCloud in #1084
Full Changelog: v1.6.1...v1.7.0
GPTQModel v1.6.1
What's Changed
🎉 New OpenAI api compatible end-point via model.serve(host, port); a usage sketch follows these notes.
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed sym=False loading regression.
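A minimal sketch of the new end-point, assuming the call blocks and serves OpenAI-style routes (route layout not specified in these notes); the model id is hypothetical:

```python
from gptqmodel import GPTQModel

# Start the OpenAI api compatible end-point named above.
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")
model.serve(host="127.0.0.1", port=8000)  # any OpenAI client pointed at this host:port should work
```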
- code opt by @CL-ModelCloud in #1038
- fix marlin validate rocm & do validate() if backend not AUTO by @CSY-ModelCloud in #1040
- add global rocm check by @CSY-ModelCloud in #1043
- [FIX] pass sym to make_quant by @LRL-ModelCloud in #1046
- enable flash attn for loading quantized by @CSY-ModelCloud in #1045
- add flash_attn2 test by @CSY-ModelCloud in #1047
- enable flash_attention only when device is cuda by @CSY-ModelCloud in #1050
- move flash attn test to correct folder by @CSY-ModelCloud in #1052
- Expose openai server api by @CL-ModelCloud in #1048
- update openai server by @CL-ModelCloud in #1058
- don't download whl for xpu env by @CSY-ModelCloud in #1059
- remove build tag for normal release by @CSY-ModelCloud in #1063
- disable flash attn 2 for internlm by @CSY-ModelCloud in #1065
Full Changelog: v1.6.0...v1.6.1
GPTQModel v1.6.0
What's Changed
⚡ 25% faster quantization. 35% reduction in vram usage vs v1.5. 👀
🎉 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU.
💫 Auto-tokenizer loader via load() api. For most models you no longer need to manually init a tokenizer for inference or quantization; a sketch follows these notes.
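A sketch of the auto-tokenizer loader; the `tokenizer` attribute name and the model id are assumptions based on these notes:

```python
from gptqmodel import GPTQModel

# load() now wires up a tokenizer automatically; attribute name is assumed.
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")
ids = model.tokenizer("The capital of France is", return_tensors="pt").input_ids.to(model.device)
print(model.tokenizer.decode(model.generate(ids)[0]))
```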
- note about batch_size to speed up quant by @Qubitium in #992
- Add ROCm support by @CSY-ModelCloud in #993
- Add bits test by @ZX-ModelCloud in #995
- note about rocm support by @Qubitium in #998
- [FIX] wrong variable name by @ZX-ModelCloud in #997
- update rocm version tag by @CSY-ModelCloud in #999
- Auto-tokenizer will be called within load() by @LRL-ModelCloud in #996
- update transformers by @Qubitium in #1001
- [FIX] torch qlinear forward by @ZX-ModelCloud in #1002
- cleanup marlin info by @Qubitium in #1004
- Use custom forward hook by @LRL-ModelCloud in #1003
- fix hooked linear init by @LRL-ModelCloud in #1011
- add HookedConv1D by @LRL-ModelCloud in #1012
- record fwd time by @LRL-ModelCloud in #1013
- add PYTORCH_CUDA_ALLOC_CONF for global & do ruff by @CSY-ModelCloud in #1015
- [FIX] quantize_config could not read from config.json by @ZX-ModelCloud in #1022
- Fix quant time by @LRL-ModelCloud in #1025
- fix forward hook by @LRL-ModelCloud in #1027
- Fix hooked conv2d by @LRL-ModelCloud in #1030
- clean cache by @CL-ModelCloud in #1032
Full Changelog: v1.5.1...v1.6.0
GPTQModel v1.5.1
What's Changed
🎉 2025!
⚡ Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by default, and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage; a sketch follows these notes.
💫 Improve QuantLinear selection from optimum.
🐛 Fix attn_implementation_autoset compat in latest transformers.
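A sketch of pinning the quantization device; "cuda:0" is an illustrative choice and the calibration step is elided:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Layers are moved to this device one at a time during quantization,
# while the unquantized model itself stays on cpu (per the note above).
quant_config = QuantizeConfig(bits=4, group_size=128, device="cuda:0")
model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quantize_config=quant_config)
# model.quantize(calibration_dataset) then model.save(out_dir) as usual
```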
- Add QuantizeConfig.device and use. by @Qubitium in #950
- fix hf_select_quant_linear by @LRL-ModelCloud in #966
- update vllm gptq_marlin code by @ZX-ModelCloud in #967
- fix cuda:0 not a enum device by @CSY-ModelCloud in #968
- fix marlin info for non-cuda device by @Qubitium in #972
- fix backend str bug by @CL-ModelCloud in #973
- hf select quant_linear with pack by @LRL-ModelCloud in #969
- remove auto select BACKEND.IPEX by @CSY-ModelCloud in #975
- fix autoround received a device_map by @CSY-ModelCloud in #976
- use enum instead of magic number by @CSY-ModelCloud in #979
- use new ci docker images by @CSY-ModelCloud in #980
- fix flash attention was auto loaded on cpu for pretrained model by @CSY-ModelCloud in #981
- fix old transformer doesn't have _attn_implementation_autoset by @CSY-ModelCloud in #982
- fix gptbigcode test temporarily by @CSY-ModelCloud in #983
- fix version parsing by @CSY-ModelCloud in #985
Full Changelog: v1.5.0...v1.5.1
GPTQModel v1.5.0
What's Changed
⚡ Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized. A quantization sketch follows these notes.
🐛 Fixed Qwen 2-VL model quantization vram usage and post-quant file copy of relevant config files.
🐛 Fixed install/compilation in envs with a wrong TORCH_CUDA_ARCH_LIST set (Nvidia docker images).
🐛 Warn about bad torch[cuda] install on Windows
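A sketch of the multi-modal path; the calibration record schema here is an assumption, so check the repo's Qwen 2-VL example for the real format:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Only text layers are quantized; images ride along as calibration data.
calibration = [
    {"image": "sample.jpg", "text": "Describe this image."},  # assumed record schema
]
model = GPTQModel.load("Qwen/Qwen2-VL-2B-Instruct", quantize_config=QuantizeConfig(bits=4, group_size=128))
model.quantize(calibration)
model.save("qwen2-vl-gptqmodel-4bit")
```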
- Fix backend not ipex by @CSY-ModelCloud in #930
- Fix broken ipex check by @Qubitium in #933
- Fix dynamic_cuda validation by @CSY-ModelCloud in #936
- Fix bdist_wheel does not exist on old setuptools by @CSY-ModelCloud in #939
- Add cuda warning on windows by @CSY-ModelCloud in #942
- Add torch inference benchmark by @CL-ModelCloud in #940
- Add modality to BaseModel by @ZX-ModelCloud in #937
- [FIX] qwen_vl_utils should be locally import by @ZX-ModelCloud in #946
- Filter torch cuda arch < 6.0 by @CSY-ModelCloud in #955
- [FIX] wrong filepath was used when model_id_or_path was hugging model id by @ZX-ModelCloud in #956
- Fix import error was not caught by @CSY-ModelCloud in #961
Full Changelog: v1.4.5...v1.5.0
GPTQModel v1.4.5
What's Changed
⚡ Windows 11 support added/validated with DynamicCuda and Torch kernels.
⚡ Ovis 1.6 VL model support with image data calibration.
⚡ Reduced quantization vram usage.
🐛 Fixed dynamic controlled layer loading logic; a sketch of a dynamic config follows these notes.
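For context, a sketch of what a dynamic per-layer config looks like; the regex key syntax shown is an assumption drawn from the feature's intent, so check the repo docs for the exact matching rules:

```python
from gptqmodel import QuantizeConfig

# `dynamic` overrides the base config for layers whose names match a regex.
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    dynamic={
        r".*\.mlp\..*": {"bits": 8, "group_size": 64},  # heavier precision for mlp layers (illustrative)
    },
)
```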
- Refactor by @Qubitium in #895
- Add platform check by @LRL-ModelCloud in #899
- Exclude marlin & exllama on windows by @CSY-ModelCloud in #898
- Remove unnecessary backslash in the expression & typehint by @CSY-ModelCloud in #903
- Add DEVICE.ALL by @LRL-ModelCloud in #901
- [FIX] the error of loading quantized model with dynamic by @ZX-ModelCloud in #907
- [FIX] gpt2 quantize error by @ZX-ModelCloud in #912
- Simplify checking generated str for vllm test & fix transformers version for cohere2 by @CSY-ModelCloud in #914
- [MODEL] add OVIS support by @ZX-ModelCloud in #685
- Fix IDE warning marlin not in all by @CSY-ModelCloud in #920
Full Changelog: v1.4.4...v1.4.5
GPTQModel v1.4.4 Patch
What's Changed
⚡ Reduced memory usage during quantization
⚡ Fix device_map={"":"auto"} compat; a loading sketch follows these notes.
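A sketch of the restored behavior; passing device_map through load() and the model id are assumptions:

```python
from gptqmodel import GPTQModel

# HF-style device_map whose empty-string key maps the whole model.
model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # hypothetical model id
    device_map={"": "auto"},
)
```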
- Speed up unit tests by @Qubitium in #885
- [FIX] hf select quant linear parse device map by @ZX-ModelCloud in #887
- Avoid cloning on gpu by @Qubitium in #886
- Expose hf_quantize() by @ZX-ModelCloud in #888
- Update integration hf code by @ZX-ModelCloud in #891
- Add back fasterquant() for compat by @Qubitium in #892
Full Changelog: v1.4.2...v1.4.4