GPTQModel v1.6.0
What's Changed
⚡ 25% faster quantization and 35% lower VRAM usage vs v1.5. 👀
🎉 AMD ROCm (6.2+) support added and validated on 7900XT+ GPUs.
💫 Auto-tokenizer loading via the `load()` API. For most models you no longer need to manually initialize a tokenizer for inference or quantization.
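A minimal sketch of the new behavior, assuming the auto-loaded tokenizer is attached as `model.tokenizer` and the model exposes a `device` attribute (the model id below is only an example):

```python
from gptqmodel import GPTQModel

# load() now also resolves and attaches the tokenizer for most models,
# so a separate AutoTokenizer.from_pretrained() call is not required.
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # example quantized model id

prompt = "The capital of France is"
# Assumes the auto-loaded tokenizer is exposed as model.tokenizer.
inputs = model.tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))
```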
- note about `batch_size` to speed up quant by @Qubitium in #992 (see the quantization sketch after this list)
- Add ROCm support by @CSY-ModelCloud in #993
- Add bits test by @ZX-ModelCloud in #995
- note about rocm support by @Qubitium in #998
- [FIX] wrong variable name by @ZX-ModelCloud in #997
- update rocm version tag by @CSY-ModelCloud in #999
- Auto-tokenizer will be called within `load()` by @LRL-ModelCloud in #996
- update transformers by @Qubitium in #1001
- [FIX] torch qlinear forward by @ZX-ModelCloud in #1002
- cleanup marlin info by @Qubitium in #1004
- Use custom forward hook by @LRL-ModelCloud in #1003
- fix hooked linear init by @LRL-ModelCloud in #1011
- add HookedConv1D by @LRL-ModelCloud in #1012
- record fwd time by @LRL-ModelCloud in #1013
- add PYTORCH_CUDA_ALLOC_CONF for global & do ruff by @CSY-ModelCloud in #1015
- [FIX] quantize_config could not read from config.json by @ZX-ModelCloud in #1022
- Fix quant time by @LRL-ModelCloud in #1025
- fix forward hook by @LRL-ModelCloud in #1027
- Fix hooked conv2d by @LRL-ModelCloud in #1030
- clean cache by @CL-ModelCloud in #1032
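As a rough sketch of the `batch_size` note referenced in #992 above, assuming `quantize()` accepts a `batch_size` argument and a plain list of calibration texts:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Tiny illustrative calibration set; real runs typically use hundreds of samples.
calibration_dataset = [
    "GPTQModel quantizes large language models to low-bit weights.",
    "Calibration text is passed through the model to collect activation statistics.",
    "The capital of France is Paris.",
    "ROCm and CUDA backends are both supported for quantized inference.",
]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)  # example base model id

# A larger batch_size processes more calibration samples per forward pass,
# trading extra VRAM for shorter overall quantization time.
model.quantize(calibration_dataset, batch_size=4)

model.save("Llama-3.2-1B-Instruct-gptqmodel-4bit")
```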
Full Changelog: v1.5.1...v1.6.0