A comprehensive toolkit for quantizing large language models to GGUF format with support for multiple acceleration backends (CUDA, Metal, CPU).
| Feature | Status | Description |
|---|---|---|
| 🖥️ Bare Metal | ✅ | Native installation without Docker |
| 🔧 Auto Setup | ✅ | Automatic environment detection and configuration |
| 🎯 Multi-Backend | ✅ | CUDA, Metal (Apple Silicon), and CPU support |
| 📦 Conda Ready | ✅ | Complete conda environment with all dependencies |
| ⚡ Quick Scripts | ✅ | Convenient scripts for common tasks |
| 📊 Perplexity | ✅ | Automated quality testing of quantized models |
| 🔍 Validation | ✅ | Environment health checks and troubleshooting |
| Requirement | Minimum Version | Notes |
|---|---|---|
| Conda | Latest | Miniconda or Anaconda |
| Python | 3.11+ | Installed via conda |
| Git | 2.0+ | For repository operations |
| CMake | 3.14+ | For building llama.cpp |
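A quick way to confirm these are in place before continuing (standard version flags only; nothing project-specific is assumed):

```bash
# Print the version of each prerequisite; "command not found" means it is missing
conda --version
python --version   # 3.11+ once the conda environment is active
git --version
cmake --version
```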
| Platform | Requirements | Acceleration |
|---|---|---|
| NVIDIA GPU | CUDA 11.8+ | ✅ CUDA acceleration |
| Apple Silicon | macOS + M1/M2/M3 | ✅ Metal acceleration |
| Others | Any CPU | ✅ Optimized CPU processing |
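To check which acceleration backend your machine qualifies for, standard system tools are enough. A minimal sketch (generic commands, nothing from this repo):

```bash
# NVIDIA: a populated GPU table here means CUDA acceleration is possible
nvidia-smi

# Apple Silicon: "arm64" indicates an M-series chip with Metal support
uname -m

# CPU fallback (Linux): count the cores available for CPU-only quantization
nproc
```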
```bash
# Clone the repository
git clone https://github.com/Vikhrmodels/quantization-utils.git
cd quantization-utils

# Run the automated setup script
chmod +x scripts/setup.sh
./scripts/setup.sh
```
```bash
# Create conda environment (OS-specific)
# For Linux:
conda env create -f environment-linux.yml
# For macOS:
conda env create -f environment-macos.yml
# Generic (fallback):
conda env create -f environment.yml

# Activate environment
conda activate quantization-utils

# Run setup to install llama.cpp and prepare directories
python setup.py

# Add to PATH (if needed)
export PATH="$HOME/.local/bin:$PATH"
```
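The `export PATH` line only affects the current shell session. To make it permanent, append it to your shell profile (shown for bash; adjust the file for zsh or other shells):

```bash
# Persist the PATH change across future bash sessions
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```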
Verify your installation:
```bash
# Check environment health
./scripts/validate.sh

# Quick test
conda activate quantization-utils
cd GGUF
python -c "from shared import validate_environment; validate_environment()"
```
```bash
# Activate environment
conda activate quantization-utils

# Quantize a model with default settings
./scripts/quantize.sh microsoft/DialoGPT-medium

# Custom quantization levels
./scripts/quantize.sh Vikhrmodels/Vikhr-Gemma-2B-instruct -q Q4_K_M,Q5_K_M,Q8_0

# Force re-quantization
./scripts/quantize.sh microsoft/DialoGPT-medium --force
```
```bash
cd GGUF

# Full pipeline with all quantization levels
python pipeline.py --model_id microsoft/DialoGPT-medium

# Specific quantization levels only
python pipeline.py --model_id microsoft/DialoGPT-medium -q Q4_K_M -q Q8_0

# With perplexity testing
python pipeline.py --model_id microsoft/DialoGPT-medium --perplexity

# For gated models (requires an HF token)
python pipeline.py --model_id meta-llama/Llama-2-7b-hf --hf_token $HF_TOKEN
```
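For gated models, the token has to be available before the pipeline starts. Either of the following works; `huggingface-cli login` ships with the `huggingface_hub` package and caches the credential on disk (the token value below is a placeholder):

```bash
# Option 1: export the token for the current session only
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx   # placeholder, use your real token

# Option 2: log in once; huggingface_hub caches the token for later runs
huggingface-cli login
```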
```bash
# Test all quantized versions
./scripts/perplexity.sh microsoft/DialoGPT-medium

# Force recalculation
./scripts/perplexity.sh microsoft/DialoGPT-medium --force
```
```
quantization-utils/
├── environment.yml            # Conda environment definition
├── setup.py                   # Environment setup script
├── README.md                  # This file
│
├── scripts/                   # Convenience scripts
│   ├── setup.sh               # Automated setup
│   ├── validate.sh            # Environment validation
│   ├── quantize.sh            # Quick quantization
│   └── perplexity.sh          # Perplexity testing
│
└── GGUF/                      # Main processing directory
    ├── pipeline.py            # Main pipeline script
    ├── shared.py              # Shared utilities
    ├── models/                # Downloaded models
    ├── imatrix/               # Importance matrices
    ├── output/                # Final quantized models
    ├── resources/             # Calibration data
    │   └── standard_cal_data/
    └── modules/               # Processing modules
        ├── convert.py
        ├── quantize.py
        ├── imatrix.py
        └── perplexity.py
```
| Variable | Description | Example |
|---|---|---|
| `HF_TOKEN` | HuggingFace API token | `hf_...` |
| `CUDA_VISIBLE_DEVICES` | GPU selection | `0,1` |
| `OMP_NUM_THREADS` | CPU threads | `8` |
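As an example, the variables above might be combined to pin a run to a single GPU with eight CPU threads (all values are illustrative):

```bash
export CUDA_VISIBLE_DEVICES=0             # use only the first GPU
export OMP_NUM_THREADS=8                  # limit CPU-side work to 8 threads
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx   # placeholder, only needed for gated models
cd GGUF
python pipeline.py --model_id microsoft/DialoGPT-medium
```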
| Parameter | Description | Default |
|---|---|---|
| `--model_id` | HuggingFace model ID | Required |
| `--quants` | Quantization levels | All levels |
| `--force` | Force reprocessing | False |
| `--perplexity` | Run quality tests | False |
| `--threads` | Processing threads | CPU count |
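Putting the parameters together, a fully specified run might look like this (`-q` is the short form of `--quants` used earlier in this README):

```bash
cd GGUF
python pipeline.py \
  --model_id microsoft/DialoGPT-medium \
  -q Q4_K_M -q Q8_0 \
  --threads 8 \
  --perplexity \
  --force
```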
| Level | Description | Size | Quality |
|---|---|---|---|
| `Q2_K` | 2-bit quantization | Smallest | Good |
| `Q4_K_M` | 4-bit mixed | Balanced | Very Good |
| `Q5_K_M` | 5-bit mixed | Larger | Excellent |
| `Q6_K` | 6-bit | Large | Near Original |
| `Q8_0` | 8-bit | Largest | Original |
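If you only need a single artifact, `Q4_K_M` is the usual size/quality compromise. A sketch of producing just that level and then checking its quality with the scripts above:

```bash
./scripts/quantize.sh microsoft/DialoGPT-medium -q Q4_K_M
./scripts/perplexity.sh microsoft/DialoGPT-medium
```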
| Issue | Solution |
|---|---|
| `conda: command not found` | Install Miniconda/Anaconda |
| `llama-quantize: not found` | Run `python setup.py` |
| CUDA out of memory | Reduce batch size or use CPU |
| Permission denied | Make scripts executable with `chmod +x` |
| `PackagesNotFoundError` | Use the OS-specific environment file |
```bash
# Reset environment
conda env remove -n quantization-utils

# Recreate with the OS-specific file
# Linux:
conda env create -f environment-linux.yml
# macOS:
conda env create -f environment-macos.yml

# Reinstall llama.cpp
rm -rf ~/.local/bin/llama-*
python setup.py

# Check installation
./scripts/validate.sh
```
```bash
# Manual llama.cpp installation
cd /tmp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/.local
make -j$(nproc)
make install
```
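Note that `nproc` is Linux-only. A more portable variant of the same build, using CMake's own build driver (same install prefix assumed):

```bash
cd /tmp/llama.cpp
cmake -B build -DCMAKE_INSTALL_PREFIX="$HOME/.local"
cmake --build build -j                 # -j with no count lets the build tool pick the parallelism
cmake --build build --target install
```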
To add new quantization methods:

- Update `shared.py` with new `Quant` enum values
- Modify `modules/quantize.py` to handle the new methods
- Update the pipeline's default quantization list
- Test with the validation scripts (see the smoke-test sketch below)
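After wiring in a new level, a quick end-to-end smoke test through the existing pipeline might look like this (`Q3_K_M` is just a placeholder for whatever level you added):

```bash
cd GGUF
python pipeline.py --model_id microsoft/DialoGPT-medium -q Q3_K_M --force
```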
To add custom calibration data, place files in `GGUF/resources/standard_cal_data/`. Files should be UTF-8 text with one sample per line.
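For example, to drop in a custom file and confirm it is valid UTF-8 (standard `iconv`; the filename is illustrative):

```bash
# Copy a custom calibration file into place (filename is illustrative)
cp my_domain_samples.txt GGUF/resources/standard_cal_data/

# iconv exits non-zero if the file contains invalid UTF-8 bytes
iconv -f UTF-8 -t UTF-8 GGUF/resources/standard_cal_data/my_domain_samples.txt > /dev/null \
  && echo "valid UTF-8"
```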
| Tip | Description |
|---|---|
| GPU Usage | Use CUDA/Metal for a 5-10x speedup |
| Memory | Monitor RAM usage with large models |
| Batch Size | Adjust based on available memory |
| Threads | Set to the CPU core count for optimal CPU performance |
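For the threads tip, a portable way to match the thread count to the machine's core count:

```bash
# Linux
export OMP_NUM_THREADS=$(nproc)
# macOS
export OMP_NUM_THREADS=$(sysctl -n hw.ncpu)
```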
- Fork the repository
- Create a feature branch
- Test with `./scripts/validate.sh`
- Submit a pull request
This project is licensed under the terms specified in the LICENSE file.
- llama.cpp: https://github.com/ggerganov/llama.cpp
- HuggingFace: https://huggingface.co/
- GGUF Format: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
Ready to quantize? Start with `./scripts/setup.sh` 🚀