# Quantization Utils - Bare Metal Setup

A comprehensive toolkit for quantizing large language models to the GGUF format, with support for multiple acceleration backends (CUDA, Metal, and CPU).

## 🚀 Features

| Feature | Status | Description |
|---|---|---|
| 🖥️ Bare Metal | ✅ | Native installation without Docker |
| 🔧 Auto Setup | ✅ | Automatic environment detection and configuration |
| 🎯 Multi-Backend | ✅ | CUDA, Metal (Apple Silicon), and CPU support |
| 📦 Conda Ready | ✅ | Complete conda environment with all dependencies |
| ⚡ Quick Scripts | ✅ | Convenient scripts for common tasks |
| 📊 Perplexity | ✅ | Automated quality testing of quantized models |
| 🔍 Validation | ✅ | Environment health checks and troubleshooting |

## 📋 Prerequisites

| Requirement | Minimum Version | Notes |
|---|---|---|
| Conda | Latest | Miniconda or Anaconda |
| Python | 3.11+ | Installed via conda |
| Git | 2.0+ | For repository operations |
| CMake | 3.14+ | For building llama.cpp |

### GPU Support (Optional)

| Platform | Requirements | Acceleration |
|---|---|---|
| NVIDIA | CUDA 11.8+ | ✅ CUDA acceleration |
| Apple Silicon | macOS + M1/M2/M3 | ✅ Metal acceleration |
| Others | Any CPU | ✅ Optimized CPU processing |
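For reference, backend selection can be sketched with a simple heuristic like the one below. This is illustrative only; the repository's actual detection logic in `setup.py` may differ.

```python
import platform
import shutil

# Heuristic backend detection (a sketch; the real setup logic may differ).
def detect_backend() -> str:
    if shutil.which("nvidia-smi"):           # NVIDIA driver tooling present
        return "cuda"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"                       # Apple Silicon
    return "cpu"                             # safe fallback

backend = detect_backend()
```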

πŸ› οΈ Quick Setup

Option 1: Automated Setup (Recommended)

# Clone the repository
git clone https://github.com/Vikhrmodels/quantization-utils.git
cd quantization-utils

# Run the automated setup script
chmod +x scripts/setup.sh
./scripts/setup.sh

### Option 2: Manual Setup

```bash
# Create conda environment (OS-specific)
# For Linux:
conda env create -f environment-linux.yml
# For macOS:
conda env create -f environment-macos.yml
# Generic (fallback):
conda env create -f environment.yml

# Activate environment
conda activate quantization-utils

# Run setup to install llama.cpp and prepare directories
python setup.py

# Add ~/.local/bin to PATH (if needed)
export PATH="$HOME/.local/bin:$PATH"
```

πŸ” Validation

Verify your installation:

# Check environment health
./scripts/validate.sh

# Quick test
conda activate quantization-utils
cd GGUF
python -c "from shared import validate_environment; validate_environment()"

## 📊 Usage Examples

### Basic Model Quantization

```bash
# Activate environment
conda activate quantization-utils

# Quantize a model with default settings
./scripts/quantize.sh microsoft/DialoGPT-medium

# Custom quantization levels
./scripts/quantize.sh Vikhrmodels/Vikhr-Gemma-2B-instruct -q Q4_K_M,Q5_K_M,Q8_0

# Force re-quantization
./scripts/quantize.sh microsoft/DialoGPT-medium --force
```

### Advanced Pipeline Usage

```bash
cd GGUF

# Full pipeline with all quantization levels
python pipeline.py --model_id microsoft/DialoGPT-medium

# Specific quantization levels only
python pipeline.py --model_id microsoft/DialoGPT-medium -q Q4_K_M -q Q8_0

# With perplexity testing
python pipeline.py --model_id microsoft/DialoGPT-medium --perplexity

# For gated models (requires an HF token)
python pipeline.py --model_id meta-llama/Llama-2-7b-hf --hf_token $HF_TOKEN
```

### Perplexity Testing

```bash
# Test all quantized versions
./scripts/perplexity.sh microsoft/DialoGPT-medium

# Force recalculation
./scripts/perplexity.sh microsoft/DialoGPT-medium --force
```

πŸ“ Directory Structure

quantization-utils/
β”œβ”€β”€ πŸ“„ environment.yml          # Conda environment definition
β”œβ”€β”€ 🐍 setup.py                 # Environment setup script
β”œβ”€β”€ πŸ“– README.md                # This file
β”‚
β”œβ”€β”€ πŸ”§ scripts/                 # Convenience scripts
β”‚   β”œβ”€β”€ setup.sh               # Automated setup
β”‚   β”œβ”€β”€ validate.sh             # Environment validation
β”‚   β”œβ”€β”€ quantize.sh             # Quick quantization
β”‚   └── perplexity.sh           # Perplexity testing
β”‚
└── πŸ“¦ GGUF/                    # Main processing directory
    β”œβ”€β”€ 🐍 pipeline.py          # Main pipeline script
    β”œβ”€β”€ 🐍 shared.py            # Shared utilities
    β”œβ”€β”€ πŸ“ models/              # Downloaded models
    β”œβ”€β”€ πŸ“ imatrix/             # Importance matrices
    β”œβ”€β”€ πŸ“ output/              # Final quantized models
    β”œβ”€β”€ πŸ“ resources/           # Calibration data
    β”‚   └── standard_cal_data/
    └── πŸ“ modules/             # Processing modules
        β”œβ”€β”€ convert.py
        β”œβ”€β”€ quantize.py
        β”œβ”€β”€ imatrix.py
        └── perplexity.py

βš™οΈ Configuration Options

Environment Variables

Variable Description Example
HF_TOKEN HuggingFace API token hf_...
CUDA_VISIBLE_DEVICES GPU selection 0,1
OMP_NUM_THREADS CPU threads 8
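A script can pick these up with safe fallbacks via `os.environ`. This is a minimal sketch; the actual pipeline may read its configuration differently.

```python
import os

# Read configuration from the environment with safe fallbacks
# (illustrative; the real pipeline may handle these differently).
hf_token = os.environ.get("HF_TOKEN")  # None if not set (only needed for gated models)
threads = int(os.environ.get("OMP_NUM_THREADS", os.cpu_count() or 1))
```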

### Pipeline Parameters

| Parameter | Description | Default |
|---|---|---|
| `--model_id` | HuggingFace model ID | Required |
| `--quants` | Quantization levels | All levels |
| `--force` | Force reprocessing | `False` |
| `--perplexity` | Run quality tests | `False` |
| `--threads` | Processing threads | CPU count |
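These parameters map naturally onto `argparse`. The sketch below mirrors the table; the flag behavior is assumed from the usage examples, not taken from the actual `pipeline.py` source.

```python
import argparse

# Minimal parser mirroring the parameter table above
# (a sketch; pipeline.py's real parser may differ).
parser = argparse.ArgumentParser(description="GGUF quantization pipeline")
parser.add_argument("--model_id", required=True, help="HuggingFace model ID")
parser.add_argument("-q", "--quants", action="append", default=None,
                    help="Quantization level (repeatable); default: all levels")
parser.add_argument("--force", action="store_true", help="Force reprocessing")
parser.add_argument("--perplexity", action="store_true", help="Run quality tests")
parser.add_argument("--threads", type=int, default=None, help="Processing threads")

args = parser.parse_args(["--model_id", "microsoft/DialoGPT-medium", "-q", "Q4_K_M"])
```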

### Quantization Levels

| Level | Description | Size | Quality |
|---|---|---|---|
| `Q2_K` | 2-bit quantization | Smallest | Good |
| `Q4_K_M` | 4-bit mixed | Balanced | Very good |
| `Q5_K_M` | 5-bit mixed | Larger | Excellent |
| `Q6_K` | 6-bit quantization | Large | Near original |
| `Q8_0` | 8-bit quantization | Largest | Near lossless |
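To get a feel for the size column, file size scales roughly with parameter count times bits per weight. The bits-per-weight figures below are approximate, illustrative values, not exact llama.cpp numbers.

```python
# Rough GGUF file-size estimate: parameters x bits-per-weight / 8.
# The bpw figures are approximate and for illustration only.
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Approximate quantized file size in GB for a model with n_params weights."""
    return n_params * BPW[quant] / 8 / 1e9

# e.g. a 7B-parameter model at Q4_K_M lands in the ~4 GB range
size = estimate_size_gb(7e9, "Q4_K_M")
```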

πŸ› Troubleshooting

Common Issues

Issue Solution
conda: command not found Install Miniconda/Anaconda
llama-quantize: not found Run python setup.py
CUDA out of memory Reduce batch size or use CPU
Permission denied Check file permissions with chmod +x
PackagesNotFoundError Use OS-specific environment file

### Environment Problems

```bash
# Reset environment
conda env remove -n quantization-utils

# Recreate with the OS-specific file
# Linux:
conda env create -f environment-linux.yml
# macOS:
conda env create -f environment-macos.yml

# Reinstall llama.cpp
rm -rf ~/.local/bin/llama-*
python setup.py

# Check installation
./scripts/validate.sh
```

### Binary Issues

```bash
# Manual llama.cpp installation
cd /tmp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX="$HOME/.local"
make -j"$(nproc)"
make install
```

## 🔧 Development

### Adding New Quantization Methods

1. Update `shared.py` with new `Quant` enum values
2. Modify `modules/quantize.py` to handle the new methods
3. Update the pipeline's default quantization list
4. Test with the validation scripts
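The first step might look like the sketch below. The enum name and members are assumptions for illustration; the real definition in `GGUF/shared.py` may be structured differently.

```python
from enum import Enum

# Hypothetical sketch of extending a Quant enum with a new method;
# the actual enum in GGUF/shared.py may differ.
class Quant(str, Enum):
    Q2_K = "Q2_K"
    Q4_K_M = "Q4_K_M"
    Q5_K_M = "Q5_K_M"
    Q6_K = "Q6_K"
    Q8_0 = "Q8_0"
    IQ4_XS = "IQ4_XS"  # newly added quantization method

# String round-trip, as a pipeline parsing "-q IQ4_XS" would need
new_level = Quant("IQ4_XS")
```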

### Custom Calibration Data

Add files to `GGUF/resources/standard_cal_data/`. Files should be UTF-8 text with one sample per line.
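Writing a calibration file in that format can be sketched as follows; the filename `my_domain.txt` and the sample sentences are placeholders.

```python
from pathlib import Path

# Sketch: write calibration samples as UTF-8 text, one sample per line,
# into the directory the README names for calibration data.
cal_dir = Path("GGUF/resources/standard_cal_data")
cal_dir.mkdir(parents=True, exist_ok=True)

samples = [
    "The quick brown fox jumps over the lazy dog.",
    "Quantization trades model size for a small loss in quality.",
]
out = cal_dir / "my_domain.txt"  # placeholder filename
out.write_text("\n".join(samples) + "\n", encoding="utf-8")
```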

## 📈 Performance Tips

| Tip | Description |
|---|---|
| 🚀 GPU usage | Use CUDA/Metal for a 5-10x speedup |
| 💾 Memory | Monitor RAM usage with large models |
| 🔄 Batch size | Adjust based on available memory |
| 📊 Threads | Set to the CPU core count for best CPU performance |

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Test with `./scripts/validate.sh`
4. Submit a pull request

## 📄 License

This project is licensed under the terms specified in the `LICENSE` file.


Ready to quantize? Start with `./scripts/setup.sh` 🚀
