codewithdark-git · May 27, 2025 · May 28, 2025
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,74 @@
+name: QuantLLM CI/CD
+
+on:
+  push:
+    branches: [ main ]
+    tags:
+      - 'v*'
+  pull_request:
+    branches: [ main ]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11"]
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v4
+      with:
+        python-version: ${{ matrix.python-version }}
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install -e .[dev,test,gguf]
+        pip install pytest pytest-cov black isort
+
+    - name: Check code formatting
+      run: |
+        black . --check
+        isort . --check-only
+
+    - name: Run tests
+      run: |
+        pytest tests/ --cov=quantllm --cov-report=xml
+
+    - name: Upload coverage to Codecov
+      uses: codecov/codecov-action@v3
+      with:
+        file: ./coverage.xml
+        fail_ci_if_error: true
+
+  publish:
+    needs: test
+    runs-on: ubuntu-latest
+    if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: "3.10"
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install build twine
+
+    - name: Build package
+      run: python -m build
+
+    - name: Publish to PyPI
+      env:
+        TWINE_USERNAME: __token__
+        TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
+      run: |
+        twine check dist/*
+        twine upload dist/* 
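
Note on the `publish` job above: its `if:` condition means the job runs only for pushes of refs under `refs/tags/v`, and only after `test` succeeds. The condition can be mirrored in Python as a rough illustration. This sketch is not part of the workflow, and the function name `should_publish` is hypothetical. Both sides are lowercased because GitHub Actions expressions ignore case when comparing strings and when evaluating `startsWith`.

```python
def should_publish(event_name: str, ref: str) -> bool:
    """Mirror of: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v').

    Hypothetical helper for illustration only; the real check is evaluated
    by GitHub Actions, which compares strings case-insensitively.
    """
    return event_name.lower() == "push" and ref.lower().startswith("refs/tags/v")


if __name__ == "__main__":
    # A version tag push publishes; a branch push or pull request does not.
    for event, ref in [
        ("push", "refs/tags/v1.2.0"),
        ("push", "refs/heads/main"),
        ("pull_request", "refs/pull/1/merge"),
    ]:
        print(event, ref, should_publish(event, ref))
```

The sketch models only the job-level `if:`. The workflow's `on:` filters (branch `main`, tags `v*`) separately decide whether the workflow starts at all.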
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -0,0 +1,42 @@
+name: Documentation
+
+on:
+  push:
+    branches: [ main ]
+    paths:
+      - 'docs/**'
+      - '.github/workflows/docs.yml'
+  pull_request:
+    branches: [ main ]
+    paths:
+      - 'docs/**'
+
+jobs:
+  docs:
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: "3.10"
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install -e .[docs]
+        pip install sphinx sphinx-rtd-theme
+
+    - name: Build documentation
+      run: |
+        cd docs
+        make html
+
+    - name: Deploy to GitHub Pages
+      if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+      uses: peaceiris/actions-gh-pages@v3
+      with:
+        github_token: ${{ secrets.GITHUB_TOKEN }}
+        publish_dir: ./docs/_build/html 
diff --git a/README.md b/README.md
@@ -1,44 +1,72 @@
-# 🧠 QuantLLM: Lightweight Library for Quantized LLM Fine-Tuning and Deployment
+# 🧠 QuantLLM: Efficient GGUF Model Quantization and Deployment
 
 [![PyPI Downloads](https://static.pepy.tech/badge/quantllm)](https://pepy.tech/projects/quantllm)
 <img alt="PyPI - Version" src="https://img.shields.io/pypi/v/quantllm?logo=pypi&label=version&">
 
-
 ## 📌 Overview
 
-**QuantLLM** is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) **efficiently** using **4-bit and 8-bit quantization** techniques. It provides a modular and flexible framework for:
-
-- **Loading and quantizing models** with advanced configurations
-- **LoRA / QLoRA-based fine-tuning** with customizable parameters
-- **Dataset management** with preprocessing and splitting
-- **Training and evaluation** with comprehensive metrics
-- **Model checkpointing** and versioning
-- **Hugging Face Hub integration** for model sharing
+**QuantLLM** is a Python library for quantizing large language models and exporting them to the GGUF file format. It provides a framework for converting and deploying models with a reduced memory footprint. Key capabilities include:
 
-The goal of QuantLLM is to **democratize LLM training**, especially in low-resource environments, while keeping the workflow intuitive, modular, and production-ready.
+- **Memory-efficient GGUF quantization** with multiple precision options (2-bit to 8-bit)
+- **Chunk-based processing** for handling large models
+- **Comprehensive benchmarking** tools
+- **Detailed progress tracking** with memory statistics
+- **Easy model export** and deployment
 
 ## 🎯 Key Features
 
 | Feature                          | Description |
 |----------------------------------|-------------|
-| ✅ Quantized Model Loading       | Load HuggingFace models with various quantization techniques (including AWQ, GPTQ, GGUF) in 4-bit or 8-bit precision, featuring customizable settings. |
-| ✅ Advanced Dataset Management   | Load, preprocess, and split datasets with flexible configurations |
-| ✅ LoRA / QLoRA Fine-Tuning      | Memory-efficient fine-tuning with customizable LoRA parameters |
-| ✅ Comprehensive Training        | Advanced training loop with mixed precision, gradient accumulation, and early stopping |
-| ✅ Model Evaluation             | Flexible evaluation with custom metrics and batch processing |
-| ✅ Checkpoint Management        | Save, resume, and manage training checkpoints with versioning |
-| ✅ Hub Integration              | Push models and checkpoints to Hugging Face Hub with authentication |
-| ✅ Configuration Management     | YAML/JSON config support for reproducible experiments |
-| ✅ Logging and Monitoring       | Comprehensive logging and Weights & Biases integration |
+| ✅ Multiple GGUF Types          | Support for various GGUF quantization types (Q2_K to Q8_0) with different precision-size tradeoffs |
+| ✅ Memory Optimization          | Chunk-based processing and CPU offloading for efficient handling of large models |
+| ✅ Progress Tracking            | Detailed layer-wise progress with memory statistics and ETA |
+| ✅ Benchmarking Tools           | Comprehensive benchmarking suite for performance evaluation |
+| ✅ Hardware Optimization        | Automatic device selection and memory management |
+| ✅ Easy Deployment              | Simple conversion to GGUF format for deployment |
+| ✅ Flexible Configuration       | Customizable quantization parameters and processing options |
 
 ## 🚀 Getting Started
 
 ### Installation
 
+Basic installation:
 ```bash
 pip install quantllm
 ```
 
+With GGUF support (recommended):
+```bash
+pip install "quantllm[gguf]"
+```
+
+### Quick Example
+
+```python
+from quantllm import QuantLLM
+from transformers import AutoTokenizer
+
+# Load tokenizer and prepare data
+model_name = "facebook/opt-125m"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+calibration_text = ["Example text for calibration."] * 10
+calibration_data = tokenizer(calibration_text, return_tensors="pt", padding=True)["input_ids"]
+
+# Quantize model
+quantized_model, benchmark_results = QuantLLM.quantize_from_pretrained(
+    model_name_or_path=model_name,
+    bits=4,                    # Quantization bits (2-8)
+    group_size=32,            # Group size for quantization
+    quant_type="Q4_K_M",      # GGUF quantization type
+    calibration_data=calibration_data,
+    benchmark=True,           # Run benchmarks
+    benchmark_input_shape=(1, 32)
+)
+
+# Save and convert to GGUF
+QuantLLM.save_quantized_model(model=quantized_model, output_path="quantized_model")
+QuantLLM.convert_to_gguf(model=quantized_model, output_path="model.gguf")
+```
+
 For detailed usage examples and API documentation, please refer to our:
 - 📚 [Official Documentation](https://quantllm.readthedocs.io/)
 - 🎓 [Tutorials](https://quantllm.readthedocs.io/tutorials/)
@@ -48,39 +76,41 @@ For detailed usage examples and API documentation, please refer to our:
 
 ### Minimum Requirements
 - **CPU**: 4+ cores
-- **RAM**: 16GB
-- **Storage**: 20GB free space
-- **Python**: 3.8+
+- **RAM**: 16GB+
+- **Storage**: 10GB+ free space
+- **Python**: 3.10+
 
-### Recommended Requirements
+### Recommended for Large Models
+- **CPU**: 8+ cores
+- **RAM**: 32GB+
 - **GPU**: NVIDIA GPU with 8GB+ VRAM
-- **RAM**: 32GB
-- **Storage**: 50GB+ SSD
 - **CUDA**: 11.7+
+- **Storage**: 20GB+ free space
+
+### GGUF Quantization Types
 
-### Resource Usage Guidelines
-| Model Size | 4-bit (GPU RAM) | 8-bit (GPU RAM) | CPU RAM (min) |
-|------------|----------------|-----------------|---------------|
-| 3B params  | ~6GB          | ~9GB           | 16GB         |
-| 7B params  | ~12GB         | ~18GB          | 32GB         |
-| 13B params | ~20GB         | ~32GB          | 64GB         |
-| 70B params | ~90GB         | ~140GB         | 256GB        |
+| Type   | Bits | Description         | Use Case                 |
+|--------|------|---------------------|--------------------------|
+| Q2_K   | 2    | Extreme compression | Size-critical deployment |
+| Q3_K_S | 3    | Small size          | Limited storage          |
+| Q4_K_M | 4    | Balanced quality    | General use              |
+| Q5_K_M | 5    | Higher quality      | Quality-sensitive tasks  |
+| Q8_0   | 8    | Best quality        | Accuracy-critical tasks  |
 
 ## 🔄 Version Compatibility
 
 | QuantLLM | Python | PyTorch | Transformers | CUDA  |
 |----------|--------|----------|--------------|-------|
-| latest    | ≥3.10   | ≥2.0.0   | ≥4.30.0     | ≥11.7 |
+| 1.2.0    | ≥3.10  | ≥2.0.0   | ≥4.30.0     | ≥11.7 |
 
 ## 🗺 Roadmap
 
-- [ ] Multi-GPU training support
-- [ ] AutoML for hyperparameter tuning
-- [ ] Integration of additional advanced quantization algorithms and techniques.
-- [ ] Custom model architecture support
-- [ ] Enhanced logging and visualization
-- [ ] Model compression techniques
-- [ ] Deployment optimizations
+- [ ] Support for more GGUF model architectures
+- [ ] Enhanced benchmarking capabilities
+- [ ] Multi-GPU processing support
+- [ ] Advanced memory optimization techniques
+- [ ] Integration with more deployment platforms
+- [ ] Custom quantization kernels
 
 ## 🤝 Contributing
 
@@ -92,14 +122,12 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
 
 ## 🙏 Acknowledgments
 
-- [HuggingFace](https://huggingface.co/) for their amazing Transformers library
-- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) for quantization
-- [PEFT](https://github.com/huggingface/peft) for parameter-efficient fine-tuning
-- [Weights & Biases](https://wandb.ai/) for experiment tracking
+- [llama.cpp](https://github.com/ggerganov/llama.cpp) for GGUF format
+- [HuggingFace](https://huggingface.co/) for Transformers library
+- [CTransformers](https://github.com/marella/ctransformers) for GGUF support
 
 ## 📫 Contact & Support
 
-- GitHub Issues: [Create an issue](https://github.com/yourusername/QuantLLM/issues)
+- GitHub Issues: [Create an issue](https://github.com/codewithdark-git/QuantLLM/issues)
 - Documentation: [Read the docs](https://quantllm.readthedocs.io/)
-- Discord: [Join our community](https://discord.gg/quantllm)
-- Email: [email protected]
+- Email: [email protected]
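
Note on the "GGUF Quantization Types" table in the README diff above: the size tradeoff it describes can be approximated with simple arithmetic, parameters times bits per weight divided by eight. The sketch below assumes the nominal bit widths from the table. The names `NOMINAL_BITS` and `estimate_size_gb` are hypothetical and not part of QuantLLM. Real GGUF files come out somewhat larger than this lower bound, because block scales and file metadata are stored alongside the quantized weights.

```python
# Nominal bits per weight, copied from the README table.
# Assumption: these are treated as exact, which understates real file sizes.
NOMINAL_BITS = {"Q2_K": 2, "Q3_K_S": 3, "Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8}


def estimate_size_gb(num_params: float, quant_type: str) -> float:
    """Lower-bound size in GB (10^9 bytes) for the quantized weights alone."""
    bits = NOMINAL_BITS[quant_type]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    # Example: a model with 7 billion parameters.
    for quant_type in NOMINAL_BITS:
        size = estimate_size_gb(7e9, quant_type)
        print(f"{quant_type}: at least {size:.2f} GB")
```

Comparing the estimate against available RAM or disk gives a quick first check on which quantization type is feasible before running a full conversion.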
diff --git a/docs/api_reference/model.rst b/docs/api_reference/model.rst