QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques.

QuantLLM Logo

🚀 QuantLLM v2.0

The Ultra-Fast LLM Quantization & Export Library


Load → Quantize → Fine-tune → Export – All in One Line

Quick Start • Features • Export Formats • Examples • Documentation


🎯 Why QuantLLM?

❌ Without QuantLLM

# 50+ lines of configuration...
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    # ... more config
)
# Then llama.cpp compilation for GGUF...
# Then manual tensor conversion...

✅ With QuantLLM

from quantllm import turbo

# One line does everything
model = turbo("meta-llama/Llama-3-8B")

# Generate
print(model.generate("Hello!"))

# Fine-tune
model.finetune(dataset, epochs=3)

# Export to any format
model.export("gguf", quantization="Q4_K_M")

⚡ Quick Start

Installation

# Recommended installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With all export formats
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"

Your First Model

from quantllm import turbo

# Load any model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text
response = model.generate("Explain quantum computing simply")
print(response)

# Export to GGUF for Ollama/llama.cpp
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

QuantLLM automatically (see the sketch below):

  • ✅ Detects your GPU and available memory
  • ✅ Applies optimal 4-bit quantization
  • ✅ Enables Flash Attention 2 when available
  • ✅ Configures memory management

✨ Features

🔥 TurboModel API

# One unified API for everything
model = turbo("mistralai/Mistral-7B")
model.generate("Hello!")
model.finetune(data, epochs=3)
model.export("gguf", quantization="Q4_K_M")
model.push("user/repo", format="gguf")

⚡ Performance

  • Flash Attention 2 – Auto-enabled (a manual-setup sketch follows this list)
  • torch.compile – 2x faster training
  • Dynamic Padding – 50% less VRAM
  • Triton Kernels – Fused operations

🧠 45+ Model Architectures

Llama 2/3, Mistral, Mixtral, Qwen 1/2, Phi 1/2/3, Gemma, Falcon, DeepSeek, Yi, StarCoder, ChatGLM, InternLM, Baichuan, StableLM, BLOOM, OPT, MPT, GPT-NeoX...

📦 Multi-Format Export

  • GGUF – llama.cpp, Ollama, LM Studio
  • ONNX – ONNX Runtime, TensorRT
  • MLX – Apple Silicon (M1/M2/M3/M4)
  • SafeTensors – HuggingFace

🎨 Beautiful UI

╔════════════════════════════════════╗
║  🚀 QuantLLM v2.0                  ║
║  ✓ GGUF  ✓ ONNX  ✓ MLX             ║
╚════════════════════════════════════╝

📊 Model: meta-llama/Llama-3.2-3B
   Parameters: 3.21B
   Memory: 6.4 GB → 1.9 GB (70% saved)

🤗 One-Click Hub Publishing

# Auto-generates model cards with:
# - YAML frontmatter
# - Usage examples  
# - "Use this model" button

model.push("user/my-model", format="gguf")

📦 Export Formats

Export to any deployment target with a single line:

from quantllm import turbo

model = turbo("microsoft/phi-3-mini")

# GGUF – For llama.cpp, Ollama, LM Studio
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# ONNX – For ONNX Runtime, TensorRT
model.export("onnx", "./model-onnx/")

# MLX – For Apple Silicon Macs
model.export("mlx", "./model-mlx/", quantization="4bit")

# SafeTensors – For HuggingFace
model.export("safetensors", "./model-hf/")

GGUF Quantization Types

Type      Bits     Quality      Use Case
Q2_K      2-bit    Low          Minimum size
Q3_K_M    3-bit    Fair         Very constrained
Q4_K_M    4-bit    Good         Recommended ⭐
Q5_K_M    5-bit    High         Quality-focused
Q6_K      6-bit    Very High    Near-original
Q8_0      8-bit    Excellent    Best quality
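
Because the export call above takes the quantization type as a string, producing several of these trade-offs from one model is just a loop. The output filenames below are illustrative.

from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# One GGUF file per quantization level from the table above.
for quant in ["Q4_K_M", "Q5_K_M", "Q8_0"]:
    model.export("gguf", f"model.{quant}.gguf", quantization=quant)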

🎮 Examples

Chat with Any Model

from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# Simple generation
response = model.generate(
    "Write a Python function for fibonacci",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)

# Chat format
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)

Load GGUF Models from HuggingFace

from quantllm import TurboModel

# Load any GGUF model directly
model = TurboModel.from_gguf(
    "TheBloke/Llama-2-7B-Chat-GGUF", 
    filename="llama-2-7b-chat.Q4_K_M.gguf"
)

print(model.generate("Hello!"))

Fine-Tune with Your Data

from quantllm import turbo

model = turbo("mistralai/Mistral-7B")

# Simple – everything auto-configured
model.finetune("training_data.json", epochs=3)

# Advanced – full control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=32,
    lora_alpha=64,
    batch_size=4,
)

Supported data formats (see the loading sketch below):

[
  {"instruction": "What is Python?", "output": "Python is..."},
  {"text": "Full text for language modeling"},
  {"prompt": "Question", "completion": "Answer"}
]
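
These layouts can also be loaded with the Hugging Face datasets library first. The sketch below assumes model.finetune accepts a datasets.Dataset object, as the earlier model.finetune(dataset, epochs=3) example suggests.

from datasets import load_dataset

from quantllm import turbo

model = turbo("mistralai/Mistral-7B")

# Load a local JSON file in one of the supported layouts.
dataset = load_dataset("json", data_files="training_data.json", split="train")

# Assumption: finetune accepts a datasets.Dataset as well as a file path.
model.finetune(dataset, epochs=3)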

Push to HuggingFace Hub

from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# Push with auto-generated model card
model.push(
    "your-username/my-model",
    format="gguf",
    quantization="Q4_K_M",
    license="apache-2.0"
)

The model card includes:

  • โœ… Proper YAML frontmatter (library_name, tags, base_model)
  • โœ… Format-specific usage examples
  • โœ… "Use this model" button compatibility
  • โœ… Quantization details

💻 Hardware Requirements

Configuration    GPU VRAM    Models
🟢 Entry         6-8 GB      1-7B (4-bit)
🟡 Mid-Range     12-24 GB    7-30B (4-bit)
🔴 High-End      24-80 GB    70B+

Tested GPUs: RTX 3060/3070/3080/3090/4070/4080/4090, A100, H100, Apple M1/M2/M3/M4
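
To see which tier your machine falls into, a quick check with plain PyTorch (not a QuantLLM API) is enough:

import torch

# Report the GPU name and total VRAM to match against the table above.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
elif torch.backends.mps.is_available():
    print("Apple Silicon (MPS) detected; unified memory is shared with the system.")
else:
    print("No GPU detected; expect CPU-only inference.")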


📦 Installation Options

# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With specific features
pip install "quantllm[gguf]"     # GGUF export
pip install "quantllm[onnx]"     # ONNX export  
pip install "quantllm[mlx]"      # MLX export (Apple Silicon)
pip install "quantllm[triton]"   # Triton kernels
pip install "quantllm[full]"     # Everything

๐Ÿ—๏ธ Architecture

quantllm/
├── core/                    # Core functionality
│   ├── turbo_model.py       # TurboModel unified API
│   ├── smart_config.py      # Auto-configuration
│   └── export.py            # Universal exporter
├── quant/                   # Quantization
│   └── llama_cpp.py         # GGUF conversion
├── hub/                     # HuggingFace integration
│   ├── hub_manager.py       # Push/pull models
│   └── model_card.py        # Auto model cards
├── kernels/                 # Custom kernels
│   └── triton/              # Fused operations
└── utils/                   # Utilities
    └── progress.py          # Beautiful UI

๐Ÿค Contributing

git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM
pip install -e ".[dev]"
pytest

Areas for contribution:

  • ๐Ÿ†• New model architectures
  • ๐Ÿ”ง Performance optimizations
  • ๐Ÿ“š Documentation
  • ๐Ÿ› Bug fixes

📜 License

MIT License – see LICENSE for details.


Made with 🧡 by Dark Coder

⭐ Star on GitHub • 🐛 Report Bug • 💖 Sponsor

Happy Quantizing! 🚀
