
🚀 QuantLLM v2.0

The Ultra-Fast LLM Quantization & Export Library


Load → Quantize → Finetune → Export — All in One Line


🎉 What's New in v2.0

We're excited to announce QuantLLM v2.0 — a complete redesign focused on simplicity, performance, and developer experience. This release transforms LLM quantization from a complex multi-step process into a single, intuitive workflow.


✨ Key Features

🔥 TurboModel: One API to Rule Them All

Gone are the days of juggling multiple libraries. TurboModel unifies everything:

from quantllm import turbo

# Load any model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text instantly
response = model.generate("Explain quantum computing in simple terms")

# Fine-tune with one line
model.finetune(dataset, epochs=3)

# Export to any format
model.export("gguf", quantization="Q4_K_M")    # → llama.cpp, Ollama, LM Studio
model.export("onnx")                            # → ONNX Runtime, TensorRT
model.export("mlx", quantization="4bit")        # → Apple Silicon
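Putting those pieces together with a real dataset might look like the sketch below; the dataset name, and the assumption that finetune accepts a Hugging Face datasets.Dataset, are illustrative rather than documented guarantees:

from datasets import load_dataset
from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# Any Hub dataset with a text-style column; this particular dataset is just an example
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

# Fine-tune, then export the tuned weights straight to GGUF
model.finetune(dataset, epochs=3)
model.export("gguf", quantization="Q4_K_M")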

📦 Multi-Format Export

Export your models to any deployment target:

Format      | Use Case                     | Platforms
GGUF        | llama.cpp, Ollama, LM Studio | Windows, Linux, macOS
ONNX        | ONNX Runtime, TensorRT       | Cross-platform
MLX         | Apple Silicon optimized      | macOS (M1/M2/M3/M4)
SafeTensors | HuggingFace Transformers     | Cross-platform
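Each row maps to a single export call, roughly as sketched here; the "safetensors" target string is an assumption based on the table, while the other three appear elsewhere in this release:

model.export("gguf", quantization="Q4_K_M")   # llama.cpp, Ollama, LM Studio
model.export("onnx")                          # ONNX Runtime, TensorRT
model.export("mlx", quantization="4bit")      # Apple Silicon
model.export("safetensors")                   # HuggingFace Transformers (assumed target string)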

🎯 Native GGUF Export — No C++ Required!

Forget compiling llama.cpp or wrestling with C++ toolchains:

# Just works™ — on any platform
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
  • ✅ Pure Python implementation
  • ✅ All quantization types: Q2_K → Q8_0, F16, F32
  • ✅ Windows, Linux, macOS — zero configuration
  • ✅ Automatic llama.cpp installation when needed
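Because the quantization types follow llama.cpp naming, producing several size/quality variants is a short loop; this is a sketch reusing the export call shown above:

# Generate multiple GGUF variants at different size/quality trade-offs
for qtype in ["Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"]:
    model.export("gguf", f"model.{qtype}.gguf", quantization=qtype)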

🤗 One-Click Hub Publishing

Push directly to HuggingFace with auto-generated model cards:

model.push(
    "your-username/my-awesome-model",
    format="gguf",
    quantization="Q4_K_M"
)

The auto-generated model card includes:

  • 📋 Proper YAML frontmatter (library_name, tags, base_model)
  • 📖 Format-specific usage examples
  • 🔘 "Use this model" button compatibility
  • 📊 Quantization details and benchmarks
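One way to sanity-check a push is to pull the artifact back down with huggingface_hub; the filename below is an assumption, so check the repo's file list for the actual name:

from huggingface_hub import hf_hub_download

# Download the uploaded GGUF back from the Hub to confirm the push succeeded
path = hf_hub_download(
    repo_id="your-username/my-awesome-model",
    filename="model.Q4_K_M.gguf",  # assumed filename; adjust to the repo's actual file
)
print(path)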

🎨 Beautiful Developer Experience

Themed Progress & Logging

A cohesive orange theme across all interactions:

╔════════════════════════════════════════════════════════════╗
║                                                            ║
║   🚀 QuantLLM v2.0.0                                       ║
║   Ultra-fast LLM Quantization & Export                     ║
║                                                            ║
║   ✓ GGUF  ✓ ONNX  ✓ MLX  ✓ SafeTensors                     ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

SmartConfig Auto-Detection

See exactly what's happening before loading:

┌─────────────────────────────────────────┐
│  📊 Model Analysis                      │
├─────────────────────────────────────────┤
│  Parameters    7.24B                    │
│  Original      14.5 GB                  │
│  Quantized     4.2 GB (71% saved)       │
│  GPU Memory    Available: 24 GB ✓       │
└─────────────────────────────────────────┘
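The size estimate is roughly back-of-the-envelope arithmetic: 7.24B parameters at 2 bytes each in FP16 is about 14.5 GB, while ~4-bit weights plus some layers kept in higher precision land near 4.2 GB (the exact per-layer policy is an assumption):

params = 7.24e9
original_gb = params * 2 / 1e9     # FP16: 2 bytes per parameter -> ~14.5 GB
quantized_gb = 4.2                 # reported size after 4-bit quantization
savings = 1 - quantized_gb / original_gb
print(f"{original_gb:.1f} GB -> {quantized_gb} GB ({savings:.0%} saved)")  # 14.5 GB -> 4.2 GB (71% saved)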

Clean Console Output

  • 🔇 Suppressed HuggingFace/Datasets noise
  • 📊 Rich progress bars with ETA
  • ✅ Clear success/error indicators
  • 🎯 Actionable error messages
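For reference, this is the kind of suppression the standard libraries expose; QuantLLM's exact internals may differ, so treat this as a sketch of the mechanism rather than its implementation:

from transformers.utils import logging as hf_logging
from datasets.utils import logging as ds_logging

hf_logging.set_verbosity_error()  # hide transformers info/warning chatter
ds_logging.set_verbosity_error()  # hide datasets info/warning chatter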

⚡ Performance Optimizations

Feature                | Improvement
torch.compile          | Up to 2x faster training
Dynamic Padding        | 30-50% less VRAM usage
Flash Attention 2      | Auto-enabled when available
Gradient Checkpointing | Automatic for large models
Memory Optimization    | expandable_segments prevents OOM
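These correspond to standard PyTorch/Transformers settings; TurboModel applies equivalents automatically, so the sketch below only shows what the table refers to (the model name and dtype are placeholders):

import os
import torch
from transformers import AutoModelForCausalLM

# Reduce allocator-fragmentation OOMs (set before the first CUDA allocation)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # used when flash-attn is installed
)
model.gradient_checkpointing_enable()  # trade compute for memory on large models
model = torch.compile(model)           # fuse kernels for faster training steps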

πŸ› Bug Fixes

  • FIXED: TypeError: object of type 'generator' has no len() during GGUF export
  • FIXED: ValueError: model did not return a loss with proper DataCollatorForLanguageModeling
  • FIXED: AttributeError when using SmartConfig with torch.dtype objects
  • FIXED: BitsAndBytes models now properly dequantize before GGUF conversion
  • FIXED: ONNX export now uses Optimum for correct graph conversion
  • CHANGED: WandB disabled by default (enable with WANDB_DISABLED="false")
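To re-enable experiment tracking, set the environment variable before fine-tuning:

import os

os.environ["WANDB_DISABLED"] = "false"  # opt back in to Weights & Biases logging
model.finetune(dataset, epochs=3)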

📦 Installation

# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With ONNX support
pip install "quantllm[onnx]"

# With MLX support (Apple Silicon)
pip install "quantllm[mlx]"

# Full installation (all features)
pip install "quantllm[full]"

🚀 Quick Start

from quantllm import turbo

# Load with automatic 4-bit quantization
model = turbo("meta-llama/Llama-3.2-3B")

# Chat
print(model.generate("What is machine learning?"))

# Export to GGUF for Ollama/llama.cpp
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Or push directly to HuggingFace
model.push("username/my-model", format="gguf", quantization="Q4_K_M")

📚 Documentation


πŸ™ Acknowledgments

Special thanks to the open-source community and all contributors who made this release possible.


Made with 🧑 by Dark Coder

⭐ Star on GitHub · 🐛 Report Bug · 💖 Sponsor


Happy Quantizing! 🚀