# QuantLLM v2.0

The Ultra-Fast LLM Quantization & Export Library

Load → Quantize → Finetune → Export: All in One Line
## What's New in v2.0

We're excited to announce QuantLLM v2.0, a complete redesign focused on simplicity, performance, and developer experience. This release transforms LLM quantization from a complex multi-step process into a single, intuitive workflow.
## ✨ Key Features

### 🔥 TurboModel: One API to Rule Them All

Gone are the days of juggling multiple libraries. TurboModel unifies everything:
```python
from quantllm import turbo

# Load any model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text instantly
response = model.generate("Explain quantum computing in simple terms")

# Fine-tune with one line
model.finetune(dataset, epochs=3)

# Export to any format
model.export("gguf", quantization="Q4_K_M")  # → llama.cpp, Ollama, LM Studio
model.export("onnx")                         # → ONNX Runtime, TensorRT
model.export("mlx", quantization="4bit")     # → Apple Silicon
```

### 📦 Multi-Format Export
Export your models to any deployment target:
| Format | Use Case | Platforms |
|---|---|---|
| GGUF | llama.cpp, Ollama, LM Studio | Windows, Linux, macOS |
| ONNX | ONNX Runtime, TensorRT | Cross-platform |
| MLX | Apple Silicon optimized | macOS (M1/M2/M3/M4) |
| SafeTensors | HuggingFace Transformers | Cross-platform |
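As a quick illustration, all four targets go through the same `export()` call shown above. This is only a sketch: the `"safetensors"` format string and the output file names below are assumptions, not confirmed API.

```python
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")  # llama.cpp, Ollama, LM Studio
model.export("onnx", "model-onnx")                                # ONNX Runtime, TensorRT
model.export("mlx", "model-mlx", quantization="4bit")             # Apple Silicon
model.export("safetensors", "model-safetensors")                  # assumed format string; HF Transformers
```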
### 🎯 Native GGUF Export: No C++ Required!

Forget compiling llama.cpp or wrestling with C++ toolchains:
```python
# Just works™ on any platform
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
```

- ✅ Pure Python implementation
- ✅ All quantization types: `Q2_K`–`Q8_0`, `F16`, `F32`
- ✅ Windows, Linux, macOS: zero configuration
- ✅ Automatic llama.cpp installation when needed
### 🤗 One-Click Hub Publishing

Push directly to HuggingFace with auto-generated model cards:
```python
model.push(
    "your-username/my-awesome-model",
    format="gguf",
    quantization="Q4_K_M"
)
```

Auto-generated features:

- Proper YAML frontmatter (`library_name`, `tags`, `base_model`)
- Format-specific usage examples
- "Use this model" button compatibility
- Quantization details and benchmarks
### 🎨 Beautiful Developer Experience

#### Themed Progress & Logging

A cohesive orange theme across all interactions:
```
┌────────────────────────────────────────────┐
│                                            │
│   QuantLLM v2.0.0                          │
│   Ultra-fast LLM Quantization & Export     │
│                                            │
│   GGUF │ ONNX │ MLX │ SafeTensors          │
│                                            │
└────────────────────────────────────────────┘
```
#### SmartConfig Auto-Detection

See exactly what's happening before loading:
```
┌─────────────────────────────────────────┐
│  Model Analysis                         │
├─────────────────────────────────────────┤
│  Parameters    7.24B                    │
│  Original      14.5 GB                  │
│  Quantized     4.2 GB  (71% saved)      │
│  GPU Memory    Available: 24 GB ✓       │
└─────────────────────────────────────────┘
```
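The savings figure follows from simple arithmetic. Here is a rough back-of-the-envelope check; the ~4.6 bits per weight is an assumed average for 4-bit weights plus quantization scales, not a documented QuantLLM constant.

```python
params  = 7.24e9
fp16_gb = params * 2 / 1e9          # 2 bytes per weight in FP16   -> ~14.5 GB
q4_gb   = params * 4.64 / 8 / 1e9   # ~4.6 bits/weight incl. scales -> ~4.2 GB
saved   = 1 - q4_gb / fp16_gb       # ~0.71, i.e. the "71% saved" shown above
print(f"{fp16_gb:.1f} GB -> {q4_gb:.1f} GB ({saved:.0%} saved)")
```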
#### Clean Console Output

- Suppressed HuggingFace/Datasets noise
- Rich progress bars with ETA
- Clear success/error indicators
- Actionable error messages
## ⚡ Performance Optimizations

| Feature | Improvement |
|---|---|
| `torch.compile` | Up to 2x faster training |
| Dynamic Padding | 30-50% less VRAM usage |
| Flash Attention 2 | Auto-enabled when available |
| Gradient Checkpointing | Automatic for large models |
| Memory Optimization | `expandable_segments` prevents OOM |
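For context, most of these correspond to well-known PyTorch/Transformers switches. Below is a minimal sketch of the standard equivalents, shown with plain Hugging Face/PyTorch APIs for illustration; it is not QuantLLM's internal code, and QuantLLM applies the equivalents automatically.

```python
import os
import torch
from transformers import AutoModelForCausalLM

# Allocator hint that reduces fragmentation-related CUDA OOMs
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # used when flash-attn is installed
)
model.gradient_checkpointing_enable()  # trades recompute for memory on large models
model = torch.compile(model)           # graph/kernel optimization, up to ~2x faster training
```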
## Bug Fixes

- FIXED: `TypeError: object of type 'generator' has no len()` during GGUF export
- FIXED: `ValueError: model did not return a loss` by using a proper `DataCollatorForLanguageModeling` (see the sketch after this list)
- FIXED: `AttributeError` when using `SmartConfig` with `torch.dtype` objects
- FIXED: BitsAndBytes models now properly dequantize before GGUF conversion
- FIXED: ONNX export now uses Optimum for correct graph conversion
- CHANGED: WandB disabled by default (enable with `WANDB_DISABLED="false"`)
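On the loss fix: that error typically appears when no `labels` reach the model during causal-LM fine-tuning. A minimal sketch of the standard collator setup, shown with the plain Transformers API for illustration (QuantLLM wires this up internally):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # Llama has no pad token

# mlm=False copies input_ids into labels, so the model returns a causal-LM loss
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```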
## 📦 Installation

```bash
# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With ONNX support
pip install "quantllm[onnx]"

# With MLX support (Apple Silicon)
pip install "quantllm[mlx]"

# Full installation (all features)
pip install "quantllm[full]"
```

## Quick Start
```python
from quantllm import turbo

# Load with automatic 4-bit quantization
model = turbo("meta-llama/Llama-3.2-3B")

# Chat
print(model.generate("What is machine learning?"))

# Export to GGUF for Ollama/llama.cpp
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Or push directly to HuggingFace
model.push("username/my-model", format="gguf", quantization="Q4_K_M")
```

## Documentation
## Acknowledgments

Special thanks to the open-source community and all contributors who made this release possible.

Made with 🧡 by Dark Coder
Star on GitHub · Report Bug · Sponsor

Happy Quantizing!