
🚀 QuantLLM v2.0

The Ultra-Fast LLM Quantization & Export Library


Load → Quantize → Finetune → Export — All in One Line


🎉 What's New in v2.0

We're excited to announce QuantLLM v2.0 — a complete redesign focused on simplicity, performance, and developer experience. This release transforms LLM quantization from a complex multi-step process into a single, intuitive workflow.


✨ Key Features

🔥 TurboModel: One API to Rule Them All

Gone are the days of juggling multiple libraries. TurboModel unifies everything:

from quantllm import turbo

# Load any model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text instantly
response = model.generate("Explain quantum computing in simple terms")

# Fine-tune with one line
model.finetune(dataset, epochs=3)

# Export to any format
model.export("gguf", quantization="Q4_K_M")    # → llama.cpp, Ollama, LM Studio
model.export("onnx")                            # → ONNX Runtime, TensorRT
model.export("mlx", quantization="4bit")        # → Apple Silicon
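Putting those pieces together with a real dataset might look like the sketch below; the dataset name, and the assumption that finetune accepts a Hugging Face datasets.Dataset, are illustrative rather than documented guarantees:

from datasets import load_dataset
from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# Any Hub dataset with a text-style column; this particular dataset is just an example
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

# Fine-tune, then export the tuned weights straight to GGUF
model.finetune(dataset, epochs=3)
model.export("gguf", quantization="Q4_K_M")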

📦 Multi-Format Export

Export your models to any deployment target:

Format      | Use Case                     | Platforms
GGUF        | llama.cpp, Ollama, LM Studio | Windows, Linux, macOS
ONNX        | ONNX Runtime, TensorRT       | Cross-platform
MLX         | Apple Silicon optimized      | macOS (M1/M2/M3/M4)
SafeTensors | HuggingFace Transformers     | Cross-platform
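Each row maps to a single export call, roughly as sketched here; the "safetensors" target string is an assumption based on the table, while the other three appear elsewhere in this release:

model.export("gguf", quantization="Q4_K_M")   # llama.cpp, Ollama, LM Studio
model.export("onnx")                          # ONNX Runtime, TensorRT
model.export("mlx", quantization="4bit")      # Apple Silicon
model.export("safetensors")                   # HuggingFace Transformers (assumed target string)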

🎯 Native GGUF Export — No C++ Required!

Forget compiling llama.cpp or wrestling with C++ toolchains:

# Just works™ — on any platform
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
  • ✅ Pure Python implementation
  • ✅ All quantization types: Q2_K → Q8_0, F16, F32
  • ✅ Windows, Linux, macOS — zero configuration
  • ✅ Automatic llama.cpp installation when needed
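Because the quantization types follow llama.cpp naming, producing several size/quality variants is a short loop; this is a sketch reusing the export call shown above:

# Generate multiple GGUF variants at different size/quality trade-offs
for qtype in ["Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"]:
    model.export("gguf", f"model.{qtype}.gguf", quantization=qtype)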

🤗 One-Click Hub Publishing

Push directly to HuggingFace with auto-generated model cards:

model.push(
    "your-username/my-awesome-model",
    format="gguf",
    quantization="Q4_K_M"
)

The auto-generated model card includes:

  • 📋 Proper YAML frontmatter (library_name, tags, base_model)
  • 📖 Format-specific usage examples
  • 🔘 "Use this model" button compatibility
  • 📊 Quantization details and benchmarks
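One way to sanity-check a push is to pull the artifact back down with huggingface_hub; the filename below is an assumption, so check the repo's file list for the actual name:

from huggingface_hub import hf_hub_download

# Download the uploaded GGUF back from the Hub to confirm the push succeeded
path = hf_hub_download(
    repo_id="your-username/my-awesome-model",
    filename="model.Q4_K_M.gguf",  # assumed filename; adjust to the repo's actual file
)
print(path)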

🎨 Beautiful Developer Experience

Themed Progress & Logging

A cohesive orange theme across all interactions:

╔════════════════════════════════════════════════════════════╗
║                                                            ║
║   🚀 QuantLLM v2.0.0                                       ║
║   Ultra-fast LLM Quantization & Export                     ║
║                                                            ║
║   ✓ GGUF  ✓ ONNX  ✓ MLX  ✓ SafeTensors                     ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

SmartConfig Auto-Detection

See exactly what's happening before loading:

┌─────────────────────────────────────────┐
│  📊 Model Analysis                      │
├─────────────────────────────────────────┤
│  Parameters    7.24B                    │
│  Original      14.5 GB                  │
│  Quantized     4.2 GB (71% saved)       │
│  GPU Memory    Available: 24 GB ✓       │
└─────────────────────────────────────────┘
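The size estimate is roughly back-of-the-envelope arithmetic: 7.24B parameters at 2 bytes each in FP16 is about 14.5 GB, while ~4-bit weights plus some layers kept in higher precision land near 4.2 GB (the exact per-layer policy is an assumption):

params = 7.24e9
original_gb = params * 2 / 1e9     # FP16: 2 bytes per parameter -> ~14.5 GB
quantized_gb = 4.2                 # reported size after 4-bit quantization
savings = 1 - quantized_gb / original_gb
print(f"{original_gb:.1f} GB -> {quantized_gb} GB ({savings:.0%} saved)")  # 14.5 GB -> 4.2 GB (71% saved)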

Clean Console Output

  • 🔇 Suppressed HuggingFace/Datasets noise
  • 📊 Rich progress bars with ETA
  • ✅ Clear success/error indicators
  • 🎯 Actionable error messages
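For reference, this is the kind of suppression the standard libraries expose; QuantLLM's exact internals may differ, so treat this as a sketch of the mechanism rather than its implementation:

from transformers.utils import logging as hf_logging
from datasets.utils import logging as ds_logging

hf_logging.set_verbosity_error()  # hide transformers info/warning chatter
ds_logging.set_verbosity_error()  # hide datasets info/warning chatter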

⚡ Performance Optimizations

Feature                | Improvement
torch.compile          | Up to 2x faster training
Dynamic Padding        | 30-50% less VRAM usage
Flash Attention 2      | Auto-enabled when available
Gradient Checkpointing | Automatic for large models
Memory Optimization    | expandable_segments prevents OOM
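These correspond to standard PyTorch/Transformers settings; TurboModel applies equivalents automatically, so the sketch below only shows what the table refers to (the model name and dtype are placeholders):

import os
import torch
from transformers import AutoModelForCausalLM

# Reduce allocator-fragmentation OOMs (set before the first CUDA allocation)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # used when flash-attn is installed
)
model.gradient_checkpointing_enable()  # trade compute for memory on large models
model = torch.compile(model)           # fuse kernels for faster training steps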

πŸ› Bug Fixes

  • FIXED: TypeError: object of type 'generator' has no len() during GGUF export
  • FIXED: ValueError: model did not return a loss with proper DataCollatorForLanguageModeling
  • FIXED: AttributeError when using SmartConfig with torch.dtype objects
  • FIXED: BitsAndBytes models now properly dequantize before GGUF conversion
  • FIXED: ONNX export now uses Optimum for correct graph conversion
  • CHANGED: WandB disabled by default (enable with WANDB_DISABLED="false")
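To re-enable experiment tracking, set the environment variable before fine-tuning:

import os

os.environ["WANDB_DISABLED"] = "false"  # opt back in to Weights & Biases logging
model.finetune(dataset, epochs=3)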

📦 Installation

# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With ONNX support
pip install "quantllm[onnx]"

# With MLX support (Apple Silicon)
pip install "quantllm[mlx]"

# Full installation (all features)
pip install "quantllm[full]"

🚀 Quick Start

from quantllm import turbo

# Load with automatic 4-bit quantization
model = turbo("meta-llama/Llama-3.2-3B")

# Chat
print(model.generate("What is machine learning?"))

# Export to GGUF for Ollama/llama.cpp
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Or push directly to HuggingFace
model.push("username/my-model", format="gguf", quantization="Q4_K_M")

📚 Documentation


πŸ™ Acknowledgments

Special thanks to the open-source community and all contributors who made this release possible.


Made with 🧑 by Dark Coder

⭐ Star on GitHub · 🐛 Report Bug · 💖 Sponsor


Happy Quantizing! 🚀