- # 🧠 QuantLLM: Lightweight Library for Quantized LLM Fine-Tuning and Deployment
+ # 🧠 QuantLLM: Efficient GGUF Model Quantization and Deployment

[![PyPI Downloads](https://static.pepy.tech/badge/quantllm)](https://pepy.tech/projects/quantllm)
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/quantllm?logo=pypi&label=version&">

-
## 📌 Overview

- **QuantLLM** is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) **efficiently** using **4-bit and 8-bit quantization** techniques. It provides a modular and flexible framework for:
-
- - **Loading and quantizing models** with advanced configurations
- - **LoRA / QLoRA-based fine-tuning** with customizable parameters
- - **Dataset management** with preprocessing and splitting
- - **Training and evaluation** with comprehensive metrics
- - **Model checkpointing** and versioning
- - **Hugging Face Hub integration** for model sharing
+ **QuantLLM** is a Python library for efficient model quantization using the GGUF (GGML Universal Format) method. It provides a robust framework for converting and deploying large language models with a minimal memory footprint and strong performance. Key capabilities include:

- The goal of QuantLLM is to **democratize LLM training**, especially in low-resource environments, while keeping the workflow intuitive, modular, and production-ready.
+ - **Memory-efficient GGUF quantization** with multiple precision options (2-bit to 8-bit); see the conceptual sketch after this list
+ - **Chunk-based processing** for handling large models
+ - **Comprehensive benchmarking** tools
+ - **Detailed progress tracking** with memory statistics
+ - **Easy model export** and deployment

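+ The snippet below is a conceptual sketch of group-wise k-bit quantization in plain PyTorch. It is illustrative only, not QuantLLM's internal implementation; it simply shows why the `bits` and `group_size` parameters used later in the Quick Example trade accuracy against size.
+
+ ```python
+ import torch
+
+ def groupwise_quantize(weights: torch.Tensor, bits: int = 4, group_size: int = 32):
+     """Toy symmetric group-wise quantization (illustration, not QuantLLM internals)."""
+     qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for signed 4-bit codes
+     groups = weights.reshape(-1, group_size)          # each group shares one scale
+     scales = groups.abs().max(dim=1, keepdim=True).values.clamp_min(1e-8) / qmax
+     codes = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax)
+     return codes.to(torch.int8), scales               # integer codes + per-group scales
+
+ w = torch.randn(4, 64)
+ codes, scales = groupwise_quantize(w, bits=4, group_size=32)
+ w_hat = (codes.float() * scales).reshape(w.shape)     # approximate reconstruction
+ print("max quantization error:", (w - w_hat).abs().max().item())
+ ```
+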
## 🎯 Key Features

| Feature | Description |
|-----------------------------------|-------------|
- | ✅ Quantized Model Loading | Load HuggingFace models with various quantization techniques (including AWQ, GPTQ, GGUF) in 4-bit or 8-bit precision, featuring customizable settings. |
- | ✅ Advanced Dataset Management | Load, preprocess, and split datasets with flexible configurations |
- | ✅ LoRA / QLoRA Fine-Tuning | Memory-efficient fine-tuning with customizable LoRA parameters |
- | ✅ Comprehensive Training | Advanced training loop with mixed precision, gradient accumulation, and early stopping |
- | ✅ Model Evaluation | Flexible evaluation with custom metrics and batch processing |
- | ✅ Checkpoint Management | Save, resume, and manage training checkpoints with versioning |
- | ✅ Hub Integration | Push models and checkpoints to Hugging Face Hub with authentication |
- | ✅ Configuration Management | YAML/JSON config support for reproducible experiments |
- | ✅ Logging and Monitoring | Comprehensive logging and Weights & Biases integration |
+ | ✅ Multiple GGUF Types | Support for various GGUF quantization types (Q2_K to Q8_0) with different precision-size tradeoffs |
+ | ✅ Memory Optimization | Chunk-based processing and CPU offloading for efficient handling of large models |
+ | ✅ Progress Tracking | Detailed layer-wise progress with memory statistics and ETA |
+ | ✅ Benchmarking Tools | Comprehensive benchmarking suite for performance evaluation |
+ | ✅ Hardware Optimization | Automatic device selection and memory management |
+ | ✅ Easy Deployment | Simple conversion to GGUF format for deployment |
+ | ✅ Flexible Configuration | Customizable quantization parameters and processing options |

## 🚀 Getting Started

### Installation

+ Basic installation:
```bash
pip install quantllm
```

+ With GGUF support (recommended):
+ ```bash
+ pip install quantllm[gguf]
+ ```
+
+ ### Quick Example
+
+ ```python
+ from quantllm import QuantLLM
+ from transformers import AutoTokenizer
+
+ # Load tokenizer and prepare data
+ model_name = "facebook/opt-125m"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ calibration_text = ["Example text for calibration."] * 10
+ calibration_data = tokenizer(calibration_text, return_tensors="pt", padding=True)["input_ids"]
+
+ # Quantize model
+ quantized_model, benchmark_results = QuantLLM.quantize_from_pretrained(
+     model_name_or_path=model_name,
+     bits=4,                          # Quantization bits (2-8)
+     group_size=32,                   # Group size for quantization
+     quant_type="Q4_K_M",             # GGUF quantization type
+     calibration_data=calibration_data,
+     benchmark=True,                  # Run benchmarks
+     benchmark_input_shape=(1, 32)
+ )
+
+ # Save and convert to GGUF
+ QuantLLM.save_quantized_model(model=quantized_model, output_path="quantized_model")
+ QuantLLM.convert_to_gguf(model=quantized_model, output_path="model.gguf")
+ ```
+
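+ Once exported, the GGUF file can be loaded by llama.cpp-compatible runtimes. The sketch below uses the third-party `llama-cpp-python` package rather than QuantLLM itself, and assumes the exported architecture is supported by that runtime:
+
+ ```python
+ # pip install llama-cpp-python
+ from llama_cpp import Llama
+
+ # Load the GGUF file produced by convert_to_gguf above
+ llm = Llama(model_path="model.gguf", n_ctx=512)
+
+ # Run a short completion to sanity-check the quantized model
+ output = llm("Example prompt:", max_tokens=32)
+ print(output["choices"][0]["text"])
+ ```
+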
For detailed usage examples and API documentation, please refer to our:
- 📚 [Official Documentation](https://quantllm.readthedocs.io/)
- 🎓 [Tutorials](https://quantllm.readthedocs.io/tutorials/)
@@ -48,39 +76,41 @@ For detailed usage examples and API documentation, please refer to our:

### Minimum Requirements
- **CPU**: 4+ cores
- - **RAM**: 16GB
- - **Storage**: 20GB free space
- - **Python**: 3.8+
+ - **RAM**: 16GB+
+ - **Storage**: 10GB+ free space
+ - **Python**: 3.10+

- ### Recommended Requirements
+ ### Recommended for Large Models
+ - **CPU**: 8+ cores
+ - **RAM**: 32GB+
- **GPU**: NVIDIA GPU with 8GB+ VRAM
- - **RAM**: 32GB
- - **Storage**: 50GB+ SSD
- **CUDA**: 11.7+
+ - **Storage**: 20GB+ free space
+
+ ### GGUF Quantization Types

- ### Resource Usage Guidelines
- | Model Size | 4-bit (GPU RAM) | 8-bit (GPU RAM) | CPU RAM (min) |
- |------------|-----------------|-----------------|---------------|
- | 3B params | ~6GB | ~9GB | 16GB |
- | 7B params | ~12GB | ~18GB | 32GB |
- | 13B params | ~20GB | ~32GB | 64GB |
- | 70B params | ~90GB | ~140GB | 256GB |
+ | Type | Bits | Description | Use Case |
+ |--------|------|---------------------|---------------------------|
+ | Q2_K | 2 | Extreme compression | Size-critical deployment |
+ | Q3_K_S | 3 | Small size | Limited storage |
+ | Q4_K_M | 4 | Balanced quality | General use |
+ | Q5_K_M | 5 | Higher quality | Quality-sensitive tasks |
+ | Q8_0 | 8 | Best quality | Accuracy-critical tasks |
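+
+ Choosing a row from this table only changes the `bits` and `quant_type` arguments shown in the Quick Example. The sketch below maps use cases to those arguments; the preset names are illustrative and not part of the QuantLLM API:
+
+ ```python
+ # Illustrative presets only: (quant_type, bits) pairs from the table above.
+ GGUF_PRESETS = {
+     "size_critical": ("Q2_K", 2),
+     "limited_storage": ("Q3_K_S", 3),
+     "general_use": ("Q4_K_M", 4),
+     "quality_sensitive": ("Q5_K_M", 5),
+     "accuracy_critical": ("Q8_0", 8),
+ }
+
+ quant_type, bits = GGUF_PRESETS["general_use"]
+ # Pass these as the quant_type / bits arguments of QuantLLM.quantize_from_pretrained.
+ print(f"quant_type={quant_type!r}, bits={bits}")
+ ```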

## 🔄 Version Compatibility

| QuantLLM | Python | PyTorch | Transformers | CUDA |
|----------|--------|---------|--------------|------|
- | latest | ≥3.10 | ≥2.0.0 | ≥4.30.0 | ≥11.7 |
+ | 1.2.0 | ≥3.10 | ≥2.0.0 | ≥4.30.0 | ≥11.7 |

## 🗺 Roadmap

- - [ ] Multi-GPU training support
- - [ ] AutoML for hyperparameter tuning
- - [ ] Integration of additional advanced quantization algorithms and techniques
- - [ ] Custom model architecture support
- - [ ] Enhanced logging and visualization
- - [ ] Model compression techniques
- - [ ] Deployment optimizations
+ - [ ] Support for more GGUF model architectures
+ - [ ] Enhanced benchmarking capabilities
+ - [ ] Multi-GPU processing support
+ - [ ] Advanced memory optimization techniques
+ - [ ] Integration with more deployment platforms
+ - [ ] Custom quantization kernels

## 🤝 Contributing

@@ -92,14 +122,12 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file

## 🙏 Acknowledgments

- - [HuggingFace](https://huggingface.co/) for their amazing Transformers library
- - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) for quantization
- - [PEFT](https://github.com/huggingface/peft) for parameter-efficient fine-tuning
- - [Weights & Biases](https://wandb.ai/) for experiment tracking
+ - [llama.cpp](https://github.com/ggerganov/llama.cpp) for the GGUF format
+ - [HuggingFace](https://huggingface.co/) for the Transformers library
+ - [CTransformers](https://github.com/marella/ctransformers) for GGUF support

## 📫 Contact & Support

- - GitHub Issues: [Create an issue](https://github.com/yourusername/QuantLLM/issues)
+ - GitHub Issues: [Create an issue](https://github.com/codewithdark-git/QuantLLM/issues)
- Documentation: [Read the docs](https://quantllm.readthedocs.io/)
- - Discord: [Join our community](https://discord.gg/quantllm)
-
+