# 📦 GGUF API

Export models to GGUF format for llama.cpp, Ollama, and LM Studio.

---

## Quick Reference

```python
from quantllm import turbo, convert_to_gguf, quantize_gguf

# Method 1: Via TurboModel
model = turbo("meta-llama/Llama-3.2-3B")
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Method 2: Direct conversion
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.Q4_K_M.gguf", quant_type="Q4_K_M")

# Method 3: Re-quantize existing GGUF
quantize_gguf("model.F16.gguf", "model.Q4_K_M.gguf", quant_type="Q4_K_M")
```

---
## convert_to_gguf()

Convert a HuggingFace model to GGUF format.

```python
def convert_to_gguf(
    model_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
    model_dtype: str = "auto",
    verbose: bool = True,
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_path` | str | required | HuggingFace model name or local path |
| `output_path` | str | required | Output .gguf file path |
| `quant_type` | str | "Q4_K_M" | Quantization type |
| `model_dtype` | str | "auto" | Model dtype (auto, f16, f32) |
| `verbose` | bool | True | Show progress |

### Returns

Path to the created GGUF file.

### Example

```python
from quantllm import convert_to_gguf

# Basic conversion
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)

# Higher quality
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q8_0.gguf",
    quant_type="Q8_0"
)
```
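The `model_dtype` argument controls the precision the weights are handled in before quantization; `"auto"` presumably keeps the checkpoint's native dtype. A minimal sketch using the documented parameter (the exact effect of each dtype value is an assumption here):

```python
from quantllm import convert_to_gguf

# Handle weights in f32 before quantizing; assumed to trade extra conversion
# memory for slightly higher fidelity than f16.
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q4_K_M.gguf",
    quant_type="Q4_K_M",
    model_dtype="f32",
)
```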
---

## quantize_gguf()

Re-quantize an existing GGUF file to a different quantization type.

```python
def quantize_gguf(
    input_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_path` | str | required | Input GGUF file path |
| `output_path` | str | required | Output GGUF file path |
| `quant_type` | str | "Q4_K_M" | Target quantization type |

### Example

```python
from quantllm import quantize_gguf

# Re-quantize F16 to Q4_K_M
quantize_gguf(
    "model.F16.gguf",
    "model.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)
```
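A common pattern, sketched below on the assumption that `"F16"` is accepted as a `quant_type` for the initial conversion (it appears in the type list in the next section), is to convert once at full F16 precision and then derive several smaller variants from that single file:

```python
from quantllm import convert_to_gguf, quantize_gguf

# One full-precision conversion from the HuggingFace checkpoint...
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.F16.gguf", quant_type="F16")

# ...then several re-quantizations from the same F16 file, without
# touching the original checkpoint again.
for quant in ["Q4_K_M", "Q5_K_M", "Q8_0"]:
    quantize_gguf("model.F16.gguf", f"model.{quant}.gguf", quant_type=quant)
```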
---

## GGUF_QUANT_TYPES

Available quantization types.

```python
from quantllm import GGUF_QUANT_TYPES

print(GGUF_QUANT_TYPES)
# ['Q2_K', 'Q3_K_S', 'Q3_K_M', 'Q3_K_L', 'Q4_K_S', 'Q4_K_M',
#  'Q5_K_S', 'Q5_K_M', 'Q6_K', 'Q8_0', 'F16', 'F32']
```

### Quantization Comparison

| Type | Bits | Quality | Size (7B) | Use Case |
|------|------|---------|-----------|----------|
| `Q2_K` | 2 | Low | ~2 GB | Extreme compression |
| `Q3_K_S` | 3 | Fair | ~2.5 GB | Small devices |
| `Q3_K_M` | 3 | Fair | ~3 GB | Constrained memory |
| `Q4_K_S` | 4 | Good | ~3.5 GB | Balanced (smaller) |
| `Q4_K_M` | 4 | Good | ~4 GB | **Recommended** ⭐ |
| `Q5_K_S` | 5 | High | ~4.5 GB | Quality focus |
| `Q5_K_M` | 5 | High | ~5 GB | Quality balance |
| `Q6_K` | 6 | Very High | ~5.5 GB | Near original |
| `Q8_0` | 8 | Excellent | ~7 GB | Maximum quality |
| `F16` | 16 | Original | ~14 GB | Full precision |
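### Choosing by Disk Budget

When the constraint is disk or download size rather than VRAM, the size column above can drive the choice directly. The helper below is purely illustrative and not part of quantllm; its thresholds are the approximate 7B sizes from the comparison table:

```python
# Approximate GGUF sizes in GB for a 7B model, copied from the table above.
APPROX_SIZE_GB = {
    "Q2_K": 2.0, "Q3_K_S": 2.5, "Q3_K_M": 3.0, "Q4_K_S": 3.5, "Q4_K_M": 4.0,
    "Q5_K_S": 4.5, "Q5_K_M": 5.0, "Q6_K": 5.5, "Q8_0": 7.0, "F16": 14.0,
}

def pick_quant_type(budget_gb: float) -> str:
    """Pick the highest-quality type whose approximate size fits the budget."""
    fitting = [t for t, size in APPROX_SIZE_GB.items() if size <= budget_gb]
    # The dict is ordered smallest to largest, so the last fit is the best quality.
    return fitting[-1] if fitting else "Q2_K"

print(pick_quant_type(5.0))  # -> 'Q5_K_M'
```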
---

## QUANT_RECOMMENDATIONS

Recommended quantization types for common hardware tiers.

```python
from quantllm import QUANT_RECOMMENDATIONS

print(QUANT_RECOMMENDATIONS)
# {
#     'low_memory': 'Q3_K_M',     # <6 GB VRAM
#     'balanced': 'Q4_K_M',       # 6-12 GB VRAM (recommended)
#     'quality': 'Q5_K_M',        # 12-24 GB VRAM
#     'high_quality': 'Q6_K',     # >24 GB VRAM
#     'maximum': 'Q8_0',          # Maximum quality
# }
```
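The dictionary plugs straight into `convert_to_gguf()`; a short sketch:

```python
from quantllm import QUANT_RECOMMENDATIONS, convert_to_gguf

# Use the recommendation for a mid-range GPU (6-12 GB VRAM).
quant = QUANT_RECOMMENDATIONS["balanced"]  # 'Q4_K_M'
convert_to_gguf("meta-llama/Llama-3.2-3B", f"model.{quant}.gguf", quant_type=quant)
```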
---

## check_llama_cpp()

Check if llama.cpp is installed.

```python
def check_llama_cpp() -> bool
```

### Example

```python
from quantllm import check_llama_cpp

if check_llama_cpp():
    print("llama.cpp is ready!")
else:
    print("llama.cpp not found")
```
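A natural follow-up, sketched here with the two helpers documented in this section and the next, is to install llama.cpp only when the check fails (roughly what `ensure_llama_cpp_installed()` below automates, as far as this reference shows):

```python
from quantllm import check_llama_cpp, install_llama_cpp

# Install the backend only if it is not already present.
if not check_llama_cpp():
    install_llama_cpp("./tools/llama.cpp")
```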
---

## install_llama_cpp()

Install llama.cpp automatically.

```python
def install_llama_cpp(
    install_dir: str = "./llama.cpp",
    force: bool = False,
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `install_dir` | str | "./llama.cpp" | Installation directory |
| `force` | bool | False | Force reinstall |

### Example

```python
from quantllm import install_llama_cpp

# Install to default location
install_llama_cpp()

# Install to custom location
install_llama_cpp("./tools/llama.cpp")
```
---

## ensure_llama_cpp_installed()

Ensure llama.cpp is available, installing it if necessary. Returns the installation path.

```python
def ensure_llama_cpp_installed() -> str
```

### Example

```python
from quantllm import ensure_llama_cpp_installed

# Automatically installs if not present
llama_path = ensure_llama_cpp_installed()
print(f"llama.cpp at: {llama_path}")
```
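In a scripted export pipeline you might call this once before converting. A minimal sketch; whether `convert_to_gguf()` strictly requires llama.cpp to be present is not stated in this reference, so treat the first call as a defensive step:

```python
from quantllm import ensure_llama_cpp_installed, convert_to_gguf

# Guarantee the llama.cpp backend exists, then convert.
ensure_llama_cpp_installed()
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.Q4_K_M.gguf", quant_type="Q4_K_M")
```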
---

## export_to_gguf()

High-level export function. Deprecated; use `convert_to_gguf()` instead.

```python
def export_to_gguf(
    model,
    tokenizer,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str
```
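For existing code on the deprecated entry point, the call takes already-loaded objects; loading via `transformers` here is an assumption for illustration, and new code should switch to `convert_to_gguf()`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from quantllm import export_to_gguf

# Deprecated path: pass loaded model and tokenizer objects directly.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
export_to_gguf(model, tokenizer, "model.Q4_K_M.gguf", quant_type="Q4_K_M")
```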
---

## Using Exported Models

### llama.cpp

```bash
./llama-cli -m model.Q4_K_M.gguf -p "Hello!" -n 100
```

### Ollama

```bash
echo 'FROM ./model.Q4_K_M.gguf' > Modelfile
ollama create mymodel -f Modelfile
ollama run mymodel
```

### LM Studio

1. Import the `.gguf` file
2. Start chatting

### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(model_path="model.Q4_K_M.gguf")
output = llm("Hello!", max_tokens=100)
print(output["choices"][0]["text"])
```

---

## See Also

- [GGUF Export Guide](../guide/gguf-export.md) — Detailed guide
- [TurboModel.export()](turbomodel.md#export) — Export via TurboModel
- [Hub Integration](hub.md) — Push GGUF to HuggingFace