
Commit 12abb06

Update README and documentation for QuantLLM v2.0: Revise project description, enhance installation instructions, and improve model loading and export examples. Add new features and usage guidelines for GGUF, ONNX, and MLX formats, along with updated quantization
1 parent 934b237 commit 12abb06

File tree: 15 files changed (+2399, −718 lines)


README.md: 209 additions & 223 deletions (large diff not rendered by default)

docs/api/gguf.md: 226 additions & 73 deletions
# 📦 GGUF API

Export models to GGUF format for llama.cpp, Ollama, and LM Studio.

---

## Quick Reference

```python
from quantllm import turbo, convert_to_gguf, quantize_gguf

# Method 1: Via TurboModel
model = turbo("meta-llama/Llama-3.2-3B")
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Method 2: Direct conversion
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.Q4_K_M.gguf", quant_type="Q4_K_M")

# Method 3: Re-quantize existing GGUF
quantize_gguf("model.F16.gguf", "model.Q4_K_M.gguf", quant_type="Q4_K_M")
```

---

## convert_to_gguf()

Convert a HuggingFace model to GGUF format.

```python
def convert_to_gguf(
    model_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
    model_dtype: str = "auto",
    verbose: bool = True,
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_path` | str | required | HuggingFace model name or local path |
| `output_path` | str | required | Output .gguf file path |
| `quant_type` | str | "Q4_K_M" | Quantization type |
| `model_dtype` | str | "auto" | Model dtype (auto, f16, f32) |
| `verbose` | bool | True | Show progress |

### Returns

Path to the created GGUF file.

### Example

```python
from quantllm import convert_to_gguf

# Basic conversion
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)

# Higher quality
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q8_0.gguf",
    quant_type="Q8_0"
)
```

---

## quantize_gguf()

Re-quantize an existing GGUF file to a different quantization type.

```python
def quantize_gguf(
    input_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_path` | str | required | Input GGUF file path |
| `output_path` | str | required | Output GGUF file path |
| `quant_type` | str | "Q4_K_M" | Target quantization type |

### Example

```python
from quantllm import quantize_gguf

# Re-quantize F16 to Q4_K_M
quantize_gguf(
    "model.F16.gguf",
    "model.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)
```
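
A common pattern is to convert the HuggingFace model once at high precision and then derive several quantized variants from that single file. Below is a minimal sketch, assuming `convert_to_gguf` accepts `"F16"` as a `quant_type` (it is listed in `GGUF_QUANT_TYPES` below); the exact model name and file paths are illustrative.

```python
from quantllm import convert_to_gguf, quantize_gguf

# Convert the HF model once at half precision...
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.F16.gguf",
    quant_type="F16",  # assumption: F16 is accepted here, as listed in GGUF_QUANT_TYPES
)

# ...then derive several quantized variants from the same F16 file,
# avoiding a fresh HF-to-GGUF conversion for each target type.
for qt in ["Q4_K_M", "Q5_K_M", "Q8_0"]:
    quantize_gguf("llama3.F16.gguf", f"llama3.{qt}.gguf", quant_type=qt)
```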

---

## GGUF_QUANT_TYPES

Available quantization types.

```python
from quantllm import GGUF_QUANT_TYPES

print(GGUF_QUANT_TYPES)
# ['Q2_K', 'Q3_K_S', 'Q3_K_M', 'Q3_K_L', 'Q4_K_S', 'Q4_K_M',
#  'Q5_K_S', 'Q5_K_M', 'Q6_K', 'Q8_0', 'F16', 'F32']
```

### Quantization Comparison

| Type | Bits | Quality | Size (7B) | Use Case |
|------|------|---------|-----------|----------|
| `Q2_K` | 2 | Low | ~2 GB | Extreme compression |
| `Q3_K_S` | 3 | Fair | ~2.5 GB | Small devices |
| `Q3_K_M` | 3 | Fair | ~3 GB | Constrained memory |
| `Q4_K_S` | 4 | Good | ~3.5 GB | Balanced (smaller) |
| `Q4_K_M` | 4 | Good | ~4 GB | **Recommended** |
| `Q5_K_S` | 5 | High | ~4.5 GB | Quality focus |
| `Q5_K_M` | 5 | High | ~5 GB | Quality balance |
| `Q6_K` | 6 | Very High | ~5.5 GB | Near original |
| `Q8_0` | 8 | Excellent | ~7 GB | Maximum quality |
| `F16` | 16 | Original | ~14 GB | Full precision |
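
The sizes in the table are approximate. As a rough back-of-the-envelope check, file size scales with parameter count times the effective bits per weight; the sketch below uses approximate bits-per-weight figures (not from this library), since K-quants store some tensors at higher precision than their nominal bit width.

```python
# Rough GGUF size estimate: parameters * effective bits per weight / 8.
# The bits-per-weight values below are approximations; real files differ slightly.
EFFECTIVE_BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def estimate_gguf_size_gb(n_params: float, quant_type: str) -> float:
    return n_params * EFFECTIVE_BPW[quant_type] / 8 / 1e9

for qt in ("Q4_K_M", "Q8_0", "F16"):
    print(qt, round(estimate_gguf_size_gb(7e9, qt), 1), "GB")
# Roughly 4.2, 7.4, and 14.0 GB for a 7B model, in line with the table above.
```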

---

## QUANT_RECOMMENDATIONS

Recommended quantization types based on available hardware.

```python
from quantllm import QUANT_RECOMMENDATIONS

print(QUANT_RECOMMENDATIONS)
# {
#   'low_memory': 'Q3_K_M',     # <6 GB VRAM
#   'balanced': 'Q4_K_M',       # 6-12 GB VRAM (recommended)
#   'quality': 'Q5_K_M',        # 12-24 GB VRAM
#   'high_quality': 'Q6_K',     # >24 GB VRAM
#   'maximum': 'Q8_0',          # Maximum quality
# }
```
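
One way to use the mapping is to pick the entry that matches your hardware and pass it straight to `convert_to_gguf()`. A minimal sketch; the `'balanced'` key and the model name are taken from the examples above.

```python
from quantllm import QUANT_RECOMMENDATIONS, convert_to_gguf

# Pick the recommendation that matches the available VRAM (here: 6-12 GB).
quant = QUANT_RECOMMENDATIONS["balanced"]   # 'Q4_K_M'

convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    f"llama3.{quant}.gguf",
    quant_type=quant,
)
```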

---

## check_llama_cpp()

Check if llama.cpp is installed.

```python
def check_llama_cpp() -> bool
```

### Example

```python
from quantllm import check_llama_cpp

if check_llama_cpp():
    print("llama.cpp is ready!")
else:
    print("llama.cpp not found")
```

---

## install_llama_cpp()

Install llama.cpp automatically.

```python
def install_llama_cpp(
    install_dir: str = "./llama.cpp",
    force: bool = False,
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `install_dir` | str | "./llama.cpp" | Installation directory |
| `force` | bool | False | Force reinstall |

### Example

```python
from quantllm import install_llama_cpp

# Install to default location
install_llama_cpp()

# Install to custom location
install_llama_cpp("./tools/llama.cpp")
```
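
`check_llama_cpp()` and `install_llama_cpp()` can be combined into a simple guard; `ensure_llama_cpp_installed()` below wraps the same pattern in one call. A minimal sketch using only the functions documented on this page:

```python
from quantllm import check_llama_cpp, install_llama_cpp

# Install llama.cpp only when it is not already available.
if not check_llama_cpp():
    install_llama_cpp("./tools/llama.cpp")
```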

---

## ensure_llama_cpp_installed()

Ensure llama.cpp is installed, installing it if needed.

```python
def ensure_llama_cpp_installed() -> str
```

### Example

```python
from quantllm import ensure_llama_cpp_installed

# Automatically installs if not present
llama_path = ensure_llama_cpp_installed()
print(f"llama.cpp at: {llama_path}")
```

---

## export_to_gguf()

High-level export function (deprecated; use `convert_to_gguf()` instead).

```python
def export_to_gguf(
    model,
    tokenizer,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str
```
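
For reference, here is a sketch of the legacy call next to its `convert_to_gguf()` replacement. The in-memory `model`/`tokenizer` arguments follow the signature above; the model name is illustrative, and new code should pass a model name or path instead of loaded objects.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from quantllm import export_to_gguf, convert_to_gguf

# Legacy API: requires loading the model and tokenizer into memory first.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
export_to_gguf(model, tokenizer, "model.Q4_K_M.gguf", quant_type="Q4_K_M")

# Preferred replacement: point convert_to_gguf at the model name or local path.
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.Q4_K_M.gguf", quant_type="Q4_K_M")
```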

---

## Using Exported Models

### llama.cpp

```bash
./llama-cli -m model.Q4_K_M.gguf -p "Hello!" -n 100
```

### Ollama

```bash
echo 'FROM ./model.Q4_K_M.gguf' > Modelfile
ollama create mymodel -f Modelfile
ollama run mymodel
```

### LM Studio

1. Import the `.gguf` file
2. Start chatting

### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(model_path="model.Q4_K_M.gguf")
output = llm("Hello!", max_tokens=100)
print(output["choices"][0]["text"])
```

---

## See Also

- [GGUF Export Guide](../guide/gguf-export.md) — Detailed guide
- [TurboModel.export()](turbomodel.md#export) — Export via TurboModel
- [Hub Integration](hub.md) — Push GGUF to HuggingFace
