[docs] Replace deprecated configs with Config objects (#2375)
**Summary:** We still mention old, deprecated "configs" like
`int4_weight_only` in many user-facing docs. This commit replaces
these occurrences with the actual corresponding config objects.
**Test Plan:**
```
git grep int4_weight_only
git grep int8_dynamic_activation_
git grep quantize_
git grep sparsify_
```
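For readers updating their own code alongside these docs, the change is a one-line swap from the deprecated callable to the corresponding config object. A minimal sketch of the pattern (not from the diff; toy model, with CUDA and bfloat16 assumed for the int4 tinygemm path, and only the two configs named in this PR shown):

```py
import copy

import torch
from torchao.quantization import (
    quantize_,
    Int4WeightOnlyConfig,                   # replaces int4_weight_only()
    Int8DynamicActivationInt8WeightConfig,  # replaces int8_dynamic_activation_int8_weight()
)

# Toy model; CUDA + bfloat16 assumed so the int4 weight-only (tinygemm) path applies.
base = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# Before: quantize_(m, int4_weight_only(group_size=128))
m_int4 = copy.deepcopy(base)
quantize_(m_int4, Int4WeightOnlyConfig(group_size=128))

# Before: quantize_(m, int8_dynamic_activation_int8_weight())
m_int8 = copy.deepcopy(base)
quantize_(m_int8, Int8DynamicActivationInt8WeightConfig())
```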
* What happens: linear.weight = torch.nn.Parameter(to_affine_quantized_intx(linear.weight), requires_grad=False)
* quantization primitive ops: choose_qparams_affine and quantize_affine are called to quantize the Tensor
* quantized Tensor will be `AffineQuantizedTensor`, a quantized tensor with derived dtype (e.g. int4 with scale and zero_point)
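As a quick, hedged illustration of the bullets above (not part of the diff; CUDA and bfloat16 assumed since that is what the int4 tinygemm path expects):

```py
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16).cuda()
quantize_(model, Int4WeightOnlyConfig())

weight = model[0].weight
# The linear's weight is now a frozen parameter backed by a quantized tensor subclass.
print(type(weight))          # expected to mention AffineQuantizedTensor
print(weight.requires_grad)  # False
```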
@@ -212,10 +212,10 @@ During Model Execution: model(input)
During Quantization
###################

-First we start with the API call: ``quantize_(model, int4_weight_only())``. What this does is convert the weights of the nn.Linear modules in the model to int4 quantized tensors (``AffineQuantizedTensor`` with int4 dtype, asymmetric, per group quantized), using the ``tensor_core_tiled`` layout for the tinygemm kernel.
+First we start with the API call: ``quantize_(model, Int4WeightOnlyConfig())``. What this does is convert the weights of the nn.Linear modules in the model to int4 quantized tensors (``AffineQuantizedTensor`` with int4 dtype, asymmetric, per group quantized), using the ``tensor_core_tiled`` layout for the tinygemm kernel.

-* `quantize_ <https://github.com/pytorch/ao/blob/4865ee61340cc63a1469f437388067b853c9289e/torchao/quantization/quant_api.py#L403>`__: the model level API that quantizes the weight of linear by applying the conversion function from the user (second argument)
-* `int4_weight_only <https://github.com/pytorch/ao/blob/242f181fe59e233b458740b06464ad42da8df6af/torchao/quantization/quant_api.py#L522>`__: the function that returns a function that converts the weight of linear to an int4 weight only quantized weight
+* `quantize_ <https://docs.pytorch.org/ao/main/generated/torchao.quantization.quantize_.html#torchao.quantization.quantize_>`__: the model level API that quantizes the weight of linear according to the config from the user (second argument)
+* `Int4WeightOnlyConfig <https://docs.pytorch.org/ao/main/generated/torchao.quantization.Int4WeightOnlyConfig.html#torchao.quantization.Int4WeightOnlyConfig>`__: the config that describes how to convert the weight of linear to an int4 weight only quantized weight
* Calls quantization primitive ops like choose_qparams_affine and quantize_affine to quantize the Tensor
* `TensorCoreTiledLayout <https://github.com/pytorch/ao/blob/242f181fe59e233b458740b06464ad42da8df6af/torchao/dtypes/affine_quantized_tensor.py#L573>`__: the tensor core tiled layout type, storing parameters for the packing format
* `TensorCoreTiledAQTTensorImpl <https://github.com/pytorch/ao/blob/242f181fe59e233b458740b06464ad42da8df6af/torchao/dtypes/affine_quantized_tensor.py#L1376>`__: the tensor core tiled TensorImpl, which stores the packed weight for the efficient int4 weight only kernel (tinygemm kernel)
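Putting the steps in this section together, a sketch of the flow (not from the diff; CUDA and bfloat16 are assumed because the ``tensor_core_tiled`` / tinygemm path targets that setup):

```py
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16).cuda()

# quantize_ walks the model and converts each nn.Linear weight according to the config;
# Int4WeightOnlyConfig selects int4, asymmetric, per-group quantization (group_size=128
# is the default), stored in the tensor_core_tiled layout used by the tinygemm kernel.
quantize_(model, Int4WeightOnlyConfig(group_size=128))

x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    y = model(x)  # dispatches to the int4 weight-only kernel path
```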
-The size of the quantized model is typically going to be smaller than the original floating point model, but it also depends on the specific technique and implementation you are using. You can print the model size with the ``torchao.utils.get_model_size_in_bytes`` utility function; specifically for the above example using int4_weight_only quantization, we can see the size reduction is around 4x::
+The size of the quantized model is typically going to be smaller than the original floating point model, but it also depends on the specific technique and implementation you are using. You can print the model size with the ``torchao.utils.get_model_size_in_bytes`` utility function; specifically for the above example using Int4WeightOnlyConfig quantization, we can see the size reduction is around 4x::
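A sketch of that size check (not from the diff), using ``torchao.utils.get_model_size_in_bytes``; the ~4x figure assumes a bfloat16 baseline quantized to int4, and the exact ratio depends on group size and any non-linear parameters:

```py
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig
from torchao.utils import get_model_size_in_bytes

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16).cuda()

size_before = get_model_size_in_bytes(model)
quantize_(model, Int4WeightOnlyConfig(group_size=128))
size_after = get_model_size_in_bytes(model)

# bfloat16 (16-bit) weights become int4 (4-bit) plus per-group scales/zero_points,
# so the ratio lands close to the ~4x reduction quoted above.
print(f"reduction: {size_before / size_after:.1f}x")
```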
torchao/quantization/README.md (+3, -3)
@@ -381,7 +381,7 @@ We're trying to develop kernels for low bit quantization for intx quantization f
You can try out these APIs with the `quantize_` API as above alongside the config `UIntXWeightOnlyConfig`. An example can be found in `torchao/_models/llama/generate.py`.

We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon). The benchmarks below were run on an M1 Mac Pro, with 8 performance cores, 2 efficiency cores, and 32GB of RAM. In all cases, torch.compile was used.

| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
@@ -390,7 +390,7 @@ We have kernels that do 8-bit dynamic quantization of activations and uintx grou
-You can try out these APIs with the `quantize_` API as above alongside the constructor `int8_dynamic_activation_intx_weight`. An example can be found in `torchao/_models/llama/generate.py`.
+You can try out these APIs with the `quantize_` API as above alongside the config `Int8DynamicActivationIntxWeightConfig`. An example can be found in `torchao/_models/llama/generate.py`.
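A hedged sketch of such a call (not from the diff). The constructor arguments shown here (`weight_dtype`, `granularity`) are assumptions; check `Int8DynamicActivationIntxWeightConfig`'s signature in your torchao version, and note these experimental kernels target ARM CPUs as described above:

```py
import torch
from torchao.quantization import quantize_, Int8DynamicActivationIntxWeightConfig
from torchao.quantization.granularity import PerGroup

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False))  # CPU, float32

# Parameter names below are assumed for illustration (int4 weights, groupwise scales);
# verify them against the config's documentation before relying on this.
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(weight_dtype=torch.int4, granularity=PerGroup(32)),
)
```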
### Codebook Quantization
The benchmarks below were run on a single NVIDIA-A6000 GPU.
@@ -402,7 +402,7 @@ The benchmarks below were run on a single NVIDIA-A6000 GPU.
-You can try out these APIs with the `quantize_` API as above alongside the constructor `codebook_weight_only`; an example can be found in `torchao/_models/llama/generate.py`.
+You can try out these APIs with the `quantize_` API as above alongside the config `CodebookWeightOnlyConfig`; an example can be found in `torchao/_models/llama/generate.py`.
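A hedged sketch (not from the diff); the `dtype` argument is carried over from the old `codebook_weight_only` constructor as an assumption, so confirm `CodebookWeightOnlyConfig`'s actual parameters in the API docs:

```py
import torch
from torchao.quantization import quantize_, CodebookWeightOnlyConfig

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).cuda()

# dtype=torch.uint4 mirrors the old codebook_weight_only default and is illustrative only.
quantize_(model, CodebookWeightOnlyConfig(dtype=torch.uint4))
```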
### GPTQ Quantization
We have a GPTQ quantization workflow that can be used to quantize a model to int4. More details can be found in [GPTQ](./GPTQ/README.md),
torchao/sparsity/README.md (+4, -4)
@@ -52,12 +52,12 @@ These benchmarks were also ran on a NVIDIA-A100-80GB.
Sparse-Marlin 2:4 is an optimized GPU kernel that extends the Mixed Auto-Regressive Linear (Marlin) dense kernel to support 4-bit quantized weights and 2:4 sparsity, improving performance in matrix multiplication and accumulation. Full documentation can be found [here](https://github.com/IST-DASLab/Sparse-Marlin).
```py
-from torchao.quantization.quant_api import quantize_, int4_weight_only
+from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig
```

Note the existing API results in an extremely high accuracy degradation and is intended to be used in concert with an already sparsified+finetuned checkpoint where possible until we develop
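A sketch filling out the snippet above (not from the diff): the `MarlinSparseLayout` import and the `layout=` argument follow the pattern this README used with the deprecated `int4_weight_only` and are assumed to carry over to `Int4WeightOnlyConfig`; a 2:4-sparsified bfloat16 model on CUDA is assumed:

```py
import torch
from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig
from torchao.dtypes import MarlinSparseLayout

# Assumed: a model that has already been 2:4 sparsified (and ideally finetuned),
# in bfloat16 on CUDA, per the accuracy note above.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16).cuda()

# int4 weight-only quantization packed into the Sparse-Marlin 2:4 layout.
quantize_(model, Int4WeightOnlyConfig(layout=MarlinSparseLayout()))
```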
@@ -68,11 +68,11 @@ the necessary supporting flows in torchao.
We support composing int8 dynamic quantization with 2:4 sparsity. We fuse one of the scalar dequant multiplications into our cuSPARSELt sparse mm in order to remain performant.
```py
-from torchao.quantization.quant_api import quantize_, int8_dynamic_activation_int8_weight
+from torchao.quantization.quant_api import quantize_, Int8DynamicActivationInt8WeightConfig
```
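A sketch completing the second snippet the same way (not from the diff): `SemiSparseLayout` and the `layout=` argument mirror this README's pattern for the deprecated constructor and are assumed to apply to the config as well:

```py
import torch
from torchao.quantization.quant_api import quantize_, Int8DynamicActivationInt8WeightConfig
from torchao.dtypes import SemiSparseLayout

# Assumed: a model whose weights are already 2:4 sparse; the cuSPARSELt-backed sparse mm
# handles the fused dequant multiplication mentioned above.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16).cuda()

quantize_(model, Int8DynamicActivationInt8WeightConfig(layout=SemiSparseLayout()))
```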