
Commit 7ffce59

[docs] Replace deprecated configs with Config objects (#2375)
**Summary:** We still mention old, deprecated "configs" like `int4_weight_only` in many user-facing docs. This commit replaces these occurrences with the actual corresponding config objects.

**Test Plan:**
```
git grep int4_weight_only
git grep int8_dynamic_activation_
git grep quantize_
git grep sparsify_
```
1 parent 6243040 commit 7ffce59
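For reference, a minimal before/after sketch of the renaming this commit documents; the toy model is a placeholder, and the `group_size=32` value simply mirrors the quick start snippet further below.

```py
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# placeholder model standing in for any module with nn.Linear layers
# (torch 2.4+ and a CUDA device, per the quick start)
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# deprecated spelling (what the docs used to show):
#   from torchao.quantization import int4_weight_only
#   quantize_(model, int4_weight_only(group_size=32))

# current spelling with an explicit config object:
quantize_(model, Int4WeightOnlyConfig(group_size=32))
```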

7 files changed: +19, -20 lines

docs/source/api_ref_sparsity.rst

Lines changed: 0 additions & 1 deletion

@@ -12,7 +12,6 @@ torchao.sparsity

    sparsify_
    semi_sparse_weight
-   int8_dynamic_activation_int8_semi_sparse_weight
    apply_fake_sparsity
    WandaSparsifier
    PerChannelNormObserver

docs/source/quantization.rst

Lines changed: 5 additions & 5 deletions

@@ -12,7 +12,7 @@ First we want to lay out the torchao stack::
Basic dtypes: uint1-uint7, int1-int8, float3-float8


-Any quantization algorithm will be using some components from the above stack, for example int4_weight_only quantization uses:
+Any quantization algorithm will use some components of the above stack; for example, int4 weight-only quantization uses:
(1) weight only quantization flow
(2) `tinygemm bf16 activation + int4 weight kernel <https://github.com/pytorch/pytorch/blob/136e28f616140fdc9fb78bb0390aeba16791f1e3/aten/src/ATen/native/native_functions.yaml#L4148>`__ and `quant primitive ops <https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_primitives.py>`__
(3) `AffineQuantizedTensor <https://github.com/pytorch/ao/blob/main/torchao/dtypes/affine_quantized_tensor.py>`__ tensor subclass with `TensorCoreTiledLayout <https://github.com/pytorch/ao/blob/e41ca4ee41f5f1fe16c59e00cffb4dd33d25e56d/torchao/dtypes/affine_quantized_tensor.py#L573>`__

@@ -201,7 +201,7 @@ Case Study: How int4 weight only quantization works in torchao?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To connect everything together, here is a more detailed walkthrough of how int4 weight-only quantization is implemented in torchao.

-Quantization Flow: quantize_(model, int4_weight_only())
+Quantization Flow: quantize_(model, Int4WeightOnlyConfig())
* What happens: linear.weight = torch.nn.Parameter(to_affine_quantized_intx(linear.weight), requires_grad=False)
* quantization primitive ops: choose_qparams and quantize_affine are called to quantize the Tensor
* quantized Tensor will be `AffineQuantizedTensor`, a quantized tensor with derived dtype (e.g. int4 with scale and zero_point)

@@ -212,10 +212,10 @@ During Model Execution: model(input)

During Quantization
###################
-First we start with the API call: ``quantize_(model, int4_weight_only())``. What this does is convert the weights of nn.Linear modules in the model to an int4 quantized tensor (``AffineQuantizedTensor`` that is int4 dtype, asymmetric, per group quantized), using the layout for the tinygemm kernel: ``tensor_core_tiled`` layout.
+First we start with the API call: ``quantize_(model, Int4WeightOnlyConfig())``. What this does is convert the weights of nn.Linear modules in the model to an int4 quantized tensor (``AffineQuantizedTensor`` that is int4 dtype, asymmetric, per group quantized), using the layout for the tinygemm kernel: ``tensor_core_tiled`` layout.

-* `quantize_ <https://github.com/pytorch/ao/blob/4865ee61340cc63a1469f437388067b853c9289e/torchao/quantization/quant_api.py#L403>`__: the model level API that quantizes the weight of linear by applying the conversion function from the user (second argument)
-* `int4_weight_only <https://github.com/pytorch/ao/blob/242f181fe59e233b458740b06464ad42da8df6af/torchao/quantization/quant_api.py#L522>`__: the function that returns a function that converts the weight of linear to an int4 weight-only quantized weight
+* `quantize_ <https://docs.pytorch.org/ao/main/generated/torchao.quantization.quantize_.html#torchao.quantization.quantize_>`__: the model level API that quantizes the weight of linear by applying the conversion specified by the user (second argument)
+* `Int4WeightOnlyConfig <https://docs.pytorch.org/ao/main/generated/torchao.quantization.Int4WeightOnlyConfig.html#torchao.quantization.Int4WeightOnlyConfig>`__: the config that specifies how the weight of linear is converted to an int4 weight-only quantized weight
* Calls quantization primitive ops like choose_qparams_affine and quantize_affine to quantize the Tensor
* `TensorCoreTiledLayout <https://github.com/pytorch/ao/blob/242f181fe59e233b458740b06464ad42da8df6af/torchao/dtypes/affine_quantized_tensor.py#L573>`__: the tensor core tiled layout type, storing parameters for the packing format
* `TensorCoreTiledAQTTensorImpl <https://github.com/pytorch/ao/blob/242f181fe59e233b458740b06464ad42da8df6af/torchao/dtypes/affine_quantized_tensor.py#L1376>`__: the tensor core tiled TensorImpl, stores the packed weight for efficient int4 weight only kernel (tinygemm kernel)
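To make the updated walkthrough concrete, here is a minimal sketch of the flow described above; the toy shapes are arbitrary, and the exact repr you see for the quantized weight may vary by torchao version.

```py
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# placeholder model; int4 weight-only quantization targets nn.Linear weights
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# swaps each linear weight for a quantized tensor subclass
# (per the walkthrough: an AffineQuantizedTensor with a tensor_core_tiled layout)
quantize_(model, Int4WeightOnlyConfig(group_size=32))

# the module structure is unchanged; only the weight tensor's type changed
print(model[0].weight)

# inference dispatches to the tinygemm bf16-activation / int4-weight kernel
x = torch.randn(16, 1024, dtype=torch.bfloat16, device="cuda")
y = model(x)
```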

docs/source/quick_start.rst

Lines changed: 2 additions & 2 deletions

@@ -56,8 +56,8 @@ for efficient mixed dtype matrix multiplication:
.. code:: py

    # torch 2.4+ only
-   from torchao.quantization import int4_weight_only, quantize_
-   quantize_(model, int4_weight_only(group_size=32))
+   from torchao.quantization import Int4WeightOnlyConfig, quantize_
+   quantize_(model, Int4WeightOnlyConfig(group_size=32))

The quantized model is now ready to use! Note that the quantization
logic is inserted through tensor subclasses, so there is no change

docs/source/serialization.rst

Lines changed: 3 additions & 3 deletions

@@ -14,7 +14,7 @@ Here is the serialization and deserialization flow::
    from torchao.utils import get_model_size_in_bytes
    from torchao.quantization.quant_api import (
        quantize_,
-       int4_weight_only,
+       Int4WeightOnlyConfig,
    )

    class ToyLinearModel(torch.nn.Module):

@@ -36,7 +36,7 @@ Here is the serialization and deserialization flow::
    print(f"original model size: {get_model_size_in_bytes(m) / 1024 / 1024} MB")

    example_inputs = m.example_inputs(dtype=dtype, device="cuda")
-   quantize_(m, int4_weight_only())
+   quantize_(m, Int4WeightOnlyConfig())
    print(f"quantized model size: {get_model_size_in_bytes(m) / 1024 / 1024} MB")

    ref = m(*example_inputs)

@@ -70,7 +70,7 @@ quantized model ``state_dict``::
    {"linear1.weight": quantized_weight1, "linear2.weight": quantized_weight2, ...}


-The size of the quantized model is typically going to be smaller than the original floating point model, but it also depends on the specific technique and implementation you are using. You can print the model size with the ``torchao.utils.get_model_size_in_bytes`` utility function; specifically, for the above example using int4_weight_only quantization, we can see the size reduction is around 4x::
+The size of the quantized model is typically going to be smaller than the original floating point model, but it also depends on the specific technique and implementation you are using. You can print the model size with the ``torchao.utils.get_model_size_in_bytes`` utility function; specifically, for the above example using Int4WeightOnlyConfig quantization, we can see the size reduction is around 4x::

    original model size: 4.0 MB
    quantized model size: 1.0625 MB
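For context, a minimal save/load sketch along the lines of the flow above; the file name, the stand-in model, and the use of `assign=True` / `weights_only=False` are my assumptions, not lines from this diff.

```py
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_
from torchao.utils import get_model_size_in_bytes

# stand-in for the ToyLinearModel used in the doc
m = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
quantize_(m, Int4WeightOnlyConfig())
print(f"quantized model size: {get_model_size_in_bytes(m) / 1024 / 1024} MB")

# serialize: the state_dict now holds quantized weights, e.g. {"0.weight": quantized_weight, ...}
torch.save(m.state_dict(), "quantized_model.pt")

# deserialize: rebuild the float model, then assign the quantized weights onto it
m_loaded = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
state_dict = torch.load("quantized_model.pt", weights_only=False)
m_loaded.load_state_dict(state_dict, assign=True)
```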

scripts/quick_start.py

Lines changed: 2 additions & 2 deletions

@@ -7,7 +7,7 @@

import torch

-from torchao.quantization import int4_weight_only, quantize_
+from torchao.quantization import Int4WeightOnlyConfig, quantize_
from torchao.utils import (
    TORCH_VERSION_AT_LEAST_2_5,
    benchmark_model,

@@ -43,7 +43,7 @@ def forward(self, x):
# ========================

# torch 2.4+ only
-quantize_(model, int4_weight_only(group_size=32))
+quantize_(model, Int4WeightOnlyConfig(group_size=32))


# =============

torchao/quantization/README.md

Lines changed: 3 additions & 3 deletions

@@ -381,7 +381,7 @@ We're trying to develop kernels for low bit quantization for intx quantization f

You can try out these APIs with the `quantize_` API as above alongside the config `UIntXWeightOnlyConfig`. An example can be found in `torchao/_models/llama/generate.py`.

-### int8_dynamic_activation_intx_weight Quantization
+### Int8DynamicActivationIntxWeightConfig Quantization
We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon). The benchmarks below were run on an M1 Mac Pro with 8 performance cores, 2 efficiency cores, and 32GB of RAM. In all cases, torch.compile was used.

| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |

@@ -390,7 +390,7 @@ We have kernels that do 8-bit dynamic quantization of activations and uintx grou
| | int8_dynamic_activation_intx_weight-4-256-false | 16.03 | 65.81 | NA | 4.11 |
| | int8_dynamic_activation_intx_weight-3-256-false | 18.94 | 59.97 | NA | 3.17 |

-You can try out these APIs with the `quantize_` API as above alongside the constructor `int8_dynamic_activation_intx_weight`. An example can be found in `torchao/_models/llama/generate.py`.
+You can try out these APIs with the `quantize_` API as above alongside the config `Int8DynamicActivationIntxWeightConfig`. An example can be found in `torchao/_models/llama/generate.py`.

### Codebook Quantization
The benchmarks below were run on a single NVIDIA-A6000 GPU.

@@ -402,7 +402,7 @@ The benchmarks below were run on a single NVIDIA-A6000 GPU.
| Llama-3.1-8B| Base (bfloat16) | 7.713 | 32.16 | 482.70 | 16.35 | 15.01 |
| | codebook-4-64 | 10.095 | 1.73 | 8.63 | 23.11 | 4.98 |

-You can try out these APIs with the `quantize_` API as above alongside the constructor `codebook_weight_only`. An example can be found in `torchao/_models/llama/generate.py`.
+You can try out these APIs with the `quantize_` API as above alongside the config `CodebookWeightOnlyConfig`. An example can be found in `torchao/_models/llama/generate.py`.

### GPTQ Quantization
We have a GPTQ quantization workflow that can be used to quantize a model to int4. More details can be found in [GPTQ](./GPTQ/README.md),
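Stepping back from the diff for a moment: a hedged usage sketch for the renamed configs above. Whether these configs are re-exported from `torchao.quantization` at the top level, and what their default constructor arguments are, is an assumption here; the real benchmarks in `torchao/_models/llama/generate.py` pass model-specific settings.

```py
import torch
from torchao.quantization import quantize_
# config objects named in the README above; import location and defaults are assumptions
from torchao.quantization import (
    Int8DynamicActivationIntxWeightConfig,  # 8-bit dynamic activations + intx grouped weights (ARM CPU kernels)
    CodebookWeightOnlyConfig,               # codebook / lookup-table weight-only quantization
)

# placeholder model with nn.Linear layers
model = torch.nn.Sequential(torch.nn.Linear(256, 256))

# experimental ARM-CPU path: 8-bit dynamic activation + intx weight quantization
quantize_(model, Int8DynamicActivationIntxWeightConfig())

# the codebook config would be applied the same way, on a fresh copy of the model:
# quantize_(model, CodebookWeightOnlyConfig())
```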

torchao/sparsity/README.md

Lines changed: 4 additions & 4 deletions

@@ -52,12 +52,12 @@ These benchmarks were also ran on a NVIDIA-A100-80GB.
Sparse-Marlin 2:4 is an optimized GPU kernel that extends the Mixed Auto-Regressive Linear (Marlin) dense kernel to support 4-bit quantized weights and 2:4 sparsity, improving performance in matrix multiplication and accumulation. Full documentation can be found [here](https://github.com/IST-DASLab/Sparse-Marlin).

```py
-from torchao.quantization.quant_api import quantize_, int4_weight_only
+from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig
from torchao.dtypes import MarlinSparseLayout

# Your FP16 model
model = model.cuda().half()
-quantize_(model, int4_weight_only(layout=MarlinSparseLayout()))
+quantize_(model, Int4WeightOnlyConfig(layout=MarlinSparseLayout()))
```

Note that the existing API results in an extremely high accuracy degradation and is intended to be used in concert with an already sparsified+finetuned checkpoint where possible until we develop

@@ -68,11 +68,11 @@ the necessary supporting flows in torchao.
We support composing int8 dynamic quantization with 2:4 sparsity. We fuse one of the scalar dequant multiplications into our cuSPARSELt sparse mm in order to remain performant.

```py
-from torchao.quantization.quant_api import quantize_, int8_dynamic_activation_int8_weight
+from torchao.quantization.quant_api import quantize_, Int8DynamicActivationInt8WeightConfig
from torchao.dtypes import SemiSparseLayout

model = model.cuda()
-quantize_(model, int8_dynamic_activation_int8_weight(layout=SemiSparseLayout()))
+quantize_(model, Int8DynamicActivationInt8WeightConfig(layout=SemiSparseLayout()))
```

### 2:4 sparsity
