Commit 5d3367c

Add advanced usage section to quantization README (google-ai-edge#112)
* Add advanced usage section to quantization README BUG=b/356164136
* address comments
* address comments
1 parent d5d1dd6 commit 5d3367c

File tree

4 files changed (+44 −3 lines changed)


ai_edge_torch/generative/quantize/README.md

Lines changed: 25 additions & 3 deletions
````diff
@@ -18,7 +18,29 @@ Once converted, you will get a quantized `.tflite` model which will be ready for
 
 In the current release, the following schemes are supported:
 
-* Dynamic range quantization with FP32 activations and INT8 weights for linear ops
-* FP16 quantization with FP16 weights and FP32 activations and computation for all ops
+* Dynamic range quantization: FP32 activations, INT8 weights, and integer computation
+* Weight-only quantization: FP32 activations, INT8 weights, and floating point computation
+* FP16 quantization: FP16 weights, FP32 activations and floating point computation for all ops
+
+These correspond to the available recipes in `quant_recipes.py`.
+
+## Advanced usage
+
+In addition to configuring quantization using pre-configured recipes in `quant_recipes.py`, users can also customize their recipes according to their specific needs using the `LayerQuantRecipe` and `GenerativeQuantRecipe` API.
+
+`LayerQuantRecipe` specifies at a Generative API layer (`ai_edge_torch/generative/layers`) level how ops within should be quantized. `GenerativeQuantRecipe` specifies at a model level how each component of a Generative API model should be quantized. With these configuration classes, selective quantization can be configured as follows:
+
+```
+def custom_selective_quantization_recipe() -> quant_config.QuantConfig:
+  return quant_config.QuantConfig(
+      generative_recipe=quant_recipe.GenerativeQuantRecipe(
+          default=create_layer_quant_fp16(),
+          embedding=create_layer_quant_int8_dynamic(),
+          attention=create_layer_quant_int8_weight_only(),
+          feedforward=create_layer_quant_int8_dynamic(),
+      )
+  )
+```
+
+For example, this recipe specifies that the embedding table, attention, and feedforward layers should be quantized to INT8. Specifically, for attention layers the computation should be in FP32. All other ops should be quantized to the default scheme which is specified as FP16.
 
-These correspond to the available recipes in `quant_recipes.py`
````
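The selective-recipe structure added by this README hunk can be sketched with plain dataclasses. This is a minimal, self-contained illustration: `LayerQuantRecipe`, `GenerativeQuantRecipe`, and the `create_layer_quant_*` helpers below are simplified stand-ins mirroring the names in the diff, not the real `ai_edge_torch` classes.

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for ai_edge_torch's LayerQuantRecipe (illustrative only).
@dataclass
class LayerQuantRecipe:
    activation_dtype: str
    weight_dtype: str
    mode: str

def create_layer_quant_fp16() -> LayerQuantRecipe:
    return LayerQuantRecipe("FP32", "FP16", "WEIGHT_ONLY")

def create_layer_quant_int8_dynamic() -> LayerQuantRecipe:
    return LayerQuantRecipe("FP32", "INT8", "DYNAMIC_RANGE")

def create_layer_quant_int8_weight_only() -> LayerQuantRecipe:
    return LayerQuantRecipe("FP32", "INT8", "WEIGHT_ONLY")

# Model-level recipe: per-component overrides fall back to `default`.
@dataclass
class GenerativeQuantRecipe:
    default: LayerQuantRecipe
    embedding: Optional[LayerQuantRecipe] = None
    attention: Optional[LayerQuantRecipe] = None
    feedforward: Optional[LayerQuantRecipe] = None

# The same selective configuration as the README example.
recipe = GenerativeQuantRecipe(
    default=create_layer_quant_fp16(),
    embedding=create_layer_quant_int8_dynamic(),
    attention=create_layer_quant_int8_weight_only(),
    feedforward=create_layer_quant_int8_dynamic(),
)
```

The per-component fields default to `None`, so a recipe that sets only `default` (as `full_fp16_recipe` does) quantizes every layer the same way.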

ai_edge_torch/generative/quantize/quant_recipe_utils.py

Lines changed: 10 additions & 0 deletions
```diff
@@ -41,6 +41,16 @@ def create_layer_quant_int8_dynamic() -> quant_recipe.LayerQuantRecipe:
   )
 
 
+def create_layer_quant_int8_weight_only() -> quant_recipe.LayerQuantRecipe:
+  return quant_recipe.LayerQuantRecipe(
+      activation_dtype=quant_attrs.Dtype.FP32,
+      weight_dtype=quant_attrs.Dtype.INT8,
+      mode=quant_attrs.Mode.WEIGHT_ONLY,
+      algorithm=quant_attrs.Algorithm.MIN_MAX,
+      granularity=quant_attrs.Granularity.CHANNELWISE,
+  )
+
+
 def create_layer_quant_fp16() -> quant_recipe.LayerQuantRecipe:
   return quant_recipe.LayerQuantRecipe(
       activation_dtype=quant_attrs.Dtype.FP32,
```

ai_edge_torch/generative/quantize/quant_recipes.py

Lines changed: 8 additions & 0 deletions
```diff
@@ -40,6 +40,14 @@ def full_int8_dynamic_recipe() -> quant_config.QuantConfig:
   )
 
 
+def full_int8_weight_only_recipe() -> quant_config.QuantConfig:
+  return quant_config.QuantConfig(
+      generative_recipe=quant_recipe.GenerativeQuantRecipe(
+          default=quant_recipe_utils.create_layer_quant_int8_weight_only(),
+      )
+  )
+
+
 def full_fp16_recipe() -> quant_config.QuantConfig:
   return quant_config.QuantConfig(
       generative_recipe=quant_recipe.GenerativeQuantRecipe(
```
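The new weight-only recipe and the existing dynamic-range recipe store INT8 weights either way; they differ in where dequantization happens at inference time. The sketch below illustrates that difference for a single matrix-vector product (an illustration of the two compute paths in general, not the TFLite kernels): weight-only dequantizes weights back to float and computes in floating point, while dynamic range also quantizes the activations on the fly and accumulates in integer.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weight_only_matvec(q_w, w_scale, x):
    # Float compute: dequantize INT8 weights, multiply by FP32 activations.
    return [dot([q * w_scale for q in row], x) for row in q_w]

def dynamic_range_matvec(q_w, w_scale, x):
    # Integer compute: quantize activations at runtime, accumulate the
    # INT8 x INT8 products, rescale once at the end.
    x_scale = (max(abs(v) for v in x) or 1.0) / 127.0
    q_x = [round(v / x_scale) for v in x]
    return [dot(row, q_x) * w_scale * x_scale for row in q_w]

# One output channel of INT8 weights (true values 0.5, -1.0, 0.25).
q_w = [[64, -127, 32]]
w_scale = 1.0 / 127.0
x = [1.0, 0.5, -0.25]

wo = weight_only_matvec(q_w, w_scale, x)[0]
dr = dynamic_range_matvec(q_w, w_scale, x)[0]
# Both approximate the exact FP32 result of -0.0625.
```

Dynamic range adds activation quantization error but lets the inner loop run in integer arithmetic, which is why the README lists its computation as integer and weight-only's as floating point.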

ai_edge_torch/generative/test/test_quantize.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -111,6 +111,7 @@ def _feedforward_int8_dynamic_recipe() -> quant_config.QuantConfig:
     [
         (quant_recipes.full_fp16_recipe()),
         (quant_recipes.full_int8_dynamic_recipe()),
+        (quant_recipes.full_int8_weight_only_recipe()),
         (_attention_int8_dynamic_recipe()),
         (_feedforward_int8_dynamic_recipe()),
     ]
```
