Description
Summary
Last year, we released pytorch-labs/torchao to provide acceleration of Generative AI models using native PyTorch techniques. Torchao added support for running quantization on GPUs, including int8 dynamic quantization (W8A8) and weight-only quantization (int8 and int4) that were composable with torch.compile. Combined, the APIs launched in torchao were able to power SOTA generative AI models across multiple modalities: Segment Anything, Stable Diffusion, and LLaMa.
The results were showcased in these blog posts:
- https://pytorch.org/blog/accelerating-generative-ai/
- https://pytorch.org/blog/accelerating-generative-ai-2/
- https://pytorch.org/blog/accelerating-generative-ai-3/
Our investment in torchao is aimed at accelerating Generative AI using native PyTorch features, while ensuring composability with torch.compile.
In 2024, we plan to adopt the following strategy for the development of torchao:
- We will launch torchao with the most important quantization techniques for LLMs and other GenAI models via a simple UX. Examples - GPTQ, AWQ, int8 dynamic quant.
- We will stay on top of SOTA kernels within these spaces through to PTC and commit cpu/gpu kernels ourselves as necessary. Torchao will host a limited set of performant kernels for server (cpu/gpu) and executorch, with a clear recommendation on how to integrate and run inference on these backends.
- Torchao will host non-standard dtypes, implemented via tensor subclasses. Examples - nf4, any4, mx4
- Following the PyTorch design principle, the offerings of torchao will be usable and simple, including setup, dependencies, API surfaces.
- We will actively engage with the community: researchers, to contribute new quantization techniques in native PyTorch code, and developers, to author performant kernels for these techniques in torchao for different backends. An example would be upstreaming the kernels built by the CUDA_MODE community into torchao.
- As the code matures and based on community demand, we will upstream techniques/kernels into PyTorch Core.
Let’s dive deeper into some of the coverage areas mentioned above.
Emerging dtypes
Dtypes like NF4, MX4, and groupwise quantized int4 are used for implementing various optimization techniques in models. Last year, we posted a plan on how we wish to support these dtypes in PyTorch. In torchao, we will host tensor-subclass-based implementations of dtypes; existing examples include uint4 and NF4. Users can use these for their own quantization techniques, or override the implementations to support other dtypes that might be useful.
Moreover, users don’t need to write Triton or CUDA kernels for their custom dtypes. The implementation can be written in Python, and torch.compile will take care of generating performant kernels under the hood.
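As an illustration, here is a minimal sketch of what a tensor-subclass-based dtype can look like. The class name and the simple per-row int8 scheme are hypothetical placeholders, not the torchao implementation:

```python
import torch
from torch.utils._pytree import tree_map

class Int8WeightOnlyTensor(torch.Tensor):
    """Hypothetical weight-only int8 dtype implemented as a tensor subclass."""

    @staticmethod
    def __new__(cls, int_data, scale):
        # The wrapper subclass advertises the logical (dequantized) shape/dtype.
        return torch.Tensor._make_wrapper_subclass(
            cls, int_data.shape, dtype=scale.dtype, device=int_data.device
        )

    def __init__(self, int_data, scale):
        self.int_data = int_data  # int8 payload
        self.scale = scale        # per-row scales

    @classmethod
    def from_float(cls, w):
        scale = w.abs().amax(dim=-1, keepdim=True) / 127.0
        int_data = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        return cls(int_data, scale)

    def dequantize(self):
        return self.int_data.to(self.scale.dtype) * self.scale

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Fallback used here for brevity: dequantize subclass args and run the
        # original op. A real dtype would pattern-match ops like aten.mm, and
        # torch.compile can fuse this Python logic into generated kernels.
        def unwrap(t):
            return t.dequantize() if isinstance(t, cls) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
```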
Quantization techniques
Quantization can be applied to weights only, or to weights and activations. LLM inference at batch size 1 is memory-bandwidth bound, so it typically uses weight-only quantization. For larger batch sizes, longer context lengths, or throughput-bound models in general, quantizing the activations is also beneficial. Quantization, however, impacts model accuracy, and researchers have published techniques to mitigate this accuracy impact, which currently exist externally as one repository per technique.
In torchao, we plan to support the following classes of techniques using PyTorch, made available via a simple UX and following the one-file-per-technique principle.
LLM weight only quantization techniques
Post training quantization
The two most popular techniques externally are GPTQ and AWQ, available via AutoGPTQ and AutoAWQ, which include the techniques as well as performant kernels for the quantized ops.
To that end, we will start by re-implementing the GPTQ and AWQ techniques in torchao using PyTorch, via a simple and intuitive UX that supports saving/loading of quantized models while realizing the memory savings on disk (a sketch of what this could look like follows the list of questions below). Some open questions we need to address here include:
- How much VRAM will be required for different quantization techniques?
- How do we convert to and from weights quantized for different backends (CPU and GPU today use different weight packing formats)?
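As a rough illustration of the intended UX, the flow could look like the following. The `quantize_gptq` function and its arguments are hypothetical placeholders, not a committed API:

```python
import torch

def quantize_gptq(model, calibration_data, bits=4, groupsize=128):
    """Placeholder for a GPTQ re-implementation: would swap each Linear weight
    for packed int4 data plus per-group scales/zero-points (not shown here)."""
    return model

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU())
calibration_data = [torch.randn(1, 4096) for _ in range(8)]

quantized = quantize_gptq(model, calibration_data, bits=4, groupsize=128)

# Saving should realize the memory savings on disk: the state_dict would hold
# packed int4 weights and quantization parameters rather than fp32/fp16 tensors.
torch.save(quantized.state_dict(), "model_int4.pt")

# Loading restores the quantized weights without re-running quantization.
quantized.load_state_dict(torch.load("model_int4.pt"))
```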
In the future, as more interesting and cutting-edge techniques are introduced, researchers can directly implement them in torchao, or our team can re-implement them in PyTorch.
Weight and activation quantization techniques
Post training quantization
We’ve already implemented W8A8 quantization via the int_mm kernel in core. This has shown speedups on models like SAM and SDXL without any impact on model accuracy, and it can be turned on via a simple one-line UX implemented via module swap or tensor subclass.
However, the challenge here is that some smaller layer shapes might not benefit from quantization due to the overhead of quantizing and dequantizing the activation tensors. Users can either statically skip quantizing these layers, or use a higher-level API that figures out which layers are sensitive to quantization. We plan to provide such a higher-level API via the auto quantizer, which applies this technique only to the layers that stand to benefit the most, so users get the benefits of quantization without having to worry too much about which configs to use.
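For reference, the core of W8A8 dynamic quantization is small. A simplified sketch, glossing over the shape/layout constraints of the int_mm kernel and the per-layer wiring:

```python
import torch

def int8_dynamic_linear(x, w_int8, w_scale):
    """Simplified W8A8 dynamic quantization for one linear layer.
    x: (tokens, in_features) float activations
    w_int8: (out_features, in_features) pre-quantized int8 weight
    w_scale: (out_features,) per-channel weight scales"""
    # Quantize activations per row (per token) at runtime.
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
    # int8 x int8 -> int32 accumulation via the int_mm kernel in core.
    acc = torch._int_mm(x_int8, w_int8.t().contiguous())
    # Rescale the accumulator back to the activation dtype.
    return acc.to(x.dtype) * x_scale * w_scale
```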
Quantization aware training
Techniques here require fine-tuning to reduce the accuracy impact of quantization. Recent research like LLM-QAT is promising, showing that we can go down to W4A8 and a 4-bit KV cache for LLMs. Moreover, newer lower-bit techniques like AQLM and QuIP# also include a fine-tuning component to improve model accuracy.
We will include the APIs and workflow to enable users to do QAT on LLMs, starting with implementing the LLM-QAT paper in torchao and further extending it to support other dtypes like MX4.
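To make the QAT mechanics concrete, a minimal fake-quantization building block with a straight-through estimator could look like the sketch below. This is illustrative only, not the LLM-QAT recipe:

```python
import torch

class FakeQuantize(torch.autograd.Function):
    """Symmetric fake-quant with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, x, scale, qmin, qmax):
        # Quantize-then-dequantize so the forward pass sees quantization error.
        q = torch.clamp(torch.round(x / scale), qmin, qmax)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round/clamp as identity so gradients flow back to x.
        return grad_output, None, None, None

def fake_quant_int4(w):
    # Per-channel symmetric 4-bit fake-quant of a weight tensor during training.
    scale = w.abs().amax(dim=-1, keepdim=True) / 7.0
    return FakeQuantize.apply(w, scale, -8, 7)
```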
Optimized kernels
Kernels
Optimized kernels are key to making models run faster during inference. Today, in core we already have performant kernels like int_mm and 4-bit weight quantization kernels for CPU (via Intel) and GPU (via tinygemm). torchao will host performant kernels that work with different backends, along with a guide on how to plug these kernels into PyTorch models via the custom ops API. These kernels will compose with torch.compile, with the expectation that the user writes a meta kernel implementation for them. For executorch, the expectation is that if a user provides a kernel that works with executorch, it should also work in eager mode.
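A sketch of the custom ops integration we have in mind is below. The op name, schema, and kernel are made up for illustration:

```python
import torch

# Register a hypothetical prebuilt int4 matmul kernel as a custom op so that it
# runs in eager mode and composes with torch.compile.
lib = torch.library.Library("torchao_sketch", "DEF")
lib.define("int4_mm(Tensor x, Tensor packed_w, Tensor scales) -> Tensor")

def int4_mm_cuda(x, packed_w, scales):
    # Call into the vendored CUDA kernel here (omitted in this sketch).
    raise NotImplementedError

def int4_mm_meta(x, packed_w, scales):
    # Shape-only "meta" implementation so torch.compile / fake tensors can
    # trace through the op without running the real kernel.
    return x.new_empty(x.shape[0], scales.shape[-1])

lib.impl("int4_mm", int4_mm_cuda, "CUDA")
lib.impl("int4_mm", int4_mm_meta, "Meta")
```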
We will also directly engage with the community to upstream their performant kernels into torchao.
Autotuner
To use any CUDA kernel efficiently, we need to pick the right kernel hyperparameters; the same is true for eager mode kernels. A kernel autotuner will help here. We expect that the auto quantizer, together with the kernel autotuner, will make int8 dynamic quantization and int8/int4 weight-only quantization more usable and performant. A WIP example of what this might look like can be found here.
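As a strawman, the autotuner could be as simple as benchmarking a small set of candidate configurations per input shape and caching the winner. The `kernel(config, *args)` interface and config space below are hypothetical:

```python
import torch.utils.benchmark as benchmark

_best_config_cache = {}

def autotune(kernel, configs, key, *args):
    """Pick the fastest config for `kernel` on this input shape and cache it.
    `configs` are assumed hashable (e.g. tuples of block sizes)."""
    if key not in _best_config_cache:
        timings = {}
        for cfg in configs:
            t = benchmark.Timer(
                stmt="kernel(cfg, *args)",
                globals={"kernel": kernel, "cfg": cfg, "args": args},
            ).blocked_autorange()
            timings[cfg] = t.median
        _best_config_cache[key] = min(timings, key=timings.get)
    return _best_config_cache[key]
```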
Release engineering
Shipping optimized, custom kernels requires extensibility mechanisms and release channels. We have custom operator support that integrates broadly, but our release mechanism might need to be optimized. It can be quite difficult to ship custom binaries across a broad range of operating systems and accelerators.
Conversion to/from popular model formats
We can add a conversion util from popular model storage formats like gguf into PyTorch’s state_dict format. This will enable users to take a pre-existing quantized model from llama.cpp and have it run via PyTorch eager mode for desktop cpu/gpu and executorch for on-device cases. We’ll share more details here soon.
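For illustration, such a conversion util could be as thin as the following. The `read_gguf_tensors` reader and the name-mapping table are hypothetical, and a real util would also need to carry over quantization metadata such as scales and block sizes:

```python
import torch

# Hypothetical mapping from gguf tensor names to PyTorch state_dict keys.
GGUF_TO_TORCH = {
    "token_embd.weight": "tok_embeddings.weight",
    "output_norm.weight": "norm.weight",
}

def gguf_to_state_dict(path, read_gguf_tensors):
    """Build a PyTorch state_dict from a gguf file, given a reader callable
    that yields (gguf_name, numpy_array) pairs."""
    state_dict = {}
    for gguf_name, array in read_gguf_tensors(path):
        torch_name = GGUF_TO_TORCH.get(gguf_name, gguf_name)
        state_dict[torch_name] = torch.from_numpy(array)
    return state_dict
```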
Pruning
In addition to quantization, we’ve seen promising results with sparsity on GPUs as well. We will share more updates on what torchao will host in the sparsity/pruning space in the near future.
We'd love to hear any feedback or questions from the OSS community on this RFC. Thank you!
cc @msaroufim @cpuhrsch @jerryzh168 @HDCharles @andrewor14 @jcaip @jisaacso