
Commit 744d1f2

committed
Add part 2 of end-to-end tutorial: fine-tuning
This commit adds the QAT tutorial and the general structure for the fine-tuning tutorial, which will also include QLoRA and float8 quantized fine-tuning. It also connects the 3 tutorial parts (pre-training, fine-tuning, and serving) into one cohesive end-to-end flow with some visuals and text.
1 parent 63a91d7 commit 744d1f2

File tree

8 files changed, +334 -18 lines changed


docs/source/finetuning.rst

Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
(Part 2) Fine-tuning with QAT, QLoRA, and float8
------------------------------------------------

TorchAO provides an end-to-end pre-training, fine-tuning, and serving
model optimization flow by leveraging our quantization and sparsity
techniques integrated into our partner frameworks. This is part 2 of 3
such tutorials showcasing this end-to-end flow, focusing on the
fine-tuning step.

.. image:: ../static/e2e_flow_part2.png

Fine-tuning is an important step for adapting your pre-trained model
to more domain-specific data. In this tutorial, we demonstrate 3 model
optimization techniques that can be applied to your model during fine-tuning:

1. **Quantization-Aware Training (QAT)**, for adapting your model to
   quantization numerics during fine-tuning, with the goal of mitigating
   quantization degradations in your fine-tuned model when it is quantized
   eventually, e.g. in the serving step.

2. **Quantized Low-Rank Adaptation (QLoRA)**, for reducing the resource
   requirement of fine-tuning by introducing small, trainable low-rank
   matrices and freezing the original pre-trained checkpoint, a type of
   Parameter-Efficient Fine-Tuning (PEFT). See the conceptual sketch after
   this list for the low-rank idea.

3. **Float8 Quantized Fine-tuning**, for speeding up fine-tuning by
   dynamically quantizing high precision weights and activations to float8,
   similar to `pre-training in float8 <pretraining.html>`__.

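The low-rank idea behind (Q)LoRA in (2) above can be sketched in a few lines.
The snippet below is illustrative only and is not TorchAO's QLoRA implementation;
in particular, it omits the 4-bit quantization of the frozen base weight that
puts the "Q" in QLoRA:

.. code:: py

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base linear layer plus a small trainable low-rank update."""

        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)  # freeze the pre-trained weight
            self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scaling = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # frozen base projection + small trainable low-rank update;
            # in QLoRA, self.base.weight would additionally be stored in a
            # quantized format (e.g. 4-bit), and only lora_a/lora_b are trained
            return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
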
Quantization-Aware Training (QAT)
##################################

The goal of Quantization-Aware Training is to adapt the model to
quantization numerics during training or fine-tuning, so as to mitigate
the inevitable quantization degradation when the model is eventually
quantized, typically during the serving step after fine-tuning.
TorchAO's QAT support has been used successfully for the recent release of
the `Llama-3.2 quantized 1B/3B <https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/>`__
and the `LlamaGuard-3-8B <https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard3/8B/MODEL_CARD.md>`__ models to improve the quality of the quantized models.

TorchAO's QAT support involves two separate steps: prepare and convert.
The prepare step "fake" quantizes activations and/or weights during
training, which means the high precision values (e.g. bf16) are mapped
to their corresponding quantized values *without* actually casting them
to the target lower precision dtype (e.g. int4). The convert step,
applied after training, replaces the "fake" quantization operations in the
model with "real" quantization that does perform the dtype casting:

.. image:: ../../torchao/quantization/qat/images/qat_diagram.png

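To make "fake" quantization concrete, here is a minimal sketch (illustrative
only, not TorchAO's internal implementation) of a symmetric int4
quantize-dequantize round trip that stays entirely in float precision:

.. code:: py

    import torch

    w = torch.randn(8, 8)                             # high precision weight (e.g. fp32/bf16)
    scale = w.abs().amax() / 7                        # map max magnitude into int4's [-8, 7]
    w_q = torch.clamp(torch.round(w / scale), -8, 7)  # "quantize" (values stay in float dtype)
    w_fq = w_q * scale                                # "dequantize" back to the original scale

    print(w_fq.dtype)             # still torch.float32: nothing was cast to int4
    print(w_fq.unique().numel())  # but at most 16 distinct values remain per tensor
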
There are multiple options for using TorchAO's QAT for fine-tuning:

1. Directly use our QAT APIs with your own training loop
2. Use our integration with `TorchTune <https://github.com/pytorch/torchtune>`__
3. Use our integration with `Axolotl <https://github.com/axolotl-ai-cloud/axolotl>`__


Option 1: TorchAO QAT API
=========================

First, set up the model for fine-tuning on a single GPU:

.. code:: py

    import torch
    from torchtune.models.llama3 import llama3

    # Set up smaller version of llama3 to fit in a single GPU
    def get_model():
        return llama3(
            vocab_size=4096,
            num_layers=16,
            num_heads=16,
            num_kv_heads=4,
            embed_dim=2048,
            max_seq_len=2048,
        ).cuda()

    # Example training loop
    def train_loop(m: torch.nn.Module):
        optimizer = torch.optim.SGD(m.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-5)
        loss_fn = torch.nn.CrossEntropyLoss()
        for i in range(10):
            example = torch.randint(0, 4096, (2, 16)).cuda()
            target = torch.randn((2, 16, 4096)).cuda()
            output = m(example)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

Next, run the prepare step, which fake quantizes the model. In this example,
we use int8 per token dynamic activations and int4 symmetric per group weights
as our quantization scheme. Note that although we are targeting lower integer
precisions, training still performs arithmetic in higher float precision (float32)
because we are not actually casting the fake quantized values.

.. code:: py

    from torchao.quantization import (
        quantize_,
    )
    from torchao.quantization.qat import (
        FakeQuantizeConfig,
        IntXQuantizationAwareTrainingConfig,
    )
    model = get_model()

    # prepare: insert fake quantization ops
    # swaps `torch.nn.Linear` with `FakeQuantizedLinear`
    activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = IntXQuantizationAwareTrainingConfig(activation_config, weight_config)
    quantize_(model, qat_config)

    # fine-tune
    train_loop(model)

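If you want to verify that the prepare step swapped the linear layers, a quick
check along these lines can help (a sketch; it assumes ``FakeQuantizedLinear``
is importable from ``torchao.quantization.qat.linear``):

.. code:: py

    from torchao.quantization.qat.linear import FakeQuantizedLinear

    num_fake_quantized = sum(
        1 for m in model.modules() if isinstance(m, FakeQuantizedLinear)
    )
    print(f"{num_fake_quantized} linear layers are fake quantized")
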
After fine-tuning, we end up with a model in the original high precision.
This fine-tuned model has the exact same structure as the original model.
The only difference is the QAT fine-tuned model has weights that are more
attuned to quantization, which will be beneficial later during inference.
The next step is to actually quantize the model:

.. code:: py

    from torchao.quantization import (
        Int8DynamicActivationInt4WeightConfig,
    )
    from torchao.quantization.qat import (
        FromIntXQuantizationAwareTrainingConfig,
    )

    # convert: transform fake quantization ops into actual quantized ops
    # swaps `FakeQuantizedLinear` back to `torch.nn.Linear` and inserts
    # quantized activation and weight tensor subclasses
    quantize_(model, FromIntXQuantizationAwareTrainingConfig())
    quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=32))

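Before handing the model off to the serving step, you can optionally run a quick
smoke test (a sketch, not part of the official flow): after convert, the linear
weights are quantized tensor subclasses and the model runs inference in eager mode.

.. code:: py

    first_linear = next(m for m in model.modules() if isinstance(m, torch.nn.Linear))
    print(type(first_linear.weight))  # expect a torchao quantized tensor subclass

    sample = torch.randint(0, 4096, (2, 16)).cuda()
    with torch.no_grad():
        logits = model(sample)
    print(logits.shape)               # (2, 16, 4096) for the toy model above
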
Now our model is ready for serving, and will typically have higher quantized
accuracy than if we did not apply the prepare step (fake quantization) during
fine-tuning. For example, when fine-tuning Llama-3.2-3B on the
`OpenAssistant Conversations (OASST1) <https://huggingface.co/datasets/OpenAssistant/oasst1>`__
dataset, we find that the quantized model achieved 3.4% higher accuracy
with QAT than without, recovering 69.8% of the overall accuracy degradation
from quantization:

.. image:: ../static/qat_eval.png

For full details of using TorchAO's QAT API, please refer to the `QAT README <https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md>`__.

.. raw:: html

   <details>
   <summary><a>Alternative Legacy API</a></summary>

The above ``quantize_`` API is the recommended flow for using TorchAO QAT.
We also offer an alternative, legacy "quantizer" API for specific quantization
schemes. Unlike the config-based example above, these quantizers are not customizable.

.. code::

    from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer
    qat_quantizer = Int8DynActInt4WeightQATQuantizer(group_size=32)

    # prepare: insert fake quantization ops
    # swaps `torch.nn.Linear` with `Int8DynActInt4WeightQATLinear`
    model = qat_quantizer.prepare(model)

    # train
    train_loop(model)

    # convert: transform fake quantization ops into actual quantized ops
    # swaps `Int8DynActInt4WeightQATLinear` with `Int8DynActInt4WeightLinear`
    model = qat_quantizer.convert(model)

.. raw:: html

   </details>

Option 2: TorchTune QAT Integration
===================================

TorchAO's QAT support is integrated into TorchTune's distributed fine-tuning recipe.
Instead of the following command, which applies full distributed fine-tuning without QAT:

.. code::

    tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config llama3_2/3B_full \
        epochs=1 \
        batch_size=16 \
        dataset._component_=torchtune.datasets.alpaca_cleaned_dataset

Users can run the following command instead to fine-tune with QAT. Note that specifying
the quantizer is optional:

.. code::

    tune run --nnodes 1 --nproc_per_node 4 qat_distributed --config llama3_2/3B_qat_full \
        epochs=1 \
        batch_size=16 \
        dataset._component_=torchtune.datasets.alpaca_cleaned_dataset \
        quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQATQuantizer \
        quantizer.groupsize=32

After fine-tuning, users can quantize and evaluate the resulting model as follows.
This is the same whether or not QAT was used during the fine-tuning process:

.. code::

    tune run quantize --config quantization \
        model._component_=torchtune.models.llama3_2.llama3_2_3b \
        checkpointer._component_=torchtune.training.FullModelHFCheckpointer \
        'checkpointer.checkpoint_files=[model-00001-of-00002.safetensors,model-00002-of-00002.safetensors]' \
        checkpointer.model_type=LLAMA3 \
        quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer \
        quantizer.groupsize=32

    tune run eleuther_eval --config eleuther_evaluation \
        batch_size=1 \
        'tasks=[hellaswag, wikitext]' \
        model._component_=torchtune.models.llama3_2.llama3_2_3b \
        checkpointer._component_=torchtune.training.FullModelTorchTuneCheckpointer \
        'checkpointer.checkpoint_files=[model-00001-of-00002-8da4w.ckpt]' \
        checkpointer.model_type=LLAMA3 \
        tokenizer._component_=torchtune.models.llama3.llama3_tokenizer \
        tokenizer.path=/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model \
        quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer \
        quantizer.groupsize=32

The evaluation step should print something like the following:

.. code::

    | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
    |---------|------:|------|------|--------|---|-----:|---|-----:|
    |hellaswag| 1|none |None |acc |↑ |0.5021|± |0.0050|
    | | |none |None |acc_norm|↑ |0.6797|± |0.0047|

    | Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
    |--------|------:|------|------|---------------|---|------:|---|------|
    |wikitext| 2|none |None |bits_per_byte |↓ | 0.6965|± | N/A|
    | | |none |None |byte_perplexity|↓ | 1.6206|± | N/A|
    | | |none |None |word_perplexity|↓ |13.2199|± | N/A|

You can compare these values with and without QAT to see how much QAT helped mitigate quantization degradation!

In addition to vanilla QAT as in the above example, TorchAO's QAT can also be composed with LoRA to yield a `1.89x training speedup <https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700>`__. This is implemented in TorchTune's `QAT + LoRA fine-tuning recipe <https://github.com/pytorch/torchtune/blob/main/recipes/qat_lora_finetune_distributed.py>`__, which can be run using the following command:

.. code::

    tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3_2/3B_qat_lora \
        epochs=1 \
        batch_size=16 \
        dataset._component_=torchtune.datasets.alpaca_cleaned_dataset \
        quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQATQuantizer \
        quantizer.groupsize=32

For more details about how QAT is set up in TorchTune, please refer to `this tutorial <https://docs.pytorch.org/torchtune/main/tutorials/qat_finetune.html>`__.

Option 3: Axolotl QAT Integration
=================================

Axolotl also recently added a QAT fine-tuning recipe that leverages TorchAO's QAT support.
To get started, try fine-tuning Llama-3.2-3B with QAT using the following command:

.. code::

    axolotl train examples/llama-3/3b-qat-fsdp2.yaml
    # once training is complete, perform the quantization step

    axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml
    # you should now have a quantized model saved in ./outputs/qat_out/quantized

Please refer to the `Axolotl QAT documentation <https://docs.axolotl.ai/docs/qat.html>`__ for full details.

Quantized Low-Rank Adaptation (QLoRA)
#####################################

(Coming soon!)


Float8 Quantized Fine-tuning
############################

(Coming soon!)
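
The rough shape of this flow mirrors `pre-training in float8 <pretraining.html>`__.
As a minimal sketch (assuming a float8-capable GPU such as an H100, and reusing the
toy ``get_model()`` and ``train_loop()`` from the QAT example above):

.. code:: py

    import torch
    from torchao.float8 import convert_to_float8_training

    model = get_model()

    # swap `torch.nn.Linear` layers for float8 training linears that dynamically
    # cast weights and activations to float8 during the forward/backward pass
    convert_to_float8_training(model)

    # torch.compile is needed to realize the float8 speedups
    model = torch.compile(model)

    train_loop(model)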

docs/source/index.rst

Lines changed: 5 additions & 3 deletions
@@ -37,9 +37,11 @@ for an overall introduction to the library and recent highlight and updates.
     :maxdepth: 1
     :caption: Tutorials
 
+    pretraining
+    finetuning
+    serving
+    torchao_vllm_integration
     serialization
+    static_quantization
     subclass_basic
     subclass_advanced
-    static_quantization
-    pretraining
-    torchao_vllm_integration

docs/source/pretraining.rst

Lines changed: 28 additions & 15 deletions
@@ -1,21 +1,29 @@
-Pretraining with float8
+(Part 1) Pre-training with float8
 ---------------------------------
 
-Pretraining with float8 using torchao can provide `up to 1.5x speedups <https://pytorch.org/blog/training-using-float8-fsdp2/>`__ on 512 GPU clusters,
+TorchAO provides an end-to-end pre-training, fine-tuning, and serving
+model optimization flow by leveraging our quantization and sparsity
+techniques integrated into our partner frameworks. This is part 1 of 3
+such tutorials showcasing this end-to-end flow, focusing on the
+pre-training step.
+
+.. image:: ../static/e2e_flow_part1.png
+
+Pre-training with float8 using torchao can provide `up to 1.5x speedups <https://pytorch.org/blog/training-using-float8-fsdp2/>`__ on 512 GPU clusters,
 and up to `1.34-1.43x speedups <https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/>`__ on 2K H200 clusters with the latest `torchao.float8` rowwise recipe.
 
-In this tutorial, we will show 2 ways to use the **torchao.float8** recipes for pretraining:
+In this tutorial, we will show 2 ways to use the **torchao.float8** recipes for pre-training:
 
-1. :ref:`Pretraining with torchtitan`, the offical PyTorch pretraining framework with native torchao integration.
-2. :ref:`Pretraining with torchao directly`, to integrate torchao's float8 training recipes into your own pretraining code.
+1. :ref:`Pre-training with torchtitan`, the official PyTorch pre-training framework with native torchao integration.
+2. :ref:`Pre-training with torchao directly`, to integrate torchao's float8 training recipes into your own pre-training code.
 
 
-Pretraining with torchtitan
+Pre-training with torchtitan
 ###########################
 
-In this tutorial we'll pretrain Llama3 8b using torchtitan with torchao's float8 training recipes: rowwise scaling and tensorwise scaling.
+In this tutorial we'll pre-train Llama3-8B using torchtitan with torchao's float8 training recipes: rowwise scaling and tensorwise scaling.
 
-`Torchtitan <https://github.com/pytorch/torchtitan/>`__ is PyTorch's official pretraining framework that is natively integrated with torchao, and supports
+`Torchtitan <https://github.com/pytorch/torchtitan/>`__ is PyTorch's official pre-training framework that is natively integrated with torchao, and supports
 several popular flagship models with common forms of parallelism, float8 training, distributed checkpointing and more.
 See the torchtitan `docs <https://github.com/pytorch/torchtitan>`__ for additional details.
 
@@ -29,12 +37,12 @@ Prerequisites
 2. `Install torchao <https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation>`__.
 3. `Install torchtitan <https://github.com/pytorch/torchtitan/tree/main?tab=readme-ov-file#installation>`__, including the "downloading a tokenizer" step.
 
-You're now ready to start a pretraining job using one of the recipes below!
+You're now ready to start a pre-training job using one of the recipes below!
 
 Rowwise scaling
 ===============
 
-Run the following command from torchtitan root directory to launch a Llama3 8b training job on 8 GPUs with float8 rowwise training:
+Run the following command from the torchtitan root directory to launch a Llama3-8B training job on 8 GPUs with float8 rowwise training:
 
 .. code:: console
 
@@ -104,10 +112,10 @@ Picking a recipe
 The higher throughput of tensorwise scaling comes at the cost of slightly higher quantization error (i.e., reduced numerical integrity vs bfloat16) compared to rowwise scaling.
 This is because rowwise scaling uses a more granular scaling factor (per row, instead of per tensor), which limits the impact of outliers that can cause underflow during scaling.
 
-Below you can see the loss curves comparing bfloat16, float8 tensorwise, and float8 rowwise training for training Llama3 8b on 8xH100 GPUs:
+Below you can see the loss curves comparing bfloat16, float8 tensorwise, and float8 rowwise training for training Llama3-8B on 8xH100 GPUs:
 
 .. image:: ../static/fp8-loss-curves.png
-   :alt: Loss curves for training Llama3 8b on 8xH100s with torchtitan using bfloat16, float8 tensorwise, and float8 rowwise training.
+   :alt: Loss curves for training Llama3-8B on 8xH100s with torchtitan using bfloat16, float8 tensorwise, and float8 rowwise training.
 
 
 Important notes
@@ -117,12 +125,12 @@ Important notes
 * You must use :code:`--training.compile` to achieve high performance. torchao float8 training recipes are built natively on top of :code:`torch.compile`, so it will work out of the box!
 
 
-Pretraining with torchao directly
+Pre-training with torchao directly
 #################################
 
-In this tutorial we'll pretrain a toy model using torchao APIs directly.
+In this tutorial we'll pre-train a toy model using torchao APIs directly.
 
-You can use this workflow to integrate torchao into your own custom pretraining code directly.
+You can use this workflow to integrate torchao into your own custom pre-training code directly.
 
 Prerequisites
 ================
@@ -200,3 +208,8 @@ Below is a code snippet showing how to use it:
         'model_state_dict': m.state_dict(),
         'optimizer_state_dict': optimizer.state_dict(),
     }, 'checkpoint.pth')
+
+
+After pre-training your model, you can optionally fine-tune it on more domain-specific datasets
+and adapt it for eventual quantization during serving. In the `next part <finetuning.html>`__ of
+this tutorial, we will explore a few model optimization options during the fine-tuning step.
