(Part 2) Fine-tuning with QAT, QLoRA, and float8
------------------------------------------------

TorchAO provides an end-to-end pre-training, fine-tuning, and serving
model optimization flow by leveraging our quantization and sparsity
techniques integrated into our partner frameworks. This is the second of
three tutorials showcasing this end-to-end flow, and it focuses on the
fine-tuning step.

.. image:: ../static/e2e_flow_part2.png

Fine-tuning is an important step for adapting your pre-trained model
to more domain-specific data. In this tutorial, we demonstrate three model
optimization techniques that can be applied to your model during fine-tuning:

1. **Quantization-Aware Training (QAT)**, for adapting your model to
quantization numerics during fine-tuning, with the goal of mitigating
quantization degradation in your fine-tuned model when it is eventually
quantized, e.g. in the serving step. Check out `our blog <https://pytorch.org/blog/quantization-aware-training/>`__
and `README <https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md>`__ for more details!

2. **Quantized Low-Rank Adaptation (QLoRA)**, for reducing the resource
requirements of fine-tuning by introducing small, trainable low-rank
matrices and freezing the original pre-trained checkpoint, a type of
Parameter-Efficient Fine-Tuning (PEFT). Please refer to the `original
paper <https://arxiv.org/pdf/2305.14314>`__ for more details, and see the
sketch after this list for the core idea.

3. **Float8 Quantized Fine-tuning**, for speeding up fine-tuning by
dynamically quantizing high precision weights and activations to float8,
similar to `pre-training in float8 <pretraining.html>`__.
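
To make the low-rank adaptation idea concrete, below is a minimal sketch of a
LoRA linear layer. This is illustrative only: `LoRALinear` is a hypothetical
helper, not TorchAO's or TorchTune's implementation, and in QLoRA the frozen
base weight would additionally be stored in a quantized format such as NF4.

.. code:: py

    import torch

    class LoRALinear(torch.nn.Module):
        """Minimal LoRA sketch: y = base(x) + (alpha / r) * B(A(x)).

        The pre-trained base weight is frozen; only the small low-rank
        factors A and B are trained.
        """
        def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)               # freeze the pre-trained weight
            self.lora_a = torch.nn.Linear(base.in_features, r, bias=False)
            self.lora_b = torch.nn.Linear(r, base.out_features, bias=False)
            torch.nn.init.zeros_(self.lora_b.weight)  # start as a no-op update
            self.scaling = alpha / r

        def forward(self, x):
            return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

    layer = LoRALinear(torch.nn.Linear(2048, 2048))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"trainable params: {trainable}")  # ~33K instead of ~4.2M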


Quantization-Aware Training (QAT)
##################################

The goal of Quantization-Aware Training is to adapt the model to
quantization numerics during training or fine-tuning, so as to mitigate
the quantization degradation that occurs when the model is eventually
quantized, typically in the serving step after fine-tuning. TorchAO's QAT
support has been used successfully in the recent releases of the
`Llama-3.2 quantized 1B/3B <https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/>`__
and `LlamaGuard-3-8B <https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard3/8B/MODEL_CARD.md>`__ models to improve the quality of the quantized models.

TorchAO's QAT support involves two separate steps: prepare and convert.
The prepare step "fake" quantizes activations and/or weights during
training, meaning the high precision values (e.g. bf16) are mapped
to their corresponding quantized values *without* actually being cast
to the target lower precision dtype (e.g. int4). The convert step,
applied after training, replaces "fake" quantization operations in the
model with "real" quantization that does perform the dtype casting:

.. image:: ../../torchao/quantization/qat/images/qat_diagram.png
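
To make the "fake" part concrete, here is a minimal sketch of symmetric
per-group quantize-dequantize in plain PyTorch. This is illustrative only:
`fake_quantize_symmetric` is a hypothetical helper, not TorchAO's
implementation, which also defines a backward pass so gradients can flow
through the rounding.

.. code:: py

    import torch

    def fake_quantize_symmetric(w: torch.Tensor, group_size: int = 32, bits: int = 4) -> torch.Tensor:
        # Split the last dimension into groups and compute one scale per group
        orig_shape = w.shape
        wg = w.reshape(-1, group_size).float()
        qmax = 2 ** (bits - 1) - 1                              # 7 for int4
        scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / qmax
        # Quantize then immediately dequantize: values are snapped to the
        # int4 grid, but the tensor stays in a floating point dtype
        q = torch.clamp(torch.round(wg / scale), -qmax - 1, qmax)
        return (q * scale).reshape(orig_shape).to(w.dtype)

    w = torch.randn(128, 256, dtype=torch.bfloat16)
    w_fq = fake_quantize_symmetric(w)
    print(w_fq.dtype, (w - w_fq).abs().mean())  # still bf16, small rounding error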

There are multiple options for using TorchAO's QAT for fine-tuning:

1. Use our integration with `TorchTune <https://github.com/pytorch/torchtune>`__
2. Use our integration with `Axolotl <https://github.com/axolotl-ai-cloud/axolotl>`__
3. Directly use our QAT APIs with your own training loop


Option 1: TorchTune QAT Integration
===================================

TorchAO's QAT support is integrated into TorchTune's distributed fine-tuning recipe.
Regular full distributed fine-tuning without QAT uses the following command:

.. code::

   # Regular fine-tuning without QAT
   tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config llama3_2/3B_full batch_size=16

To fine-tune with QAT instead, run the following equivalent command. Note that
specifying the quantizer is optional:

.. code::

   # Fine-tuning with QAT, by default:
   # activations are fake quantized to asymmetric per token int8
   # weights are fake quantized to symmetric per group int4
   # configurable through "quantizer._component_" in the command
   tune run --nnodes 1 --nproc_per_node 4 qat_distributed --config llama3_2/3B_qat_full batch_size=16

After fine-tuning, users can quantize and evaluate the resulting model as follows.
This is the same whether or not QAT was used during the fine-tuning process:

.. code::

   # Quantize model weights to int4
   tune run quantize --config quantization \
   model._component_=torchtune.models.llama3_2.llama3_2_3b \
   checkpointer._component_=torchtune.training.FullModelHFCheckpointer \
   'checkpointer.checkpoint_files=[model-00001-of-00002.safetensors,model-00002-of-00002.safetensors]' \
   checkpointer.model_type=LLAMA3 \
   quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer \
   quantizer.groupsize=32

   # Evaluate the int4 model on hellaswag and wikitext
   tune run eleuther_eval --config eleuther_evaluation \
   batch_size=1 \
   'tasks=[hellaswag, wikitext]' \
   model._component_=torchtune.models.llama3_2.llama3_2_3b \
   checkpointer._component_=torchtune.training.FullModelTorchTuneCheckpointer \
   'checkpointer.checkpoint_files=[model-00001-of-00002-8da4w.ckpt]' \
   checkpointer.model_type=LLAMA3 \
   tokenizer._component_=torchtune.models.llama3.llama3_tokenizer \
   tokenizer.path=/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model \
   quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer \
   quantizer.groupsize=32

This should print the following after fine-tuning:

.. code::

   |  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
   |---------|------:|------|------|--------|---|-----:|---|-----:|
   |hellaswag|      1|none  |None  |acc     |↑  |0.5021|±  |0.0050|
   |         |       |none  |None  |acc_norm|↑  |0.6797|±  |0.0047|

   | Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
   |--------|------:|------|------|---------------|---|------:|---|------|
   |wikitext|      2|none  |None  |bits_per_byte  |↓  | 0.6965|±  |   N/A|
   |        |       |none  |None  |byte_perplexity|↓  | 1.6206|±  |   N/A|
   |        |       |none  |None  |word_perplexity|↓  |13.2199|±  |   N/A|

You can compare these values with and without QAT to see how much QAT helped
mitigate quantization degradation! For example, when fine-tuning Llama-3.2-3B on the
`OpenAssistant Conversations (OASST1) <https://huggingface.co/datasets/OpenAssistant/oasst1>`__
dataset, we find that the quantized model achieved 3.4% higher accuracy
with QAT than without, recovering 69.8% of the overall accuracy degradation
from quantization:

.. image:: ../static/qat_eval.png

In addition to vanilla QAT as in the above example, TorchAO's QAT can also be
composed with LoRA to yield a `1.89x training speedup <https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700>`__.
This is implemented in TorchTune's `QAT + LoRA fine-tuning recipe <https://github.com/pytorch/torchtune/blob/main/recipes/qat_lora_finetune_distributed.py>`__,
which can be run using the following command:

.. code::

   # Fine-tuning with QAT + LoRA
   tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3_2/3B_qat_lora batch_size=16

For more details about how QAT is set up in TorchTune, please refer to `this tutorial <https://docs.pytorch.org/torchtune/main/tutorials/qat_finetune.html>`__.


Option 2: Axolotl QAT Integration
=================================

Axolotl also recently added a QAT fine-tuning recipe that leverages TorchAO's QAT support.
To get started, try fine-tuning Llama-3.2-3B with QAT using the following command:

.. code::

   # fine-tune with QAT
   axolotl train examples/llama-3/3b-qat-fsdp2.yaml

   # once training is complete, perform the quantization step
   axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml
   # you should now have a quantized model saved in ./outputs/qat_out/quantized

Please refer to the `Axolotl QAT documentation <https://docs.axolotl.ai/docs/qat.html>`__ for full details.


Option 3: TorchAO QAT API
=========================

If you prefer to use a different training framework or your own custom training loop,
you can call TorchAO's QAT APIs directly to transform the model before fine-tuning.
These APIs are what the TorchTune and Axolotl QAT integrations call under the hood.

In this example, we will fine-tune a mini version of Llama3 on a single GPU:

.. code:: py

    import torch
    from torchtune.models.llama3 import llama3

    # Set up a smaller version of llama3 to fit in a single A100 GPU
    # For smaller GPUs, adjust the model attributes accordingly
    def get_model():
        return llama3(
            vocab_size=4096,
            num_layers=16,
            num_heads=16,
            num_kv_heads=4,
            embed_dim=2048,
            max_seq_len=2048,
        ).cuda()

    # Example training loop
    def train_loop(m: torch.nn.Module):
        optimizer = torch.optim.SGD(m.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-5)
        loss_fn = torch.nn.CrossEntropyLoss()
        for i in range(10):
            example = torch.randint(0, 4096, (2, 16)).cuda()
            target = torch.randint(0, 4096, (2, 16)).cuda()
            output = m(example)
            # CrossEntropyLoss expects (N, num_classes) logits and (N,) class
            # indices, so flatten the batch and sequence dimensions
            loss = loss_fn(output.reshape(-1, 4096), target.reshape(-1))
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

Next, run the prepare step, which fake quantizes the model. In this example,
we use int8 per token dynamic activations and int4 symmetric per group weights
as our quantization scheme. Note that although we are targeting lower integer
precisions, training still performs arithmetic in higher float precision (float32)
because we are not actually casting the fake quantized values.

.. code:: py

    from torchao.quantization import (
        quantize_,
    )
    from torchao.quantization.qat import (
        FakeQuantizeConfig,
        IntXQuantizationAwareTrainingConfig,
    )
    model = get_model()

    # prepare: insert fake quantization ops
    # swaps `torch.nn.Linear` with `FakeQuantizedLinear`
    activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = IntXQuantizationAwareTrainingConfig(activation_config, weight_config)
    quantize_(model, qat_config)

    # fine-tune
    train_loop(model)
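
If you want to confirm that the prepare step took effect, a quick optional
check is to look for the swapped modules. This is a sketch that relies only on
the `FakeQuantizedLinear` class name mentioned in the comment above:

.. code:: py

    # Optional sanity check: the prepare step should have swapped every
    # `torch.nn.Linear` for a fake quantized equivalent
    swapped = [name for name, m in model.named_modules()
               if type(m).__name__ == "FakeQuantizedLinear"]
    print(f"{len(swapped)} fake quantized linear layers")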

After fine-tuning, we end up with a model in the original high precision.
This fine-tuned model has the exact same structure as the original model.
The only difference is that the QAT fine-tuned model has weights that are more
attuned to quantization, which will be beneficial later during inference.
The next step is to actually quantize the model:

.. code:: py

    from torchao.quantization import (
        Int8DynamicActivationInt4WeightConfig,
    )
    from torchao.quantization.qat import (
        FromIntXQuantizationAwareTrainingConfig,
    )

    # convert: transform fake quantization ops into actual quantized ops
    # swaps `FakeQuantizedLinear` back to `torch.nn.Linear` and inserts
    # quantized activation and weight tensor subclasses
    quantize_(model, FromIntXQuantizationAwareTrainingConfig())
    quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=32))

Now our model is ready for serving, and will typically have higher quantized
accuracy than if we did not apply the prepare step (fake quantization) during
fine-tuning.
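
As a quick sanity check that the converted model runs end to end, you can do
something like the following (a sketch; the exact quantized tensor subclass
you see printed may differ across TorchAO versions):

.. code:: py

    # The linears are `torch.nn.Linear` again, but their weights are now
    # quantized tensor subclasses, and inference runs end to end
    model.eval()
    with torch.no_grad():
        example = torch.randint(0, 4096, (2, 16)).cuda()
        logits = model(example)
    linear = next(m for m in model.modules() if isinstance(m, torch.nn.Linear))
    print(type(linear.weight))  # a TorchAO quantized tensor subclass
    print(logits.shape)         # torch.Size([2, 16, 4096])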

For full details of using TorchAO's QAT API, please refer to the `QAT README <https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md>`__.

.. raw:: html

   <details>
   <summary><a>Alternative Legacy API</a></summary>

The above `quantize_` API is the recommended flow for using TorchAO QAT.
We also offer an alternative legacy "quantizer" API for specific quantization
schemes, but it is not customizable, unlike the example above.

.. code:: py

    from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer
    qat_quantizer = Int8DynActInt4WeightQATQuantizer(group_size=32)

    # prepare: insert fake quantization ops
    # swaps `torch.nn.Linear` with `Int8DynActInt4WeightQATLinear`
    model = qat_quantizer.prepare(model)

    # train
    train_loop(model)

    # convert: transform fake quantization ops into actual quantized ops
    # swaps `Int8DynActInt4WeightQATLinear` with `Int8DynActInt4WeightLinear`
    model = qat_quantizer.convert(model)

.. raw:: html

   </details>


Quantized Low-Rank Adaptation (QLoRA)
#####################################

(Coming soon!)


Float8 Quantized Fine-tuning
############################

(Coming soon!)