Pretraining with float8
---------------------------------

Pretraining with float8 using torchao can provide `up to 1.5x speedups <https://pytorch.org/blog/training-using-float8-fsdp2/>`__ on 512 GPU clusters,
and up to `1.34-1.43x speedups <https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/>`__ on 2K H200 clusters with the latest `torchao.float8` rowwise recipe.

In this tutorial, we will show 2 ways to use the **torchao.float8** recipes for pretraining:

1. :ref:`Pretraining with torchtitan`, the official PyTorch pretraining framework with native torchao integration.
2. :ref:`Pretraining with torchao directly`, to integrate torchao's float8 training recipes into your own pretraining code.


Pretraining with torchtitan
###########################

In this tutorial we'll pretrain Llama3 8b using torchtitan with torchao's float8 training recipes: rowwise scaling and tensorwise scaling.

`Torchtitan <https://github.com/pytorch/torchtitan/>`__ is PyTorch's official pretraining framework that is natively integrated with torchao, and supports
several popular flagship models with common forms of parallelism, float8 training, distributed checkpointing and more.
See the torchtitan `docs <https://github.com/pytorch/torchtitan>`__ for additional details.

You can use this workflow to get started quickly with a "batteries included" experience. Users commonly
fork torchtitan and build on top of it when they're ready.

Prerequisites
================

1. (Recommended) Create a new virtual environment with conda or venv.
2. `Install torchao <https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation>`__.
3. `Install torchtitan <https://github.com/pytorch/torchtitan/tree/main?tab=readme-ov-file#installation>`__, including the "downloading a tokenizer" step.

You're now ready to start a pretraining job using one of the recipes below!

Rowwise scaling
===============

Run the following command from the torchtitan root directory to launch a Llama3 8b training job on 8 GPUs with float8 rowwise training:

.. code:: console

    NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.compile --model.converters="float8" --float8.recipe_name="rowwise"

Torchtitan will automatically use FSDP2 to parallelize training when more than 1 GPU is used. To use other forms of parallelism, modify hyperparameters, or change other training configurations, you can directly edit the `llama3_8b.toml <https://github.com/pytorch/torchtitan/blob/775a486edd173ceb9be1c9b1b30af6ca2d4b4fa0/torchtitan/models/llama3/train_configs/llama3_8b.toml>`__ file or use command line flags (run the command with :code:`--help` to see more options).

You should see terminal output that looks like this:

.. code:: console

    [rank0]:[titan] 2025-06-04 08:51:48,074 - root - INFO - step: 1 loss: 12.2254 memory: 27.34GiB(28.78%) tps: 375 tflops: 21.73 mfu: 2.20%
    [rank0]:[titan] 2025-06-04 08:51:58,557 - root - INFO - step: 10 loss: 10.7069 memory: 30.99GiB(32.62%) tps: 7,034 tflops: 407.35 mfu: 41.19%
    [rank0]:[titan] 2025-06-04 08:52:10,224 - root - INFO - step: 20 loss: 8.9196 memory: 30.99GiB(32.62%) tps: 7,022 tflops: 406.65 mfu: 41.12%
    [rank0]:[titan] 2025-06-04 08:52:21,904 - root - INFO - step: 30 loss: 8.1423 memory: 30.99GiB(32.62%) tps: 7,014 tflops: 406.23 mfu: 41.08%

As you can see, ignoring the warmup steps, we are achieving ~7k TPS with 30.99GB peak memory usage. To compare performance against bfloat16 training, you can remove the :code:`--model.converters="float8" --float8.recipe_name="rowwise"` flags
and run the same command to see the baseline performance of bfloat16 training:

.. code:: console

    NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.compile

You should see the following output:

.. code:: console

    [rank0]:[titan] 2025-06-04 11:02:37,404 - root - INFO - step: 1 loss: 12.2611 memory: 27.22GiB(28.65%) tps: 595 tflops: 34.47 mfu: 3.49%
    [rank0]:[titan] 2025-06-04 11:02:49,027 - root - INFO - step: 10 loss: 10.4260 memory: 30.89GiB(32.51%) tps: 6,344 tflops: 367.39 mfu: 37.15%
    [rank0]:[titan] 2025-06-04 11:03:01,988 - root - INFO - step: 20 loss: 8.9482 memory: 30.89GiB(32.51%) tps: 6,321 tflops: 366.06 mfu: 37.01%
    [rank0]:[titan] 2025-06-04 11:03:14,991 - root - INFO - step: 30 loss: 8.1183 memory: 30.89GiB(32.51%) tps: 6,300 tflops: 364.89 mfu: 36.89%
    [rank0]:[titan] 2025-06-04 11:03:28,013 - root - INFO - step: 40 loss: 7.4659 memory: 30.89GiB(32.51%) tps: 6,291 tflops: 364.36 mfu: 36.84%
    [rank0]:[titan] 2025-06-04 11:03:39,769 - root - INFO - [GC] Peforming periodical GC collection. 0.02 seconds.

As you can see, the bfloat16 baseline achieves ~6.3k TPS using 30.89GB peak memory.

This means our float8 rowwise scaling recipe achieves **1.11x higher throughput** compared to the bfloat16 baseline, using nearly identical peak memory!

Note that you can achieve an even higher throughput improvement with the tensorwise scaling recipe, which sits at a different point on the performance vs. accuracy curve.

Tensorwise scaling
==================

Float8 training with tensorwise scaling is the default recipe, so we can omit the :code:`--float8.recipe_name` flag:

.. code:: console

    NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.compile --model.converters="float8"

You should see output like the following:

.. code:: console

    [rank0]:[titan] 2025-06-04 10:52:19,648 - root - INFO - step: 1 loss: 12.2648 memory: 27.28GiB(28.71%) tps: 557 tflops: 32.29 mfu: 3.26%
    [rank0]:[titan] 2025-06-04 10:52:29,475 - root - INFO - step: 10 loss: 10.9106 memory: 30.91GiB(32.53%) tps: 7,503 tflops: 434.53 mfu: 43.94%
    [rank0]:[titan] 2025-06-04 10:52:40,166 - root - INFO - step: 20 loss: 9.0774 memory: 30.91GiB(32.53%) tps: 7,663 tflops: 443.78 mfu: 44.87%
    [rank0]:[titan] 2025-06-04 10:52:50,885 - root - INFO - step: 30 loss: 8.3233 memory: 30.91GiB(32.53%) tps: 7,643 tflops: 442.66 mfu: 44.76%
    [rank0]:[titan] 2025-06-04 10:53:01,613 - root - INFO - step: 40 loss: 7.6150 memory: 30.91GiB(32.53%) tps: 7,637 tflops: 442.27 mfu: 44.72%

As you can see, we are achieving ~7.6k TPS using 30.91GB peak memory, which is **1.21x higher throughput** compared to the bfloat16 baseline!
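
For reference, both speedup figures quoted in this tutorial follow directly from the steady-state tokens-per-second numbers in the logs above (rounded to match the prose):

.. code:: py

    # steady-state TPS, rounded from the training logs above
    bf16_tps = 6300
    float8_rowwise_tps = 7000
    float8_tensorwise_tps = 7600

    print(f"rowwise vs bf16:    {float8_rowwise_tps / bf16_tps:.2f}x")     # ~1.11x
    print(f"tensorwise vs bf16: {float8_tensorwise_tps / bf16_tps:.2f}x")  # ~1.21x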

Picking a recipe
================

**TL;DR**: rowwise scaling is better for jobs prioritizing more accurate numerics and training stability, and tensorwise scaling is better for jobs prioritizing training throughput.

The higher throughput of tensorwise scaling comes at the cost of slightly higher quantization error (i.e., reduced numerical integrity vs bfloat16) compared to rowwise scaling.
This is because rowwise scaling uses a more granular scaling factor (per row, instead of per tensor), which limits the impact of outliers that can cause underflow during scaling.
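
To build intuition for why scaling granularity matters, here is a minimal, self-contained sketch in plain PyTorch (not torchao's actual implementation) that quantizes a tensor containing one outlier with a single per-tensor scale versus per-row scales. It only assumes the float8 e4m3 format used for training, whose maximum representable magnitude is 448:

.. code:: py

    import torch

    # maximum representable magnitude of float8 e4m3 (448.0)
    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

    # a tensor of small values with a single large outlier in row 0
    x = torch.full((4, 8), 1e-3)
    x[0, 0] = 1000.0

    # tensorwise: one scale for the whole tensor, dominated by the outlier
    tensor_scale = FP8_E4M3_MAX / x.abs().max()
    x_fp8_tensorwise = (x * tensor_scale).to(torch.float8_e4m3fn)

    # rowwise: one scale per row, so the outlier only affects its own row
    row_scales = FP8_E4M3_MAX / x.abs().amax(dim=1, keepdim=True)
    x_fp8_rowwise = (x * row_scales).to(torch.float8_e4m3fn)

    # dequantize and compare reconstruction error on the non-outlier rows:
    # the tensorwise version underflows those rows to zero, the rowwise version preserves them
    err_tensorwise = (x_fp8_tensorwise.float() / tensor_scale - x)[1:].abs().max().item()
    err_rowwise = (x_fp8_rowwise.float() / row_scales - x)[1:].abs().max().item()
    print(f"tensorwise max error on non-outlier rows: {err_tensorwise:.1e}")
    print(f"rowwise max error on non-outlier rows:    {err_rowwise:.1e}")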

Below you can see the loss curves comparing bfloat16, float8 tensorwise, and float8 rowwise training for Llama3 8b on 8xH100 GPUs:

.. image:: ../static/fp8-loss-curves.png
    :alt: Loss curves for training Llama3 8b on 8xH100s with torchtitan using bfloat16, float8 tensorwise, and float8 rowwise training.


Important notes
===============

* In torchtitan, float8 training is currently only supported on 2+ GPUs, not single-GPU training.
* You must use :code:`--training.compile` to achieve high performance. torchao's float8 training recipes are built natively on top of :code:`torch.compile`, so they work out of the box!


Pretraining with torchao directly
#################################

In this tutorial we'll pretrain a toy model using torchao APIs directly.

You can use this workflow to integrate torchao into your own custom pretraining code directly.

Prerequisites
================

1. (Recommended) Create a new virtual environment with conda or venv.
2. `Install torchao <https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation>`__.

You're now ready to integrate torchao into your training code directly!

Model conversion API
====================

The torchao API for converting your model to use float8 training is: `convert_to_float8_training <https://github.com/pytorch/ao/blob/152a8e397e1383c55bf7b87a8eaa2b87ffb2c114/torchao/float8/float8_linear_utils.py#L84>`__. This API will recursively convert :code:`nn.Linear` modules in your model to use `Float8Linear <https://github.com/pytorch/ao/blob/152a8e397e1383c55bf7b87a8eaa2b87ffb2c114/torchao/float8/float8_linear.py#L254>`__.

You can use the :code:`module_filter_fn` argument to determine which :code:`nn.Linear` layers should be swapped to use :code:`Float8Linear`.

You should refer to this `performance benchmark table <https://github.com/pytorch/ao/tree/152a8e397e1383c55bf7b87a8eaa2b87ffb2c114/torchao/float8#performance>`__ to understand
what kind of performance improvement over bfloat16 you can expect for a given GEMM size.

Below is a code snippet showing how to use it:

.. code:: py

    import torch
    from torch import nn
    import torch.nn.functional as F

    from torchao.float8 import convert_to_float8_training
    from torchao.utils import TORCH_VERSION_AT_LEAST_2_5

    if not TORCH_VERSION_AT_LEAST_2_5:
        raise AssertionError("torchao.float8 requires PyTorch version 2.5 or greater")

    # create model and sample input
    m = nn.Sequential(
        nn.Linear(2048, 4096),
        nn.Linear(4096, 128),
        nn.Linear(128, 1),
    ).bfloat16().cuda()
    x = torch.randn(4096, 2048, device="cuda", dtype=torch.bfloat16)
    optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

    # optional: filter modules from being eligible for float8 conversion
    def module_filter_fn(mod: torch.nn.Module, fqn: str):
        # don't convert the second nn.Linear in the Sequential (fqn '1')
        if fqn == "1":
            return False
        # don't convert linear modules with weight dimensions not divisible by 16
        if isinstance(mod, torch.nn.Linear):
            if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
                return False
        return True

    # convert specified `torch.nn.Linear` modules to `Float8Linear`
    convert_to_float8_training(m, module_filter_fn=module_filter_fn)

    # enable torch.compile for competitive performance
    m = torch.compile(m)

    # toy training loop
    for _ in range(10):
        optimizer.zero_grad()
        output = m(x)
        # use fake labels for demonstration purposes
        fake_labels = torch.ones_like(output)
        loss = F.mse_loss(output, fake_labels)
        loss.backward()
        optimizer.step()

    # save a checkpoint containing the model and optimizer state
    torch.save({
        'model': m,
        'model_state_dict': m.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, 'checkpoint.pth')
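
The snippet above uses torchao's default tensorwise scaling recipe. If you want the rowwise scaling recipe shown in the torchtitan section, the recipe can be selected through a config object passed to :code:`convert_to_float8_training`. The sketch below assumes the :code:`Float8LinearConfig.from_recipe_name` API available in recent torchao releases; check the torchao float8 README for the exact spelling in your installed version:

.. code:: py

    import torch
    from torch import nn

    # assumed API: Float8LinearConfig.from_recipe_name selects the scaling recipe
    # ("tensorwise" is the default); verify against your installed torchao version
    from torchao.float8 import Float8LinearConfig, convert_to_float8_training

    m = nn.Sequential(
        nn.Linear(2048, 4096),
        nn.Linear(4096, 2048),
    ).bfloat16().cuda()

    # select the rowwise scaling recipe and convert eligible nn.Linear modules in-place
    rowwise_config = Float8LinearConfig.from_recipe_name("rowwise")
    convert_to_float8_training(m, config=rowwise_config)

    # torch.compile is still required for competitive performance
    m = torch.compile(m)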