
Commit 934d11e

refresh MX readme (#1989)
1 parent a010e62 commit 934d11e

1 file changed: 75 additions, 58 deletions
# MX training and inference with native PyTorch

This is a workflow for end-to-end training and inference with MX dtypes from the [MX OCP spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
in native PyTorch. This workflow is currently a prototype, and we are actively working on optimizing it on NVIDIA B200 hardware.

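For intuition, here is a rough, self-contained emulation of a single MX block per the OCP spec: 32 contiguous elements share one power-of-two scale, and each element is stored in a narrow dtype such as `torch.float8_e4m3fn`. This sketch is illustrative only; the real implementation lives in `MXTensor` (see the User API section below).

```python
import torch

# illustrative-only emulation of one mxfp8 (e4m3) block per the OCP MX spec:
# 32 contiguous elements share a single power-of-two scale
block = torch.randn(32)

# shared scale: 2^(floor(log2(amax)) - emax_elem), where emax_elem = 8 for e4m3 (max normal 448)
amax = block.abs().max()
scale = torch.pow(2.0, torch.floor(torch.log2(amax)) - 8)

# elements are stored in fp8 e4m3; clamp to the representable range before casting
elems_fp8 = (block / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)

# dequantize: element values times the shared block scale
block_dq = elems_fp8.to(torch.float32) * scale
print("max abs quantization error:", (block - block_dq).abs().max().item())
```
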
## Overall status

| workflow | emulation | performance | accuracy |
| --- | --- | --- | --- |
| training with mxfp8 | ✅ | 🚧 [active development](https://github.com/pytorch/ao/issues/1768) | ✅ |
| inference (weight-only) with mxfp8, mxfp6, mxfp4 | ✅ | 🔲 | 🔲 |

We plan to add the following features in the near future:
* other inference workflows such as dynamic quantization
* a unified training to inference workflow

ℹ️ <em>See the [feature tracker](https://github.com/pytorch/ao/issues/556) and the [performance tracker](https://github.com/pytorch/ao/issues/1768) for upcoming features.</em>

# User API

## MX training

```python
import torch
# ... (the middle of this example is elided in this diff) ...
quantize_(m, config)

# training loop (not shown)
```

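The training example above is only partially visible in this diff. As a rough sketch of what the full flow looks like, assuming the `MXLinearConfig` config class from the torchao MX prototype (the exact import path and arguments may differ; check the repo):

```python
import torch
from torchao.quantization import quantize_
# assumption: config class and import path from the torchao MX prototype; verify against the repo
from torchao.prototype.mx_formats import MXLinearConfig

# a toy model whose nn.Linear layers will be swapped for MX training linears
m = torch.nn.Sequential(torch.nn.Linear(32, 32)).cuda()

# mxfp8 with the spec-default block size of 32
config = MXLinearConfig(elem_dtype=torch.float8_e4m3fn, block_size=32)
quantize_(m, config)

# training loop (not shown)
```
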
## MX inference

Note: currently only weight-only quantization is supported (weights are stored in MX and the matmul runs in high precision).

```python
import torch
# ... (the middle of this example is elided in this diff) ...
quantize_(m, config=config)

# do inference (not shown)
```

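The inference example is likewise only partially visible here. As a minimal illustration of the weight-only idea (weight stored in MX, matmul in high precision) using the `MXTensor` API described in the next section; this is a sketch of the concept, not necessarily how `quantize_` implements it:

```python
import torch
from torchao.prototype.mx_formats.mx_tensor import MXTensor

# weight-only: store the weight in mxfp8, dequantize to bf16 at matmul time
w = torch.randn(32, 32, device='cuda', dtype=torch.bfloat16)
x = torch.randn(8, 32, device='cuda', dtype=torch.bfloat16)

w_mx = MXTensor.to_mx(w, torch.float8_e4m3fn)  # weight held in MX format
y = x @ w_mx.to_dtype(torch.bfloat16).t()      # matmul in high precision
```
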
## MXTensor

This class implements casts between high precision formats and MX formats in native PyTorch. Currently
only `torch.float32` and `torch.bfloat16` are supported as high precision formats.

```python
import torch

from torchao.prototype.mx_formats.mx_tensor import MXTensor
# Note: MX int8 is not implemented yet
from torchao.prototype.mx_formats.constants import DTYPE_FP6_E2M3, DTYPE_FP6_E3M2, DTYPE_FP4

x = torch.randn(32, 32, device='cuda')

# elem_dtype can be torch.float8_e4m3fn, torch.float8_e5m2, DTYPE_FP6_E2M3, DTYPE_FP6_E3M2, DTYPE_FP4
elem_dtype = torch.float8_e4m3fn

# high precision to MX, block size defaults to 32
x_mx = MXTensor.to_mx(x, elem_dtype)

# mx back to high precision
x_hp = x_mx.to_dtype(torch.float)
```

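As a quick sanity check on the cast above (not part of the original snippet), you can compare the round-tripped tensor against the original:

```python
# continuing from the snippet above: error introduced by the mxfp8 round trip
max_abs_err = (x - x_hp).abs().max()
print(f"max abs round-trip error: {max_abs_err.item():.4f}")
```
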
# performance

## mxfp8 gemm

On NVIDIA B200 machines, we use the cuBLAS mxfp8 gemm exposed via the `torch._scaled_mm` op.
We observe a speedup of **2x to 3x** vs the bf16 baseline on common shapes. To reproduce this
on supported hardware, you can run the following command:

```bash
# benchmark the cuBLAS mxfp8 gemm vs the bf16 baseline
python benchmarks/float8/bench_matmul.py --recipe mxfp8_cublas
# example output: https://gist.github.com/vkuzo/a1ddb782e6e1c2aef0c726b3df99efbc
```

## to_mx cast across dim0 and dim1

On NVIDIA B200 machines, our to_mx kernels for mxfp8 achieve **up to 5.5 TB/s** for the dim0 cast (with torch.compile),
and **up to 3.9 TB/s** for the dim1 cast (with a triton kernel). We are actively working on improving
the performance of this cast ([details](https://github.com/pytorch/ao/issues/1768)).

To reproduce this on supported hardware, you can run the following commands:

```bash
# dim0 cast with torch.compile
python benchmarks/mx_formats/cast_bench.py --mode dim0_mx --M 16384 --K 16384
# example output: https://gist.github.com/vkuzo/06aae58de9b8aae02c82adb00eb33197

# dim1 cast with a handwritten triton kernel
python benchmarks/mx_formats/cast_bench.py --mode dim1_mx_triton --M 16384 --K 16384
# example output: https://gist.github.com/vkuzo/7ac5fce44c9b90bfb9eae2a07b721cda
```

## performance tracker

Please see our [performance tracker](https://github.com/pytorch/ao/issues/1768) for the latest on MX training and inference performance!

# accuracy

## training

* LLaMa 3 8B pretraining on 4 GPUs for 500 iterations shows that loss convergence is not meaningfully degraded (code not in this repo)
* we match bitwise to other implementations of the OCP MX spec (code not in this repo), with a couple of edge cases left to resolve

## inference

Coming soon!

# testing

```bash
pytest test/prototype/mx_formats/
```
