### Bug description

Llama3 8b on 4xH100s with per op SAC, using FSDP=2, TP=2

- bf16: 5378 TPS, 45.68 GiB peak memory
- float8 rowwise: 5189 TPS, 45.67 GiB peak memory

### Versions

- torch 2.8.0a0+gite21ad6e
- torchtitan @ HEAD
- torchao 0.11.0
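
For reference, the gap between the two runs in the bug description can be quantified with a quick calculation. This is only a sketch over the two reported data points; the variable names are illustrative and not part of torchtitan or torchao.

```python
# Figures copied from the bug description above.
bf16_tps = 5378                # bf16 throughput, tokens per second
float8_rowwise_tps = 5189      # float8 rowwise throughput, tokens per second
bf16_mem_gib = 45.68           # bf16 peak memory
float8_rowwise_mem_gib = 45.67 # float8 rowwise peak memory

# Relative throughput change of float8 rowwise versus the bf16 baseline.
relative_change = (float8_rowwise_tps - bf16_tps) / bf16_tps

# Absolute difference in peak memory between the two runs.
mem_delta_gib = round(float8_rowwise_mem_gib - bf16_mem_gib, 2)

print(f"float8 rowwise vs bf16 throughput: {relative_change:+.2%}")
print(f"float8 rowwise vs bf16 peak memory: {mem_delta_gib:+.2f} GiB")
```

With these numbers, float8 rowwise comes out about 3.5% slower than bf16 in this configuration, with essentially unchanged peak memory.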