Description
Hello, I am getting the following error whenever I scale training up to 512 GPUs while using FSDP2 + AdamWFp8 + BF16 stochastic rounding:
```
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_method copy_(*(DTensor(local_tensor=FakeTensor(..., device='cuda:4', size=(253, 7168), dtype=torch.bfloat16), device_mesh=DeviceMesh('cuda', [4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196, 204, 212, 220, 228, 236, 244, 252, 260, 268, 276, 284, 292, 300, 308, 316, 324, 332, 340, 348, 356, 364, 372, 380, 388, 396, 404, 412, 420, 428, 436, 444, 452, 460, 468, 476, 484, 492, 500, 508], mesh_dim_names=('dp_shard_cp',)), placements=(Shard(dim=0),)), DTensor(local_tensor=FakeTensor(..., device='cuda:4', size=(253, 7168), dtype=torch.bfloat16), device_mesh=DeviceMesh('cuda', [4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196, 204, 212, 220, 228, 236, 244, 252, 260, 268, 276, 284, 292, 300, 308, 316, 324, 332, 340, 348, 356, 364, 372, 380, 388, 396, 404, 412, 420, 428, 436, 444, 452, 460, 468, 476, 484, 492, 500, 508], mesh_dim_names=('dp_shard_cp',)), placements=(Shard(dim=0),))), **{}): got RuntimeError('expand: attempting to expand a dimension of length 16192!')

from user code:
  File "/home/federico/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchao/prototype/low_bit_optim/adam.py", line 189, in single_param_adam
    p.copy_(_fp32_to_bf16_sr(p_f32))

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
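
For context, the setup is roughly as follows. This is a minimal sketch rather than my exact training script: it assumes torchao's prototype low-bit optimizer API (matching the `torchao/prototype/low_bit_optim` path in the traceback) and the FSDP2 `fully_shard` entry point; the model and mesh shape are placeholders. (Note that the `dp_shard_cp` mesh in the error has 64 ranks, and 253 × 64 = 16192, the length the expand complains about.)

```python
import os

import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # older torch: torch.distributed._composable.fsdp
from torchao.prototype.low_bit_optim import AdamWFp8

# Placeholder model; the real run is a large transformer launched via torchrun.
model = torch.nn.Transformer().to(torch.bfloat16).cuda()

# Flat 1D shard mesh over the data-parallel ranks ('dp_shard_cp' as in the log;
# the real run combines this with other parallelism dims).
world_size = int(os.environ["WORLD_SIZE"])
mesh = init_device_mesh("cuda", (world_size,), mesh_dim_names=("dp_shard_cp",))

# FSDP2: shard each block, then the root module.
for layer in model.encoder.layers:
    fully_shard(layer, mesh=mesh)
fully_shard(model, mesh=mesh)

# FP8 AdamW with BF16 stochastic rounding; the failing copy_ happens inside
# the compiled single_param_adam when it writes the stochastically rounded
# fp32 master value back into the sharded bf16 parameter (a DTensor).
optimizer = AdamWFp8(model.parameters(), lr=3e-4, bf16_stochastic_round=True)
```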
Either scaling the run down or switching to HSDP works around the problem, but neither is a great solution.
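
For reference, the HSDP workaround amounts to swapping the flat shard mesh for a 2D replicate × shard mesh, which keeps each shard group small. A sketch under the same assumptions as above; the 8-way split and the dim names are illustrative:

```python
# HSDP instead of plain FSDP2: replicate across groups, shard within them,
# so no parameter dimension is sharded across all data-parallel ranks.
hsdp_mesh = init_device_mesh(
    "cuda", (world_size // 8, 8), mesh_dim_names=("dp_replicate", "dp_shard")
)
for layer in model.encoder.layers:
    fully_shard(layer, mesh=hsdp_mesh)
fully_shard(model, mesh=hsdp_mesh)
```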