Dynamo error with large mesh + AdamWFp8 + bf16 stochastic rounding #2074

Open
@cassanof

Description

Hello, I get the following error whenever I scale training up to 512 GPUs with FSDP2 + AdamWFp8 + BF16 stochastic rounding:

  torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_method copy_(*(DTensor(local_tensor=FakeTensor(..., device='cuda:4', size=(253, 7168), dtype=torch.bfloat16), device_mesh=DeviceMesh('cuda', [4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196, 204, 212, 220, 228, 236, 244, 252, 260, 268, 276, 284, 292, 300, 308, 316, 324, 332, 340, 348, 356, 364, 372, 380, 388, 396, 404, 412, 420, 428, 436, 444, 452, 460, 468, 476, 484, 492, 500, 508], mesh_dim_names=('dp_shard_cp',)), placements=(Shard(dim=0),)), DTensor(local_tensor=FakeTensor(..., device='cuda:4', size=(253, 7168), dtype=torch.bfloat16), device_mesh=DeviceMesh('cuda', [4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196, 204, 212, 220, 228, 236, 244, 252, 260, 268, 276, 284, 292, 300, 308, 316, 324, 332, 340, 348, 356, 364, 372, 380, 388, 396, 404, 412, 420, 428, 436, 444, 452, 460, 468, 476, 484, 492, 500, 508], mesh_dim_names=('dp_shard_cp',)), placements=(Shard(dim=0),))), **{}): got RuntimeError('expand: attempting to expand a dimension of length 16192!')

  from user code:
     File "/home/federico/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchao/prototype/low_bit_optim/adam.py", line 189, in single_param_adam
      p.copy_(_fp32_to_bf16_sr(p_f32))

  Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
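
One detail in the traceback that may help with triage: the `dp_shard_cp` mesh printed there has 64 ranks (4, 12, ..., 508), and 64 times the local row count of 253 is exactly the 16192 named in the `expand` error. The quick check below uses only numbers copied from the traceback. It suggests, but does not prove, that 16192 is the global size of dim 0 under even sharding; it says nothing about the root cause.

```python
# All values are copied from the traceback above; nothing here touches torch.
mesh_ranks = list(range(4, 509, 8))  # DeviceMesh('cuda', [4, 12, ..., 508])
local_rows = 253                     # FakeTensor size=(253, 7168)
failing_dim = 16192                  # "expand a dimension of length 16192"

num_shards = len(mesh_ranks)
global_rows = local_rows * num_shards  # assumes every rank holds 253 rows

print(num_shards)                  # 64
print(global_rows)                 # 16192
print(global_rows == failing_dim)  # True
```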

Scaling down the run or switching to HSDP works around the problem, but neither is a great option.
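
For context on the failing line, `p.copy_(_fp32_to_bf16_sr(p_f32))` writes a stochastically rounded BF16 copy of the FP32 parameter back into `p`. The sketch below illustrates the general stochastic rounding technique on a single scalar in plain Python. It is not the torchao implementation: the function names are hypothetical, and it ignores inf/NaN handling.

```python
import random
import struct


def fp32_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits: int) -> float:
    """Interpret a 32-bit unsigned int as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def fp32_to_bf16_sr(x: float, rng: random.Random) -> float:
    """Stochastically round x to a bf16-representable value (hypothetical helper).

    bf16 keeps the top 16 bits of a float32. Adding uniform 16-bit noise to
    the discarded low bits before truncating rounds the magnitude up with
    probability (low_bits / 65536), so the result is unbiased in expectation.
    inf/NaN inputs are not handled in this sketch.
    """
    bits = fp32_bits(x)
    bits = (bits + rng.getrandbits(16)) & 0xFFFF0000
    return bits_to_fp32(bits)


rng = random.Random(0)
x = 1.0 + 2.0 ** -9  # a quarter of the way between two adjacent bf16 values
samples = [fp32_to_bf16_sr(x, rng) for _ in range(20000)]
print(sorted(set(samples)))         # the two neighbouring bf16 values
print(sum(samples) / len(samples))  # close to x on average
```

The point of rounding this way is that small optimizer updates, which round-to-nearest would drop entirely, still move BF16 weights in expectation.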
