In the Megatron-LM repo (https://github.com/NVIDIA/Megatron-LM/blob/4429e8ebe21fb011529d7401c370841ce530785a/megatron/training/arguments.py#L779), larger values of CUDA_DEVICE_MAX_CONNECTIONS are recommended for FSDP, but Megatron's tensor parallelism requires it to be 1.
Does the same constraint apply to the torch-native TP implementation built on DTensor?
How should I configure this environment variable when using the torch implementations of FSDP(2) and/or TP/CP/SP?
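For context, this is roughly how I set it today (a minimal sketch; as far as I understand, the variable is read when the CUDA context is created, so it has to be set before the first CUDA call):

```python
import os

# Set before any CUDA initialization (e.g. before importing/using torch.cuda);
# the runtime reads this once at context creation.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"  # the value Megatron TP requires;
                                                 # unclear if DTensor-based TP needs it too

print(os.environ["CUDA_DEVICE_MAX_CONNECTIONS"])
```

In practice I export it in the launch script before `torchrun`, but setting it at the top of the training entrypoint (before any torch CUDA usage) should be equivalent.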