
Distributed Training Env Variables #469

Answered by rwightman
ademyanchuk asked this question in Q&A

@ademyanchuk I don't set any; I use the distributed_train.sh script as-is for same-node DDP training.
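
For reference, a minimal sketch of what a same-node launcher along the lines of distributed_train.sh can look like, assuming it just wraps torch.distributed.launch with a GPU count as its first argument and forwards everything else to train.py (the exact script in the repo may differ slightly):

#!/bin/bash
# First arg = number of processes (GPUs) on this node; remaining args are passed to train.py.
NUM_PROC=$1
shift
python -m torch.distributed.launch --nproc_per_node=$NUM_PROC train.py "$@"

Invoked roughly as ./distributed_train.sh NUM_GPUS <train.py args...>, so the rank/world-size environment variables are set by the launcher rather than by hand.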

For multi-node training, on the rare occasions I've used it, I manually add the rank/master info to the args of a similar shell script on each machine, like the example below.

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_PER_MACHINE \
               --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
               --master_port=1234 train.py
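
On the second machine, the same command is repeated with --node_rank=1, keeping --master_addr and --master_port pointed at node 0 (the 192.168.1.1 address above is just an example), e.g.:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_PER_MACHINE \
               --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
               --master_port=1234 train.py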
