
Distributed Training Env Variables #469

Answered by rwightman
ademyanchuk asked this question in Q&A

@ademyanchuk I don't set any; I use the distributed_train.sh script as-is for same-node DDP training.
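
For reference, a minimal sketch of what a same-node launcher along the lines of distributed_train.sh can look like, assuming it just wraps torch.distributed.launch with a GPU count as its first argument and forwards everything else to train.py (the exact script in the repo may differ slightly):

#!/bin/bash
# First arg = number of processes (GPUs) on this node; remaining args are passed to train.py.
NUM_PROC=$1
shift
python -m torch.distributed.launch --nproc_per_node=$NUM_PROC train.py "$@"

Invoked roughly as ./distributed_train.sh NUM_GPUS <train.py args...>, so the rank/world-size environment variables are set by the launcher rather than by hand.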

For multi-node training, on the rare occasions I've used it, I manually add the rank/master info to the args of a similar shell script on each machine, like the example below.

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_PER_MACHINE \
               --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
               --master_port=1234 train.py
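
On the second machine, the same command is repeated with --node_rank=1, keeping --master_addr and --master_port pointed at node 0 (the 192.168.1.1 address above is just an example), e.g.:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_PER_MACHINE \
               --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
               --master_port=1234 train.py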
