local_rank = setup_distributed() #15

Open

lq-blackcat opened this issue Apr 19, 2024 · 3 comments

Comments

@lq-blackcat

How long does distributed training initialization take?
dist.init_process_group(
    backend=backend,
    world_size=world_size,
    rank=rank,
)

@beichenzbc
Owner

Very quick. If you get stuck at this step, there is usually a mistake in your script.
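
For reference, a minimal env-based initialization of the kind torchrun drives might look like the sketch below; this is only an illustration of the common pattern (the function name and details here are assumptions), not necessarily this repository's setup_distributed.

# Sketch of a typical torchrun-style initialization (illustrative only).
import os

import torch
import torch.distributed as dist

def setup_distributed_sketch(backend="nccl"):
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and
    # MASTER_PORT for every process it launches; env:// init reads them.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )
    return local_rank

Note that init_process_group blocks until world_size processes have joined the group, so a mismatch between the world size the script expects and the number of processes actually launched looks exactly like a hang.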

@lq-blackcat
Author

#!/bin/bash
#SBATCH --job-name=long-clip
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --gres=gpu:1
#SBATCH --time=96:00:00
#SBATCH --comment pris718bobo

source ~/.bashrc

export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 train.py

What needs to be modified? Could you please provide some help? @beichenzbc

@gulizhoutao


Did you resolve this problem? I ran into the same issue.
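
One thing that may help narrow it down is printing what the launcher and scheduler actually provide right before init_process_group is called (a debugging sketch; the variable names assume torchrun's env:// rendezvous and standard SLURM exports, and are not taken from this repository's code):

# Debugging sketch: dump launcher/scheduler environment before init (illustrative).
import os

for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT",
            "SLURM_NTASKS", "SLURM_PROCID", "CUDA_VISIBLE_DEVICES"):
    print(f"{key}={os.environ.get(key)}")

If the script derives its world size from SLURM variables, the mismatch between --ntasks=32 in the batch script and the single process started by torchrun --nproc_per_node=1 could leave init_process_group waiting for ranks that never start.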
