local_rank = setup_distributed() #15

Open

lq-blackcat opened this issue Apr 19, 2024 · 3 comments

Comments

@lq-blackcat

How long does distributed training initialization take?
dist.init_process_group(
    backend=backend,
    world_size=world_size,
    rank=rank,
)

@beichenzbc
Owner

Very quick. If you get stuck at this step, there is usually a mistake in your script.
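
For reference, a minimal env-based initialization of the kind torchrun drives might look like the sketch below; this is only an illustration of the common pattern (the function name and details here are assumptions), not necessarily this repository's setup_distributed.

# Sketch of a typical torchrun-style initialization (illustrative only).
import os

import torch
import torch.distributed as dist

def setup_distributed_sketch(backend="nccl"):
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and
    # MASTER_PORT for every process it launches; env:// init reads them.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )
    return local_rank

Note that init_process_group blocks until world_size processes have joined the group, so a mismatch between the world size the script expects and the number of processes actually launched looks exactly like a hang.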

@lq-blackcat
Author

#!/bin/bash
#SBATCH --job-name=long-clip
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --gres=gpu:1
#SBATCH --time=96:00:00
#SBATCH --comment pris718bobo

source ~/.bashrc

export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 train.py

What needs to be modified? Could you please provide some help? @beichenzbc

@gulizhoutao


Did you resolve this problem? I ran into the same issue.
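
One thing that may help narrow it down is printing what the launcher and scheduler actually provide right before init_process_group is called (a debugging sketch; the variable names assume torchrun's env:// rendezvous and standard SLURM exports, and are not taken from this repository's code):

# Debugging sketch: dump launcher/scheduler environment before init (illustrative).
import os

for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT",
            "SLURM_NTASKS", "SLURM_PROCID", "CUDA_VISIBLE_DEVICES"):
    print(f"{key}={os.environ.get(key)}")

If the script derives its world size from SLURM variables, the mismatch between --ntasks=32 in the batch script and the single process started by torchrun --nproc_per_node=1 could leave init_process_group waiting for ranks that never start.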
