[Train] Using Ray Train with fractional GPUs leads to NCCL error - Duplicate GPU detected #48012
Comments
Does Ray support a fractional GPU scaling config for workers with the default NCCL backend? I can see that the above code works fine with the gloo backend. I am also interested in knowing the status of this feature.
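For reference, a minimal sketch of how the gloo fallback mentioned above can be selected explicitly; `train_func` and the resource values are placeholders rather than code from this thread:

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

def train_func(config):
    # placeholder training function
    pass

ray.init()

# backend="gloo" sidesteps the NCCL "Duplicate GPU detected" error
# when fractional requests land both workers on the same GPU.
trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(
        num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1}
    ),
)
trainer.fit()
```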
Coming back to this, I found a janky workaround. Start Ray within the first container:

```bash
CUDA_VISIBLE_DEVICES=0 ray start --address='172.20.0.2:6379'
```

and then within the second container:

```bash
CUDA_VISIBLE_DEVICES=1 ray start --address='172.20.0.2:6379'
```

etc., and then use:

```python
scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": .1}, placement_strategy="SPREAD")
```

Obviously this is a terrible idea for multiple reasons; for example, the total CPU count and RAM count is 2x the correct amount. One can hack around that as well. One can also imagine a setting where the scheduler cannot SPREAD and leaves the system back in a failed state. STRICT_SPREAD is likely a smarter option to ensure failure if the spread wouldn't take place, but at least it "works" : )
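A sketch of the STRICT_SPREAD variant mentioned above; the resource values are the same placeholders as in the workaround:

```python
import ray.train

# STRICT_SPREAD fails the placement outright if the two workers cannot be
# scheduled on distinct nodes (here, the two containers that were each
# pinned to a single GPU via CUDA_VISIBLE_DEVICES), rather than falling
# back to packing them together.
scaling_config = ray.train.ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"GPU": 0.1},
    placement_strategy="STRICT_SPREAD",
)
```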
Hi, did you find a solution to this?
Only the janky one I mentioned above.
Seems to be resolved by the other ticket #49985; now basically as shown below. Did some light testing and it seems to be okay. Will reopen if any issues persist.

```bash
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0
export TRAIN_ENABLE_SHARE_CUDA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0,1
```

and then in code:

```python
import ray
import ray.train.torch
import torch
import time
import os
from ray.train import ScalingConfig
from torchvision.models import resnet18

ray.init()

def trainpred_func2(config):
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    model = resnet18(num_classes=10)
    # Select this worker's GPU by local rank.
    cuda_dev = torch.device('cuda', ray.train.get_context().get_local_rank())
    # prepare_model moves the model to the chosen device and wraps it in DDP.
    model = ray.train.torch.prepare_model(model, cuda_dev)
    time.sleep(10)

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": .1})
trainer = ray.train.torch.TorchTrainer(trainpred_func2, scaling_config=scaling_config)
trainer.fit()
```
What happened + What you expected to happen
I have a machine with 2 GPUs. In my use case, I need to run multiple TorchTrainers concurrently on different datasets.
Ray Train is set up to avoid fragmenting jobs across different GPUs, but in this case that packing leads to an obvious NCCL "Duplicate GPU detected" error.
I would expect that when using Ray Train, the TorchTrainer internally knows that it should set CUDA_VISIBLE_DEVICES equivalent to a "spread" across GPUs, as opposed to a "pack" onto a single GPU.
I can get "something" that looks like the correct behavior by launching the first train job with GPU: .6 per worker, so it spreads across both GPUs, and then launching a second train job with GPU: .4.
However, if you run a job with anything < .5 per worker, both workers will try to use the same GPU, resulting in the NCCL error.
I tried playing aggressively with something like:
but internally the Ray device_manager really hates this, and there doesn't seem to be a working combination that allows it to slide through.
Any thoughts?
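For concreteness, the partial workaround described above (first job at .6 GPU per worker, second at .4) amounts to something like the following sketch; only the scaling configs are shown, and the values assume the 2-GPU machine in this report:

```python
import ray.train

# First concurrent job: 2 workers x 0.6 GPU = 1.2 GPUs total, so the two
# workers cannot be packed onto a single GPU and end up spread across both.
scaling_config_job1 = ray.train.ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.6}
)

# Second concurrent job: 2 workers x 0.4 GPU fill the remaining 0.4 on each
# GPU, so these workers also land on different GPUs.
scaling_config_job2 = ray.train.ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.4}
)
```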
Versions / Dependencies
Ray 2.37, Python 3.10
Reproduction script
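A minimal sketch of the scenario described above (2 workers at a fractional GPU each, default NCCL backend), assuming a simple resnet18 training function similar to the one shown in the comments:

```python
import ray
import ray.train.torch
import torch
from torchvision.models import resnet18

ray.init()

def train_func(config):
    # Minimal training function: preparing a model is enough to
    # initialize the NCCL process group on each worker.
    model = resnet18(num_classes=10)
    model = ray.train.torch.prepare_model(model)

# Two workers, each requesting a fraction of a GPU. On a 2-GPU machine,
# both workers get packed onto GPU 0 and NCCL fails with
# "Duplicate GPU detected".
scaling_config = ray.train.ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1}
)
trainer = ray.train.torch.TorchTrainer(train_func, scaling_config=scaling_config)
trainer.fit()
```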
Issue Severity
High: It blocks me from completing my task.