[Train] Using Ray Train with fractional GPUs leads to NCCL error - Duplicate GPU detected #48012
Comments
Does Ray support a fractional GPU scaling config for workers with the default NCCL backend? I can see that the above code works fine with the gloo backend. I am also interested in knowing the status of this feature.
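For reference, a minimal sketch of how the gloo fallback mentioned above can be selected explicitly; `train_func` and the resource values are placeholders rather than code from this thread:

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

def train_func(config):
    # placeholder training function
    pass

ray.init()

# backend="gloo" sidesteps the NCCL "Duplicate GPU detected" error
# when fractional requests land both workers on the same GPU.
trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(
        num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1}
    ),
)
trainer.fit()
```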
Coming back to this, I found a janky workaround. Start Ray within the first container:

```bash
CUDA_VISIBLE_DEVICES=0 ray start --address='172.20.0.2:6379'
```

and then within the second container:

```bash
CUDA_VISIBLE_DEVICES=1 ray start --address='172.20.0.2:6379'
```

etc., and then use:

```python
scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": .1}, placement_strategy="SPREAD")
```

Obviously this is a terrible idea for multiple reasons; for example, the total CPU count and RAM count is 2x the correct amount. One can hack around that as well. One can also imagine a setting where the scheduler cannot SPREAD and leaves the system back in a failed state. STRICT_SPREAD is likely a smarter option to ensure failure if the spread wouldn't take place, but at least it "works" : )
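A sketch of the STRICT_SPREAD variant mentioned above; the resource values are the same placeholders as in the workaround:

```python
import ray.train

# STRICT_SPREAD fails the placement outright if the two workers cannot be
# scheduled on distinct nodes (here, the two containers that were each
# pinned to a single GPU via CUDA_VISIBLE_DEVICES), rather than falling
# back to packing them together.
scaling_config = ray.train.ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"GPU": 0.1},
    placement_strategy="STRICT_SPREAD",
)
```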
Hi, did you find a solution to this?
Only the janky one I mentioned above.
Seems to be resolved by the other ticket #49985; now basically as shown below. Did some light testing and it seems to be okay. Will reopen if any issues persist.

```bash
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0
export TRAIN_ENABLE_SHARE_CUDA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0,1
```

and then in code:

```python
import ray
import ray.train.torch
import torch
import time
import os
from ray.train import ScalingConfig
from torchvision.models import resnet18

ray.init()

def trainpred_func2(config):
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    model = resnet18(num_classes=10)
    # Select this worker's GPU by local rank.
    cuda_dev = torch.device('cuda', ray.train.get_context().get_local_rank())
    # prepare_model moves the model to the chosen device and wraps it in DDP.
    model = ray.train.torch.prepare_model(model, cuda_dev)
    time.sleep(10)

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": .1})
trainer = ray.train.torch.TorchTrainer(trainpred_func2, scaling_config=scaling_config)
trainer.fit()
```
What happened + What you expected to happen
I have a machine with 2 GPUs. In my use case, I need to run multiple TorchTrainers concurrently on different datasets.
Ray Train is set up to avoid fragmenting jobs across different GPUs, but in this case that packing leads to an obvious NCCL "Duplicate GPU detected" error.
I would expect that when using Ray Train, the TorchTrainer internally knows that it should set CUDA_VISIBLE_DEVICES equivalent to a "spread" across GPUs, as opposed to a "pack" onto a single GPU.
I can get "something" that looks like the correct behavior by launching the first train job with GPU: .6 per worker, so it spreads across both GPUs, and then launching a second train job with GPU: .4.
However, if you run a job with anything < .5 per worker, both workers will try to use the same GPU, resulting in the NCCL error.
I tried playing aggressively with something like:
but internally the Ray device_manager really hates this, and there doesn't seem to be a working combination that allows it to slide through.
Any thoughts?
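For concreteness, the partial workaround described above (first job at .6 GPU per worker, second at .4) amounts to something like the following sketch; only the scaling configs are shown, and the values assume the 2-GPU machine in this report:

```python
import ray.train

# First concurrent job: 2 workers x 0.6 GPU = 1.2 GPUs total, so the two
# workers cannot be packed onto a single GPU and end up spread across both.
scaling_config_job1 = ray.train.ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.6}
)

# Second concurrent job: 2 workers x 0.4 GPU fill the remaining 0.4 on each
# GPU, so these workers also land on different GPUs.
scaling_config_job2 = ray.train.ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.4}
)
```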
Versions / Dependencies
Ray 2.37, Python 3.10
Reproduction script
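A minimal sketch of the scenario described above (2 workers at a fractional GPU each, default NCCL backend), assuming a simple resnet18 training function similar to the one shown in the comments:

```python
import ray
import ray.train.torch
import torch
from torchvision.models import resnet18

ray.init()

def train_func(config):
    # Minimal training function: preparing a model is enough to
    # initialize the NCCL process group on each worker.
    model = resnet18(num_classes=10)
    model = ray.train.torch.prepare_model(model)

# Two workers, each requesting a fraction of a GPU. On a 2-GPU machine,
# both workers get packed onto GPU 0 and NCCL fails with
# "Duplicate GPU detected".
scaling_config = ray.train.ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1}
)
trainer = ray.train.torch.TorchTrainer(train_func, scaling_config=scaling_config)
trainer.fit()
```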
Issue Severity
High: It blocks me from completing my task.