
[Train] Is RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES expected to work in Train? #49985

Closed
choosehappy opened this issue Jan 21, 2025 · 2 comments
Labels
bug: Something that is supposed to be working, but isn't
train: Ray Train related issue
triage: Needs triage (e.g. priority, bug/not-bug, and owning component)

Comments

@choosehappy

What happened + What you expected to happen

Setting the GPU resource to a fractional value appears to cause RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES to be ignored when using TorchTrainer, as demonstrated below.

I'm using Ray 2.40.0, and the following works as expected:

import os
import time

import ray.train
import ray.train.torch


def trainpred_func(config):
    # Probe what the Train worker actually sees for device visibility.
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    time.sleep(100)  # keep the worker alive long enough to inspect it


scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True)
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()

With output:

(RayTrainWorker pid=18626) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=18539) Started distributed worker processes: 
(TorchTrainer pid=18539) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18626) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=18539) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18625) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=18626) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18626) os.environ['CUDA_VISIBLE_DEVICES']='0,1'
(RayTrainWorker pid=18625) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18625) os.environ['CUDA_VISIBLE_DEVICES']='0,1'

However, adding a fractional GPU resource per worker, like this:

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1})

now produces this output:

(RayTrainWorker pid=18351) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=18260) Started distributed worker processes: 
(TorchTrainer pid=18260) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18351) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=18260) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18352) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=18352) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18352) os.environ['CUDA_VISIBLE_DEVICES']='0'
(RayTrainWorker pid=18351) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18351) os.environ['CUDA_VISIBLE_DEVICES']='0'

We're still trying to work elegantly around the lack of GPU spreading, as discussed in #48012. Self-management of the GPUs would be an easy, acceptable solution!

Versions / Dependencies

ray==2.40.0
Python 3.10.12
Docker container: nvcr.io/nvidia/pytorch:24.08-py3

Reproduction script

As provided above

Issue Severity

High: It blocks me from completing my task.

choosehappy added the bug and triage labels on Jan 21, 2025
jcotant1 added the train label on Jan 21, 2025
@justinvyu
Contributor

@choosehappy RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES should work to prevent the Ray actor from overwriting CUDA_VISIBLE_DEVICES, but Ray Train will actually set it by default so that all workers on a node can see all of the devices used by the other workers on that node.

In the case of fractional GPUs, since the set of devices used by all workers is just a single GPU, there's only '0' as the visible device.

You can disable this Ray Train behavior with TRAIN_ENABLE_SHARE_CUDA_VISIBLE_DEVICES=0.

We set this default because workers on the same node should be able to do cross-GPU communication, but we exclude unused GPUs since the actual worker group doesn't need to communicate with them.
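
A minimal sketch of combining the two variables (the variable names come from this thread; propagating them through ray.init's runtime_env and the fractional-GPU scaling config are illustrative assumptions, not a confirmed recipe):

import os
import time

import ray
import ray.train
import ray.train.torch

# Sketch only: propagate both variables to every Ray actor/worker through the
# job-level runtime_env so that neither Ray core nor Ray Train rewrites
# CUDA_VISIBLE_DEVICES on the Train workers.
ray.init(
    runtime_env={
        "env_vars": {
            "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
            "TRAIN_ENABLE_SHARE_CUDA_VISIBLE_DEVICES": "0",
        }
    }
)


def trainpred_func(config):
    # Same probe as in the report above.
    print(f"{os.environ.get('CUDA_VISIBLE_DEVICES')=}")
    time.sleep(100)


scaling_config = ray.train.ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1}
)
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()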

@choosehappy
Author

Yea, cool! That little nugget appears to work! Thanks for pointing it out!

One note for those who stumble upon this: for it to work successfully, you must explicitly set CUDA_VISIBLE_DEVICES beforehand; otherwise you will get the error below (see the sketch after the traceback):

2025-01-28 15:56:19,047 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_64274_00000]
ray.exceptions.RayTaskError(ValueError): ray::_Inner.train() (pid=31654, ip=172.17.0.4, actor_id=87d4a857450cc5109f7b76c401000000, repr=TorchTrainer)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
    self._ret = self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
  File "/usr/local/lib/python3.10/dist-packages/ray/train/base_trainer.py", line 799, in _trainable_func
    super()._trainable_func(self._merged_config)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.10/dist-packages/ray/train/base_trainer.py", line 107, in _train_coordinator_fn
    trainer.training_loop()
  File "/usr/local/lib/python3.10/dist-packages/ray/train/data_parallel_trainer.py", line 460, in training_loop
    training_iterator = self._training_iterator_cls(
  File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 51, in __init__
    self._start_training(
  File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 76, in _start_training
    self._run_with_error_handling(
  File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 89, in _run_with_error_handling
    return func()
  File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 77, in <lambda>
    lambda: self._backend_executor.start_training(
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/backend_executor.py", line 535, in start_training
    self._backend.on_training_start(self.worker_group, self._backend_config)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/config.py", line 210, in on_training_start
    worker_group.execute(_set_torch_distributed_env_vars)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 272, in execute
    return ray.get(self.execute_async(func, *args, **kwargs))
ray.exceptions.RayTaskError(ValueError): ray::_RayTrainWorker__execute._set_torch_distributed_env_vars() (pid=31778, ip=172.17.0.4, actor_id=fa58c4a0c38f44f862cb43ec01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fa7fb460d30>)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 30, in __execute
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/config.py", line 146, in _set_torch_distributed_env_vars
    device = get_device()
  File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/train_loop_utils.py", line 107, in get_device
    return torch_utils.get_devices()[0]
  File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/torch_utils.py", line 47, in get_devices
    device_ids.append(cuda_visible_list.index(gpu_id))
ValueError: '0' is not in list
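
For example, a minimal sketch of setting the variable up front (the "0,1" device list assumes a two-GPU node, and the approach assumes locally spawned Ray workers inherit the driver's environment; on a multi-node cluster you would instead set it per node or pass it through a runtime_env):

import os

# Assumed: a node with two GPUs. Listing every GPU the Train workers may be
# assigned lets ray.air._internal.torch_utils.get_devices() find the assigned
# GPU id in the CUDA_VISIBLE_DEVICES list instead of raising the ValueError above.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Import and start Ray only after the environment variable is in place.
import ray

ray.init()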
