
[Train] Is RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES expected to work in Train? #49985

Closed
choosehappy opened this issue Jan 21, 2025 · 2 comments
Labels
bug: Something that is supposed to be working, but isn't
train: Ray Train related issue
triage: Needs triage (e.g. priority, bug/not-bug, and owning component)

Comments

@choosehappy

What happened + What you expected to happen

Setting the GPU resource to a fractional value appears to cause RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES to be ignored when using TorchTrainer, as demonstrated below.

I'm using Ray 2.40.0, and the following works as expected:

import os
import time

import ray.train
import ray.train.torch


def trainpred_func(config):
    # Probe what the Train worker actually sees for device visibility.
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    time.sleep(100)  # keep the worker alive long enough to inspect it


scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True)
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()

With output:

(RayTrainWorker pid=18626) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=18539) Started distributed worker processes: 
(TorchTrainer pid=18539) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18626) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=18539) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18625) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=18626) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18626) os.environ['CUDA_VISIBLE_DEVICES']='0,1'
(RayTrainWorker pid=18625) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18625) os.environ['CUDA_VISIBLE_DEVICES']='0,1'

However, adding a fractional GPU resource per worker, like this:

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1})

now produces this output:

(RayTrainWorker pid=18351) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=18260) Started distributed worker processes: 
(TorchTrainer pid=18260) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18351) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=18260) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18352) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=18352) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18352) os.environ['CUDA_VISIBLE_DEVICES']='0'
(RayTrainWorker pid=18351) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18351) os.environ['CUDA_VISIBLE_DEVICES']='0'

We're still trying to work elegantly around the lack of GPU spreading, as discussed in #48012. Self-management of the GPUs would be an easy, acceptable solution!

Versions / Dependencies

ray==2.40.0
Python 3.10.12
Docker container: nvcr.io/nvidia/pytorch:24.08-py3

Reproduction script

As provided above

Issue Severity

High: It blocks me from completing my task.

choosehappy added the bug and triage labels on Jan 21, 2025
jcotant1 added the train label on Jan 21, 2025
@justinvyu
Contributor

@choosehappy RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES should work to prevent the Ray actor from overwriting CUDA_VISIBLE_DEVICES, but Ray Train will actually set it by default so that all workers on a node can see all of the devices used by the other workers on that node.

In the case of fractional GPUs, since the set of devices used by all workers is just a single GPU, there's only '0' as the visible device.

You can disable this Ray Train behavior with TRAIN_ENABLE_SHARE_CUDA_VISIBLE_DEVICES=0.

We set this default because workers on the same node should be able to do cross-GPU communication, but we exclude unused GPUs since the actual worker group doesn't need to communicate with them.
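
A minimal sketch of combining the two variables (the variable names come from this thread; propagating them through ray.init's runtime_env and the fractional-GPU scaling config are illustrative assumptions, not a confirmed recipe):

import os
import time

import ray
import ray.train
import ray.train.torch

# Sketch only: propagate both variables to every Ray actor/worker through the
# job-level runtime_env so that neither Ray core nor Ray Train rewrites
# CUDA_VISIBLE_DEVICES on the Train workers.
ray.init(
    runtime_env={
        "env_vars": {
            "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
            "TRAIN_ENABLE_SHARE_CUDA_VISIBLE_DEVICES": "0",
        }
    }
)


def trainpred_func(config):
    # Same probe as in the report above.
    print(f"{os.environ.get('CUDA_VISIBLE_DEVICES')=}")
    time.sleep(100)


scaling_config = ray.train.ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1}
)
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()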

@choosehappy
Author

Yea, cool! That little nugget appears to work! Thanks for pointing it out!

One note for those who stumble upon this: for it to work successfully, you must explicitly set CUDA_VISIBLE_DEVICES beforehand; otherwise you will get the error below (see the sketch after the traceback):

2025-01-28 15:56:19,047 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_64274_00000]
ray.exceptions.RayTaskError(ValueError): ray::_Inner.train() (pid=31654, ip=172.17.0.4, actor_id=87d4a857450cc5109f7b76c401000000, repr=TorchTrainer)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
    self._ret = self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
  File "/usr/local/lib/python3.10/dist-packages/ray/train/base_trainer.py", line 799, in _trainable_func
    super()._trainable_func(self._merged_config)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.10/dist-packages/ray/train/base_trainer.py", line 107, in _train_coordinator_fn
    trainer.training_loop()
  File "/usr/local/lib/python3.10/dist-packages/ray/train/data_parallel_trainer.py", line 460, in training_loop
    training_iterator = self._training_iterator_cls(
  File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 51, in __init__
    self._start_training(
  File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 76, in _start_training
    self._run_with_error_handling(
  File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 89, in _run_with_error_handling
    return func()
  File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 77, in <lambda>
    lambda: self._backend_executor.start_training(
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/backend_executor.py", line 535, in start_training
    self._backend.on_training_start(self.worker_group, self._backend_config)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/config.py", line 210, in on_training_start
    worker_group.execute(_set_torch_distributed_env_vars)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 272, in execute
    return ray.get(self.execute_async(func, *args, **kwargs))
ray.exceptions.RayTaskError(ValueError): ray::_RayTrainWorker__execute._set_torch_distributed_env_vars() (pid=31778, ip=172.17.0.4, actor_id=fa58c4a0c38f44f862cb43ec01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fa7fb460d30>)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 30, in __execute
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/config.py", line 146, in _set_torch_distributed_env_vars
    device = get_device()
  File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/train_loop_utils.py", line 107, in get_device
    return torch_utils.get_devices()[0]
  File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/torch_utils.py", line 47, in get_devices
    device_ids.append(cuda_visible_list.index(gpu_id))
ValueError: '0' is not in list
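
For example, a minimal sketch of setting the variable up front (the "0,1" device list assumes a two-GPU node, and the approach assumes locally spawned Ray workers inherit the driver's environment; on a multi-node cluster you would instead set it per node or pass it through a runtime_env):

import os

# Assumed: a node with two GPUs. Listing every GPU the Train workers may be
# assigned lets ray.air._internal.torch_utils.get_devices() find the assigned
# GPU id in the CUDA_VISIBLE_DEVICES list instead of raising the ValueError above.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Import and start Ray only after the environment variable is in place.
import ray

ray.init()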
