[Train] Is RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES expected to work in Train? #49985
Comments
@choosehappy In the case of fractional GPUs, since the set of devices used by all workers is just a single GPU, there's only that one device ID to share. You can disable this Ray Train behavior with …

We set this default because workers on the same node should be able to do cross-GPU communication, but we exclude unused GPUs since the actual worker group doesn't need to communicate with them.
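To make that default concrete, here is a small sketch (assuming a single node with two whole GPUs; the trainer body is a stand-in, not taken from this thread) showing what each worker sees:

import os

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # With the default sharing behavior, every Train worker on the node sees
    # the union of the GPUs assigned to the worker group (e.g. "0,1"), which
    # is what allows cross-GPU communication between co-located workers.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    # The GPU actually reserved for this particular worker is still just one.
    print("assigned GPU ids =", ray.get_gpu_ids())


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()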
Yea, cool! That little nugget appears to work! Thanks for pointing it out!

One note for anyone who stumbles upon this: for it to work successfully you must explicitly set CUDA_VISIBLE_DEVICES a priori, otherwise you will get this error (see the sketch after the traceback below):

2025-01-28 15:56:19,047 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_64274_00000]
ray.exceptions.RayTaskError(ValueError): ray::_Inner.train() (pid=31654, ip=172.17.0.4, actor_id=87d4a857450cc5109f7b76c401000000, repr=TorchTrainer)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
training_func=lambda: self._trainable_func(self.config),
File "/usr/local/lib/python3.10/dist-packages/ray/train/base_trainer.py", line 799, in _trainable_func
super()._trainable_func(self._merged_config)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func
output = fn()
File "/usr/local/lib/python3.10/dist-packages/ray/train/base_trainer.py", line 107, in _train_coordinator_fn
trainer.training_loop()
File "/usr/local/lib/python3.10/dist-packages/ray/train/data_parallel_trainer.py", line 460, in training_loop
training_iterator = self._training_iterator_cls(
File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 51, in __init__
self._start_training(
File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 76, in _start_training
self._run_with_error_handling(
File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 89, in _run_with_error_handling
return func()
File "/usr/local/lib/python3.10/dist-packages/ray/train/trainer.py", line 77, in <lambda>
lambda: self._backend_executor.start_training(
File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/backend_executor.py", line 535, in start_training
self._backend.on_training_start(self.worker_group, self._backend_config)
File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/config.py", line 210, in on_training_start
worker_group.execute(_set_torch_distributed_env_vars)
File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 272, in execute
return ray.get(self.execute_async(func, *args, **kwargs))
ray.exceptions.RayTaskError(ValueError): ray::_RayTrainWorker__execute._set_torch_distributed_env_vars() (pid=31778, ip=172.17.0.4, actor_id=fa58c4a0c38f44f862cb43ec01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fa7fb460d30>)
File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/train/_internal/worker_group.py", line 30, in __execute
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/config.py", line 146, in _set_torch_distributed_env_vars
device = get_device()
File "/usr/local/lib/python3.10/dist-packages/ray/train/torch/train_loop_utils.py", line 107, in get_device
return torch_utils.get_devices()[0]
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/torch_utils.py", line 47, in get_devices
device_ids.append(cuda_visible_list.index(gpu_id))
ValueError: '0' is not in list
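For reference, a minimal sketch of the setup the note above describes. This is not the original script; the single-node assumption, the two-GPU value "0,1", and the placement of the variables before ray.init() are all assumptions:

import os

# Assumption: a single-node setup launched from this driver process, so the
# Ray workers inherit these variables. Normally you would export them in the
# shell (or a runtime_env) before starting Ray.
os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"  # Ray won't rewrite CUDA_VISIBLE_DEVICES
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # hypothetical 2-GPU node; must be set explicitly

# Without the explicit CUDA_VISIBLE_DEVICES above, ray.train.torch.get_device()
# looks the assigned GPU id up in an unset/empty list and fails with
# "ValueError: '0' is not in list", as in the traceback.

import ray

ray.init()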
What happened + What you expected to happen
Setting a GPU to a fractional value appears to cause RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES to be ignored when using TorchTrainer, as demonstrated below:
I’m using Ray 2.40, and this works as expected
With output:
However, adding a fractional GPU resource like this
Now causes this output:
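The actual scripts are not reproduced in this excerpt; below is a minimal sketch of the failing configuration. The worker count, the 0.5 fraction, and the train-loop body are assumptions standing in for the original reproduction script:

import os

# Assumption: set before Ray starts so it propagates to the Train workers.
os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"

import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Hypothetical stand-in for the real training loop: just report what this
    # worker can see.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("visible GPU count =", torch.cuda.device_count())


# Without resources_per_worker the run behaves as expected; requesting a
# fractional GPU per worker is what triggers the behavior reported above.
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"GPU": 0.5},  # assumed fractional value
    ),
)
trainer.fit()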
We're still trying to work elegantly around the lack of GPU spreading, as discussed in #48012. Self-management of the GPUs would be an easy, acceptable solution!
Versions / Dependencies
ray==2.40.0
Python 3.10.12
Docker container: nvcr.io/nvidia/pytorch:24.08-py3
Reproduction script
As provided above
Issue Severity
High: It blocks me from completing my task.