
[RLlib 2.41.0] get_device only returns CPU, which causes num_gpus_per_env_runner and num_gpus_per_learner to have no effect #50053

Open · RocketRider opened this issue Jan 24, 2025 · 2 comments · May be fixed by #50034
Assignees: simonsays1980
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), rllib (RLlib related issues), rllib-gpu-multi-gpu (RLlib issues related to running on one or multiple GPUs), rllib-newstack

Comments

@RocketRider (Contributor) commented on Jan 24, 2025:

What happened + What you expected to happen

The following code:

from ray.rllib.utils import try_import_torch

torch, _ = try_import_torch()

from ray.air._internal.torch_utils import get_devices

devices = get_devices()
print(devices)

returns "[device(type='cpu')]"
Even tough pytorch has full access to a cuda device.
torch.device(0)
is returning the cuda:0 device.

If I just replace the function to return the cuda device, the model is running perfectly on the gpu.
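
For illustration, a minimal sketch of that manual workaround (not RLlib code; it simply bypasses get_devices and asks torch directly):

from ray.rllib.utils import try_import_torch

torch, _ = try_import_torch()

# Pick the device directly instead of relying on
# ray.air._internal.torch_utils.get_devices(), which only reports
# devices that Ray itself has assigned to the current actor/task.
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print(device)  # cuda:0 on a machine where PyTorch sees the GPU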

This function is used in multi_agent_env_runner and torch_learner, causing the model to be moved to the CPU:
https://github.com/ray-project/ray/blob/master/rllib/env/multi_agent_env_runner.py#L94
https://github.com/ray-project/ray/blob/master/rllib/core/learner/torch/torch_learner.py#L449

Versions / Dependencies

It was running fine with Ray 2.40.0. The issue is reproducible with 2.41.0 on both Linux and Windows.
Tested with PyTorch 2.3.1 and 2.5.1; same issue in both cases.

Reproduction script

from ray.rllib.utils import try_import_torch

torch, _ = try_import_torch()

from ray.air._internal.torch_utils import get_devices

devices = get_devices()
print(devices)          # prints [device(type='cpu')] even though a GPU is available
print("")
print(torch.device(0))  # prints cuda:0, i.e. PyTorch itself does see the GPU

Issue Severity

None

@RocketRider added the bug and triage labels on Jan 24, 2025
@simonsays1980 self-assigned this on Jan 24, 2025
@simonsays1980 added the P1 label on Jan 24, 2025
@simonsays1980 (Collaborator) commented:

@RocketRider Thanks for raising this issue. I can reproduce it. The background is that get_devices only returns devices that are managed by Ray (e.g., if your algorithm runs inside a ray.tune trial). If this is not the case (i.e., if it is not called from inside a Ray actor or task, as with a local learner where num_learners=0), the CUDA devices are not visible to Ray. In this case the device needs to be assigned manually (e.g., via torch.cuda).

The related PR fixes ray.rllib.utils.framework.get_device, which uses ray.air._internal.torch_utils.get_devices internally, by handling the case where the caller is not a Ray actor.
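
For readers following along, a rough sketch of the fallback behavior described above (illustrative only; the actual implementation in the PR may differ in detail):

import ray
import torch
from ray.air._internal.torch_utils import get_devices

def resolve_device() -> torch.device:
    # Inside a Ray actor or task (e.g. a remote Learner started by a Tune
    # trial), Ray manages GPU assignment, so get_devices() is correct.
    if ray.is_initialized():
        ctx = ray.get_runtime_context()
        if ctx.get_actor_id() is not None or ctx.get_task_id() is not None:
            return get_devices()[0]
    # Local (driver-side) learner/env-runner, e.g. num_learners=0:
    # Ray does not expose the GPU here, so fall back to torch's own view.
    return torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")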

@simonsays1980 added the rllib, rllib-system, rllib-newstack, and rllib-gpu-multi-gpu labels and removed the triage and rllib-system labels on Jan 24, 2025
@RocketRider (Contributor, Author) commented:

Thank you for the answer, @simonsays1980.
We used the algorithm directly without Tune, which triggers exactly the case you are fixing in the PR.
To validate your comment and get it running, I set up Tune, and now it runs fine. Thank you, I wouldn't have guessed to solve it that way.
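
For anyone hitting the same thing, a rough sketch of such a Tune-based setup (algorithm, environment, and resource numbers are placeholders, and the exact config method names may vary slightly between RLlib versions):

from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig  # PPO chosen only as an example

config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder environment
    .learners(num_learners=1, num_gpus_per_learner=1)
)

# Running through Tune puts the Learner inside a Ray actor, so
# get_devices() returns the GPU that Ray assigned to the trial.
tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=train.RunConfig(stop={"training_iteration": 1}),
)
tuner.fit()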
