
[RLlib 2.41.0] get_device only returns CPU, which causes num_gpus_per_env_runner and num_gpus_per_learner to have no effect #50053

Open · RocketRider opened this issue Jan 24, 2025 · 2 comments · May be fixed by #50034
Assignees: simonsays1980
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), rllib (RLlib related issues), rllib-gpu-multi-gpu (RLlib issues related to running on one or multiple GPUs), rllib-newstack

Comments

@RocketRider (Contributor) commented on Jan 24, 2025:

What happened + What you expected to happen

The following code:

from ray.rllib.utils import try_import_torch

torch, _ = try_import_torch()

from ray.air._internal.torch_utils import get_devices

devices = get_devices()
print(devices)

returns "[device(type='cpu')]"
Even tough pytorch has full access to a cuda device.
torch.device(0)
is returning the cuda:0 device.

If I just replace the function to return the cuda device, the model is running perfectly on the gpu.
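
For illustration, a minimal sketch of that manual workaround (not RLlib code; it simply bypasses get_devices and asks torch directly):

from ray.rllib.utils import try_import_torch

torch, _ = try_import_torch()

# Pick the device directly instead of relying on
# ray.air._internal.torch_utils.get_devices(), which only reports
# devices that Ray itself has assigned to the current actor/task.
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print(device)  # cuda:0 on a machine where PyTorch sees the GPU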

This function is used in multi_agent_env_runner and torch_learner, causing the model to be moved to the CPU:
https://github.com/ray-project/ray/blob/master/rllib/env/multi_agent_env_runner.py#L94
https://github.com/ray-project/ray/blob/master/rllib/core/learner/torch/torch_learner.py#L449

Versions / Dependencies

It was running fine with Ray 2.40.0. The issue is reproducible with 2.41.0 on both Linux and Windows.
Tested with PyTorch 2.3.1 and 2.5.1; same issue in both cases.

Reproduction script

from ray.rllib.utils import try_import_torch

torch, _ = try_import_torch()

from ray.air._internal.torch_utils import get_devices

devices = get_devices()
print(devices)          # prints [device(type='cpu')] even though a GPU is available
print("")
print(torch.device(0))  # prints cuda:0, i.e. PyTorch itself does see the GPU

Issue Severity

None

@RocketRider added the bug and triage labels on Jan 24, 2025
@simonsays1980 self-assigned this on Jan 24, 2025
@simonsays1980 added the P1 label on Jan 24, 2025
@simonsays1980 (Collaborator) commented:

@RocketRider Thanks for raising this issue. I can reproduce it. The background is that get_devices only returns devices that are managed by Ray (e.g., if your algorithm runs inside a ray.tune trial). If this is not the case (i.e., if it is not called from inside a Ray actor or task, as with a local learner where num_learners=0), the CUDA devices are not visible to Ray. In this case the device needs to be assigned manually (e.g., via torch.cuda).

The related PR fixes ray.rllib.utils.framework.get_device, which uses ray.air._internal.torch_utils.get_devices internally, by handling the case where the caller is not a Ray actor.
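
For readers following along, a rough sketch of the fallback behavior described above (illustrative only; the actual implementation in the PR may differ in detail):

import ray
import torch
from ray.air._internal.torch_utils import get_devices

def resolve_device() -> torch.device:
    # Inside a Ray actor or task (e.g. a remote Learner started by a Tune
    # trial), Ray manages GPU assignment, so get_devices() is correct.
    if ray.is_initialized():
        ctx = ray.get_runtime_context()
        if ctx.get_actor_id() is not None or ctx.get_task_id() is not None:
            return get_devices()[0]
    # Local (driver-side) learner/env-runner, e.g. num_learners=0:
    # Ray does not expose the GPU here, so fall back to torch's own view.
    return torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")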

@simonsays1980 added the rllib, rllib-system, rllib-newstack, and rllib-gpu-multi-gpu labels and removed the triage and rllib-system labels on Jan 24, 2025
@RocketRider (Contributor, Author) commented:

Thank you for the answer, @simonsays1980.
We used the algorithm directly without Tune, which triggers exactly the case you are fixing in the PR.
To validate your comment and get it running, I set up Tune, and now it runs fine. Thank you, I wouldn't have guessed to solve it that way.
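
For anyone hitting the same thing, a rough sketch of such a Tune-based setup (algorithm, environment, and resource numbers are placeholders, and the exact config method names may vary slightly between RLlib versions):

from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig  # PPO chosen only as an example

config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder environment
    .learners(num_learners=1, num_gpus_per_learner=1)
)

# Running through Tune puts the Learner inside a Ray actor, so
# get_devices() returns the GPU that Ray assigned to the trial.
tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=train.RunConfig(stop={"training_iteration": 1}),
)
tuner.fit()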
