[RLlib 2.41.0] get_device only returns cpu this causes num_gpus_per_env_runner and num_gpus_per_learner not working #50053
Labels: bug, P1, rllib, rllib-gpu-multi-gpu, rllib-newstack
What happened + What you expected to happen
The following code:
returns "[device(type='cpu')]"
even though PyTorch has full access to a CUDA device:
torch.device(0)
returns the cuda:0 device.
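To separate what PyTorch itself can see from what RLlib selects, a small diagnostic like the one below may help. It is only a sketch: `describe_torch_devices` is a hypothetical helper name, not part of RLlib or PyTorch, and it relies solely on `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
import importlib.util


def describe_torch_devices():
    """Report what PyTorch can see, independent of RLlib's device selection.

    Hypothetical helper for this report; not part of RLlib or PyTorch.
    """
    # Degrade gracefully when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    import torch

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    return {
        "torch_installed": True,
        "cuda_available": available,
        # One entry per CUDA device visible to this process.
        "devices": [f"cuda:{i}" for i in range(count)],
    }


if __name__ == "__main__":
    print(describe_torch_devices())
```

If this lists at least one `cuda:N` entry while RLlib still places the model on the CPU, the problem is likely in RLlib's device selection rather than in the PyTorch installation.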
If I patch the function to return the CUDA device instead, the model runs perfectly on the GPU.
This function is used in multi_agent_env_runner and torch_learner, which causes the model to be transferred to the CPU:
https://github.com/ray-project/ray/blob/master/rllib/env/multi_agent_env_runner.py#L94
https://github.com/ray-project/ray/blob/master/rllib/core/learner/torch/torch_learner.py#L449
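The behavior expected from the device-selection step can be sketched roughly as follows. This is an illustration of the expectation only, not RLlib's actual implementation: `resolve_device` and its parameters are hypothetical names, and the function returns plain device strings that could be passed to `torch.device(...)`.

```python
def resolve_device(num_gpus_requested, cuda_device_count):
    """Hypothetical stand-in for the device-selection step (not RLlib's code).

    num_gpus_requested: value such as num_gpus_per_env_runner or
        num_gpus_per_learner (may be fractional).
    cuda_device_count: number of CUDA devices visible to the process,
        e.g. the result of torch.cuda.device_count().
    """
    # Use the GPU only when one was requested and one is actually visible.
    if num_gpus_requested > 0 and cuda_device_count > 0:
        return "cuda:0"
    # Otherwise fall back to the CPU.
    return "cpu"


if __name__ == "__main__":
    print(resolve_device(1, 1))  # GPU requested and visible
    print(resolve_device(0, 1))  # no GPU requested
    print(resolve_device(1, 0))  # GPU requested but none visible
```

In the reported case, a GPU is requested and visible, so the expected result would be a CUDA device, whereas 2.41.0 appears to yield only the CPU.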
Versions / Dependencies
It was running fine with Ray 2.40.0. The issue is reproducible with 2.41.0 on Linux and Windows.
Tested with PyTorch 2.3.1 and 2.5.1; same issue.
Reproduction script
Issue Severity
None