What happened + What you expected to happen
Hey!
I was investigating RLlib's pipelines and connectors, a nice DX upgrade over the previous, messier and more hardcoded RLlib implementation.
Through my tests, I've noticed that most of the time spent in the learning_step goes to the connector pipeline rather than to the model, with only 19% of the time spent in the model class doing actual machine learning.
Around 70% of the time is taken by the V2 pipeline: roughly 30% of it just adds the time dimension and zero-pads the data, and another 25% adds columns to the train batch (see flamegraph below).
I have a feeling this could be sped up by allowing space_utils's batch method to work with tensors directly when possible, rather than converting the lists to NumPy arrays every time.
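As a rough illustration of that idea (a hypothetical helper, not RLlib's actual space_utils.batch implementation), a tensor-aware stacking function could look something like the sketch below. It assumes PyTorch and the dm-tree map_structure utility that RLlib already relies on:

```python
import numpy as np
import torch
import tree  # dm-tree, already used throughout RLlib


def batch_structs(list_of_structs):
    """Hypothetical sketch: stack a list of (possibly nested) samples into one
    batched struct, keeping torch tensors on their current device instead of
    converting every leaf to NumPy first."""

    def _stack(*leaves):
        if all(torch.is_tensor(leaf) for leaf in leaves):
            # Stays on the original device (CPU or GPU): no GPU -> CPU -> GPU hop.
            return torch.stack(leaves)
        return np.stack(leaves)

    return tree.map_structure(_stack, *list_of_structs)
```

Usage would be e.g. batch_structs([sample_1, sample_2, ...]) on a list of nested observation/state dicts; the NumPy fallback keeps behavior unchanged for non-tensor leaves.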
If we go down the rabbit hole of checking the flamegraph for every item of the pipeline, the biggest consumer and culprit is generally a map_structure call. Keeping everything a tensor (and keeping it on its device, GPU or CPU) whenever possible would drastically speed up these steps and avoid the unnecessary GPU -> CPU -> GPU transfers that seem to bottleneck performance.
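To make the device round-trip cost concrete, here is a small self-contained micro-benchmark of my own (it assumes PyTorch and an available CUDA device, and only illustrates the effect rather than RLlib's actual code path):

```python
import time

import numpy as np
import torch


def roundtrip_vs_on_device(n=1_000_000, iters=50):
    """Compare the GPU -> CPU -> GPU round-trip that a NumPy-only pipeline
    forces on a GPU tensor with an equivalent operation that stays on device."""
    assert torch.cuda.is_available(), "this illustration assumes a CUDA device"
    x = torch.randn(n, device="cuda")

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        # What numpy-ing a GPU tensor effectively does: device -> host -> device.
        _ = torch.from_numpy(np.ascontiguousarray(x.cpu().numpy())).to("cuda")
    torch.cuda.synchronize()
    roundtrip_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(iters):
        _ = x * 1.0  # equivalent work that never leaves the GPU
    torch.cuda.synchronize()
    on_device_s = time.perf_counter() - t0

    return roundtrip_s, on_device_s


print(roundtrip_vs_on_device())
```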
I'm going to try to optimize a few of these; I wonder if @sven1977 the wizard has any insight on this. Updating this could theoretically give a lot of enterprise hardware a free 30-40% speed increase with no compromise on RLlib's new API.
Nota bene: similar optimizations could also be made to the env_runner connectors, where something similar is happening: a big portion of the time is spent creating the action distribution class in Python rather than computing the logits and the action itself. But that is another problem for another day.
brieyla1 added the bug and triage labels on Jan 22, 2025
Although it may not be as good as the suggested approach of keeping the data as tensors, I appreciate the amount of work that would take. Hence, I've created a quick plug-and-play speedup that improves the slow connectors without changing the output data's type, structure, or shape. Nothing changes besides less time being spent in the connectors. Take a look: #50035
NOTE: Performance in the connectors is still horrendous, but their share of the time went from roughly 70% to roughly 40% (see flamegraph below vs. above). This is essentially a free speed-up.
Something should still be done about the amount of time spent in these connectors and pipelines.
With a good GPU and batch size, the env_runner spends only 4.62% of its time on inference, which leaves about 95% of the time doing something other than inference (that could be a >10x performance gain for exploration!).
Here is the env_runner flamegraph: the part on the extreme left is the model working; everything else is connectors, Gymnasium converting lists to arrays, space_utils batching data, or pickling.
Versions / Dependencies
GH200 machine: 64 Grace CPU cores (ARM), 463.9 GB RAM, 4.4 TB SSD, H100 GPU
Ubuntu 22.04.5 LTS
Python 3.12.2
Ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp312-cp312-manylinux2014_aarch64.whl#sha256=9131183c34d496b7b3f3034c227f70e66891653a0d2313c6fc1d7376bd7a6e66
Reproduction script
batch_size = 256K steps
minibatch = 8K steps
num_epochs = 4
4 env runners with 256 envs each
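The issue lists only the settings above rather than a standalone script, so here is a hypothetical reproduction sketch wiring them into a new-API-stack run. The environment name and the choice of PPO are my assumptions, and exact config option names can differ between Ray versions:

```python
import cProfile
import pstats

from ray.rllib.algorithms.ppo import PPOConfig

# Hypothetical repro sketch: algorithm and env are assumptions; the numbers
# mirror the settings listed above. Option names follow the new API stack
# and may differ slightly between Ray versions.
config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment("CartPole-v1")  # placeholder env
    .env_runners(num_env_runners=4, num_envs_per_env_runner=256)
    .training(
        train_batch_size_per_learner=256_000,
        minibatch_size=8_000,
        num_epochs=4,
    )
)

algo = config.build()

# Profile a few iterations to see how much time lands in the connector
# pipelines vs. the RLModule doing the actual forward/backward work.
with cProfile.Profile() as pr:
    for _ in range(3):
        algo.train()
pstats.Stats(pr).sort_stats("cumulative").print_stats(30)
```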
Issue Severity
Medium: It is a significant difficulty but I can work around it.