What happened + What you expected to happen
Hey!
I was investigating RLlib's pipelines and connectors, a nice DX upgrade over the previous, messier and more hardcoded RLlib implementation.
Through my tests, I've noticed that most of the time spent in the learning_step goes to the connector pipeline rather than to the model, with only 19% of the time spent in the model class doing actual machine learning.
Around 70% of the time is taken by the V2 pipeline: roughly 30% of it just adds the time dimension and zero-pads the data, and another 25% adds columns to the train batch (see flamegraph below).
I have a feeling this could be sped up by allowing space_utils's batch method to work with tensors directly when possible, rather than converting the lists to NumPy arrays every time.
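As a rough illustration of that idea (a hypothetical helper, not RLlib's actual space_utils.batch implementation), a tensor-aware stacking function could look something like the sketch below. It assumes PyTorch and the dm-tree map_structure utility that RLlib already relies on:

```python
import numpy as np
import torch
import tree  # dm-tree, already used throughout RLlib


def batch_structs(list_of_structs):
    """Hypothetical sketch: stack a list of (possibly nested) samples into one
    batched struct, keeping torch tensors on their current device instead of
    converting every leaf to NumPy first."""

    def _stack(*leaves):
        if all(torch.is_tensor(leaf) for leaf in leaves):
            # Stays on the original device (CPU or GPU): no GPU -> CPU -> GPU hop.
            return torch.stack(leaves)
        return np.stack(leaves)

    return tree.map_structure(_stack, *list_of_structs)
```

Usage would be e.g. batch_structs([sample_1, sample_2, ...]) on a list of nested observation/state dicts; the NumPy fallback keeps behavior unchanged for non-tensor leaves.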
If we go down the rabbit hole of checking the flamegraph for every item of the pipeline, the biggest consumer and culprit is generally a map_structure call. Keeping everything a tensor (and keeping it on its device, GPU or CPU) whenever possible would drastically speed up these steps and avoid the unnecessary GPU -> CPU -> GPU transfers that seem to bottleneck performance.
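To make the device round-trip cost concrete, here is a small self-contained micro-benchmark of my own (it assumes PyTorch and an available CUDA device, and only illustrates the effect rather than RLlib's actual code path):

```python
import time

import numpy as np
import torch


def roundtrip_vs_on_device(n=1_000_000, iters=50):
    """Compare the GPU -> CPU -> GPU round-trip that a NumPy-only pipeline
    forces on a GPU tensor with an equivalent operation that stays on device."""
    assert torch.cuda.is_available(), "this illustration assumes a CUDA device"
    x = torch.randn(n, device="cuda")

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        # What numpy-ing a GPU tensor effectively does: device -> host -> device.
        _ = torch.from_numpy(np.ascontiguousarray(x.cpu().numpy())).to("cuda")
    torch.cuda.synchronize()
    roundtrip_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(iters):
        _ = x * 1.0  # equivalent work that never leaves the GPU
    torch.cuda.synchronize()
    on_device_s = time.perf_counter() - t0

    return roundtrip_s, on_device_s


print(roundtrip_vs_on_device())
```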
I'm going to try to optimize a few of these; I wonder if @sven1977 the wizard has any insight on this. Updating this could theoretically give a lot of enterprise hardware a free 30-40% speed increase with no compromise on RLlib's new API.
Nota bene: similar optimizations could also be made to the env_runner connectors, where something similar is happening: a big portion of the time is spent creating the action distribution class in Python rather than computing the logits and the action itself. But that is another problem for another day.
brieyla1 added the bug and triage labels on Jan 22, 2025
Although it may not be as good as the suggested approach of keeping the data as tensors, I appreciate the amount of work that would take. Hence, I've created a quick plug-and-play speedup that improves the slow connectors without changing the output data's type, structure, or shape. Nothing changes besides less time being spent in the connectors. Take a look: #50035
NOTE: Performance in the connectors is still horrendous, but their share of the time went from roughly 70% to roughly 40% (see flamegraph below vs. above). This is essentially a free speed-up.
Something should still be done about the amount of time spent in these connectors and pipelines.
With a good GPU and batch size, the env_runner spends only 4.62% of its time on inference, which leaves about 95% of the time doing something other than inference (that could be a >10x performance gain for exploration!).
Here is the env_runner flamegraph: the part on the extreme left is the model working; everything else is connectors, Gymnasium converting lists to arrays, space_utils batching data, or pickling.
Versions / Dependencies
GH200 machine: 64 Grace CPU cores (ARM), 463.9 GB RAM, 4.4 TB SSD, H100 GPU
Ubuntu 22.04.5 LTS
Python 3.12.2
Ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp312-cp312-manylinux2014_aarch64.whl#sha256=9131183c34d496b7b3f3034c227f70e66891653a0d2313c6fc1d7376bd7a6e66
Reproduction script
batch_size = 256K steps
minibatch = 8K steps
num_epochs = 4
4 env runners with 256 envs each
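The issue lists only the settings above rather than a standalone script, so here is a hypothetical reproduction sketch wiring them into a new-API-stack run. The environment name and the choice of PPO are my assumptions, and exact config option names can differ between Ray versions:

```python
import cProfile
import pstats

from ray.rllib.algorithms.ppo import PPOConfig

# Hypothetical repro sketch: algorithm and env are assumptions; the numbers
# mirror the settings listed above. Option names follow the new API stack
# and may differ slightly between Ray versions.
config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment("CartPole-v1")  # placeholder env
    .env_runners(num_env_runners=4, num_envs_per_env_runner=256)
    .training(
        train_batch_size_per_learner=256_000,
        minibatch_size=8_000,
        num_epochs=4,
    )
)

algo = config.build()

# Profile a few iterations to see how much time lands in the connector
# pipelines vs. the RLModule doing the actual forward/backward work.
with cProfile.Profile() as pr:
    for _ in range(3):
        algo.train()
pstats.Stats(pr).sort_stats("cumulative").print_stats(30)
```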
Issue Severity
Medium: It is a significant difficulty but I can work around it.