[RLLIB] Connector Pipeline is very slow #50011

Open
brieyla1 opened this issue Jan 22, 2025 · 1 comment
Labels: bug (something that is supposed to be working, but isn't), rllib (RLlib related issues), triage (needs triage: priority, bug/not-bug, owning component)

Comments

@brieyla1

What happened + What you expected to happen

Hey!

I was investigating RLlib's pipelines and connectors, a nice DX upgrade over the previous, messier and more hard-coded implementation of RLlib.

Through my tests, I've noticed that most of the time spent in the learning_step was in the pipeline rather than in the model, with only 19% of the time spent in the model class doing actual machine learning.

Around 70% of the time is taken by the V2 pipeline: roughly 30% of it just adds the time dimension and zero-pads the data, and another 25% adds columns to the train batch (see the flamegraph below).

I have a feeling this could be sped up by allowing space_utils's batch method to work with tensors directly when possible, rather than converting the lists to NumPy arrays every time.

# Abridged from:
# https://github.com/ray-project/ray/blob/master/rllib/utils/spaces/space_utils.py#L325C1-L375C1
from typing import Any, List, Union

import numpy as np
import tree  # dm_tree (DeveloperAPI and BatchedNdArray are RLlib-internal)


@DeveloperAPI
def batch(
    list_of_structs: List[Any],
    *,
    individual_items_already_have_batch_dim: Union[bool, str] = False,
):
    ...  # docstring and input validation omitted
    # Resolve "auto" by checking the first leaf of the first struct.
    if individual_items_already_have_batch_dim == "auto":
        flat = tree.flatten(list_of_structs[0])
        individual_items_already_have_batch_dim = isinstance(flat[0], BatchedNdArray)

    # Every leaf goes through NumPy here, even when it is already a tensor.
    np_func = np.concatenate if individual_items_already_have_batch_dim else np.stack
    ret = tree.map_structure(lambda *s: np_func(s, axis=0), *list_of_structs)
    return ret
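
For illustration, here is a minimal sketch of what a torch-aware fast path could look like (batch_keep_tensors is a hypothetical name, not RLlib API; it falls back to the existing NumPy behavior for non-tensor leaves):

import numpy as np
import torch
import tree  # dm_tree

def batch_keep_tensors(list_of_structs, *, items_have_batch_dim=False):
    # Inspect the first struct as a representative: if all of its leaves
    # are torch tensors, stack/concat on their current device instead of
    # round-tripping through NumPy.
    flat = tree.flatten(list_of_structs[0])
    if flat and all(isinstance(x, torch.Tensor) for x in flat):
        t_func = torch.cat if items_have_batch_dim else torch.stack
        return tree.map_structure(lambda *s: t_func(s, dim=0), *list_of_structs)
    # Fallback: the original NumPy path.
    np_func = np.concatenate if items_have_batch_dim else np.stack
    return tree.map_structure(lambda *s: np_func(s, axis=0), *list_of_structs)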

If we go down the rabbit hole of checking the flamegraph time consumption of every item in the pipeline, the biggest consumer and culprit is generally a map_structure call. Keeping everything a tensor (and on its current device, GPU or CPU) when possible would drastically speed up these processes and avoid the unnecessary GPU -> CPU -> GPU round trips that seem to bottleneck performance.
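
To make the round-trip cost concrete, here is a toy comparison (sizes and names are illustrative only). Stacking CUDA tensors via NumPy forces a device-to-host copy per item plus a copy back, while a torch-native stack stays on the GPU:

import numpy as np
import torch

items = [torch.randn(128, 84, device="cuda") for _ in range(256)]

# Current-style path: every tensor is copied to host memory for np.stack,
# and the result is then moved back to the GPU for the model.
host_batch = np.stack([t.cpu().numpy() for t in items], axis=0)
gpu_again = torch.from_numpy(host_batch).to("cuda")  # GPU -> CPU -> GPU

# Tensor-aware path: a single on-device stack, no host round trip.
gpu_batch = torch.stack(items, dim=0)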

[Image: flamegraph of the learner-side connector pipeline]

I'm going to try to optimize a few of these; I wonder if @sven1977 the wizard has any insight on this. Updating this could theoretically give a lot of enterprise hardware a free 30-40% speed increase with no compromise on RLlib's new API.

Nota bene: similar optimizations could also be made to the env_runner connectors, since something similar happens there: a big portion of the time is spent creating the action distribution class in Python rather than computing the logits and the action itself. But that is another problem for another day.
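
As a rough illustration of that aside, using torch.distributions as a stand-in for RLlib's distribution classes: building a fresh distribution object for every forward pass pays Python-side construction costs on top of the couple of tensor ops that sampling actually requires.

import torch

logits = torch.randn(256, 6)  # illustrative batch of action logits

# Per-call object construction: a new Categorical is instantiated each
# time, with Python-level __init__/validation work before any math runs.
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()

# The sampling math itself is just two tensor ops:
probs = torch.softmax(logits, dim=-1)
actions_direct = torch.multinomial(probs, num_samples=1).squeeze(-1)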

Versions / Dependencies

GH200 machine | 64 Grace CPU cores (ARM), 463.9 GB RAM, 4.4 TB SSD, H100

Ubuntu 22.04.5 LTS
Python 3.12.2
Ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp312-cp312-manylinux2014_aarch64.whl#sha256=9131183c34d496b7b3f3034c227f70e66891653a0d2313c6fc1d7376bd7a6e66

Reproduction script

batch_size = 256K steps
minibatch = 8K steps
num_epochs = 4
4 env runners, each with 256 envs (a minimal config sketch follows)
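
No full script was attached; below is a minimal config sketch matching those numbers, assuming PPO on the new API stack (CartPole-v1 is a placeholder environment, and the option names should be checked against your Ray version):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder env
    .env_runners(
        num_env_runners=4,            # 4 env runners...
        num_envs_per_env_runner=256,  # ...each running 256 envs
    )
    .training(
        train_batch_size_per_learner=256_000,  # 256K steps
        minibatch_size=8_000,                  # 8K steps
        num_epochs=4,
    )
)

algo = config.build()
algo.train()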

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@brieyla1 added the bug and triage labels on Jan 22, 2025
@jcotant1 added the rllib label on Jan 23, 2025
@brieyla1 (Author) commented Jan 23, 2025

Hey all!

While it may not be as good as the suggested implementation of keeping the data as tensors, and I appreciate the amount of work needed to get that done, I've created a quick plug-and-play speedup that improves the slow connectors without changing the output data's type, structure, or shape. Nothing changes except that less time is spent in the connectors. Take a look: #50035

NOTE: Performance in the connectors is still horrendous, but their share of the time went from ~70% to ~40% (compare the flamegraph below with the one above). This is essentially a free speed-up.

[Image: flamegraph of the learner step after the patch in #50035]

Something should be done about the amount of time spent in these connectors and pipelines. With a good GPU and batch size, the env_runner spends only 4.62% of its time on inference, which leaves 95% of the time doing something other than inference (that could be a >10x performance gain for exploration!).

Here is the env_runner flamegraph; the part on the far left is the model working. Everything else is connectors, Gymnasium converting lists to arrays, space_utils batching data, or pickling.
[Image: env_runner flamegraph]
