[RLlib; Offline RL] - Enable single-learner/multi-learner GPU training. #50034

Open
wants to merge 46 commits into base: master

Conversation

@simonsays1980 (Collaborator) commented Jan 23, 2025

Why are these changes needed?

GPU training in Offline RL was not yet implemented. This PR proposes multiple changes to enable GPU training in single- and multi-learner mode:

  • Disables the NumpyToTensor connector in the learner connector pipeline when num_gpus_per_learner > 0.
  • Changes the GPU assignment that the TorchLearner manages via ray.rllib.utils.framework.get_devices by adding a check for the WORKER_MODE - a local learner is not in WORKER_MODE and would otherwise not get a GPU (note: this holds true for ALL RLlib algos; see the first sketch after this list).
  • Adds another check to Learner.update_from_batch_or_episodes for MultiAgentBatches that do not hold tensors yet - this is the offline local-learner GPU case.
  • Modifies the iteration loop in Learner.update_from_iterator (see the second sketch after this list) to use
    - An outer loop that takes care of iterator exhaustion in case the desired number of updates per RLlib iteration has not been run yet.
    - Metrics for dataset_num_iters_per_learner to keep an eye on desired vs. realized updates per RLlib iteration.
  • Removes the learner parameter from the OfflinePreLearner ctor as it is no longer needed.
  • Instead adds a default module_spec and module_state to the OfflinePreLearner ctor and corresponding logic to Algorithm.build to provide these two arguments during initialization.
  • Adds functionality for single-learner setups to make use of Learner.update_from_iterator, which enables faster updates because each iterator runs without interruption. This includes changes to LearnerGroup._update to handle a list batch that contains a single DataIterator.
  • Modifies the OfflineData.sample calls in all Offline RL algos to use an iterator for updates as long as dataset_num_iters_per_learner != 1.
  • Adds single-GPU tests for all Offline RL algorithms to BUILD.
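First sketch: the WORKER_MODE distinction for device assignment. This is a minimal illustration, not RLlib's actual get_devices code; the `in_worker_mode` flag stands in for the real check, and the device-picking details are assumptions.

```python
import torch


def get_devices_sketch(num_gpus_per_learner: int, in_worker_mode: bool) -> list:
    """Sketch of the device-assignment rule; not RLlib's actual code.

    `in_worker_mode` stands in for RLlib's WORKER_MODE check: True for a
    remote learner actor, False for a local learner running in the driver.
    """
    if num_gpus_per_learner <= 0 or not torch.cuda.is_available():
        return [torch.device("cpu")]
    if in_worker_mode:
        # Remote learner actor: Ray restricts CUDA_VISIBLE_DEVICES to the
        # GPUs it assigned, so cuda:0 is already the right device.
        return [torch.device("cuda:0")]
    # Local learner: Ray manages no GPU assignment for the driver process,
    # so without an explicit branch like this the local learner gets no GPU.
    return [torch.device(f"cuda:{i}") for i in range(num_gpus_per_learner)]
```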
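Second sketch: the outer loop in Learner.update_from_iterator that survives iterator exhaustion. Names like `build_iterator` and `step` are illustrative, not the actual RLlib API.

```python
def update_from_iterator_sketch(build_iterator, step, num_iters: int) -> int:
    """Run `step` on batches until `num_iters` updates have been made.

    `build_iterator` re-creates the data iterator when it is exhausted, so a
    dataset with fewer batches than requested updates is not a problem.
    """
    iters_done = 0
    iterator = iter(build_iterator())
    # Outer loop: keeps going across iterator exhaustion.
    while iters_done < num_iters:
        try:
            batch = next(iterator)
        except StopIteration:
            # Exhausted before reaching `num_iters`: restart the iterator.
            iterator = iter(build_iterator())
            continue
        step(batch)
        iters_done += 1
    # Logged as a metric to compare desired vs. realized updates.
    return iters_done
```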

Related issue number

Closes #50053

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…earner-mode. Also took care of the case where a local learner is used with a GPU without Ray Tune, which did not work before on any algorithm. Made all Offline RL algorithms GPU-trainable. Modified the 'Learner.update_from_iter' such that exhausted iterators are run again until the desired 'dataset_num_iter_per_learner' is reached. Also added metrics for the latter parameter to check whether the desired iterations were indeed run by the learner.

Signed-off-by: simonsays1980 <[email protected]>
…lib iteration and issues a warning when using Ray Tune.

Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
@simonsays1980 simonsays1980 marked this pull request as ready for review January 23, 2025 11:00
@@ -1180,6 +1184,18 @@ def _finalize_fn(batch: Dict[str, numpy.ndarray]) -> Dict[str, Any]:
            value=loss,
            window=1,
        )
        # Record the number of batches pulled from the dataset in this RLlib iteration.
        self.metrics.log_value(
            DATASET_NUM_ITERS_PER_LEARNER_TRAINED,
Contributor:

Ah, yes, this explains my question above. So this will be merged on the algo side, which means that this metric is NOT per Learner, but across all Learners.

-> DATASET_NUM_ITERS_TRAINED and DATASET_NUM_ITERS_TRAINED_LIFETIME

Collaborator (Author):

Got ya now. Yes, this will be aggregated then.

@@ -414,9 +414,15 @@ def _learner_update(
            " local mode! Try setting `config.num_learners > 0`."
        )

-       if isinstance(batch, list) and isinstance(batch[0], ray.ObjectRef):
+       if isinstance(batch, list):
+           # Ensure we are not in a multi-learner setting.
Contributor:

Cool.

@sven1977 (Contributor) left a comment:

Awesome progress with this PR. Thanks so much @simonsays1980 for the hard work. A few suggestions before we can merge.

@sven1977 sven1977 added rllib RLlib related issues rllib-offline-rl Offline RL problems rllib-newstack labels Jan 23, 2025
ruisearch42 and others added 30 commits January 23, 2025 19:10
…default) (ray-project#49971)


The current Ray metrics implementation adds nontrivial overhead per call
to `metric._record`. This is a workaround to avoid calling it multiple
times for every request.

The default is set to 100ms but can be adjusted using the
`RAY_SERVE_METRICS_EXPORT_INTERVAL_MS` environment variable (setting it to
`0` reverts to the previous behavior).
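
For example, to revert to per-call recording (a sketch; the variable must be set before the Serve app starts):

```python
import os

# Assumed usage: set in the environment before Ray Serve starts.
# `0` disables the batched export interval and restores per-call recording.
os.environ["RAY_SERVE_METRICS_EXPORT_INTERVAL_MS"] = "0"
```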

---------

Signed-off-by: Edward Oakes <[email protected]>
The current documentation of `BayesOptSearch` is incomplete/wrong:
- It doesn't specify the version of the package that should be installed (`1.4.3` as far as I can tell: link).
- There is a link to an example notebook, but it is not version-pinned, instead pointing to `master`.
- References to the repo still go to @fmfn's personal repository and not to the organization repository.

Signed-off-by: till-m <[email protected]>
so that it does not pull from anaconda public repos by default.

---------

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: pcmoritz <[email protected]>
Co-authored-by: pcmoritz <[email protected]>
This PR contains an architecture to support defining DashboardModules as
subprocesses. It also contains an example TestModule in a unit test.

When such a module is created, we have:

- a pair of `multiprocessing.Queue`s: child_bound_queue, parent_bound_queue
- serializable message types for both directions
- child process: a module instance that is a subclass of SubprocessModule
- child process: an event loop
- child process: a dedicated thread that listens on the child_bound_queue and dispatches messages to the SubprocessModule
- parent process: an instance of SubprocessModuleHandle
- parent process: a dedicated thread that listens on the parent_bound_queue and dispatches messages to the SubprocessModuleHandle
- a SubprocessRouteTable that provides decorators to define HTTP endpoints

To accept subprocess modules:
- http_dashboard_head.py needs to create a SubprocessModuleHandle, one for each class type
- http_dashboard_head.py binds each SubprocessModuleHandle to the SubprocessRouteTable
- http_dashboard_head.py adds the SubprocessRouteTable to its aiohttp.web.Application

To define a module (see the sketch below):
- create a new file (or modify an existing one)
- define MyModule as a subclass of SubprocessModule
- define handlers as `async def my_method(self, request_body: bytes) -> aiohttp.web.Response:`
- use `@SubprocessRouteTable.get("/my_method")` to wrap handlers
- get registered by http_dashboard_head.py

On the user end: exactly the same.
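
A hedged sketch of such a module definition, following the steps above. The import paths are assumptions; only the base-class names, the handler signature, and the decorator come from this description.

```python
import aiohttp.web

# Assumed import locations; the real ones are wherever the dashboard code
# defines SubprocessModule and SubprocessRouteTable.
from ray.dashboard.subprocesses.module import SubprocessModule
from ray.dashboard.subprocesses.routes import SubprocessRouteTable


class MyModule(SubprocessModule):
    """Runs in its own child process with its own event loop."""

    # Handlers receive the raw request body and return an aiohttp response.
    @SubprocessRouteTable.get("/my_method")
    async def my_method(self, request_body: bytes) -> aiohttp.web.Response:
        return aiohttp.web.Response(text="ok")
```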

---------

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: Chi-Sheng Liu <[email protected]>
Fix broken build on the Windows platform.

---------

Signed-off-by: dentiny <[email protected]>
so that it is being built in the next ray release

Signed-off-by: Lonnie Liu <[email protected]>
so that the version is being pinned and tracked.

The dependency was added in ray-project#49732.

Signed-off-by: Lonnie Liu <[email protected]>
No longer being used anywhere; all release tests are BYOD-based now.

Signed-off-by: Lonnie Liu <[email protected]>
This change runs the existing pre-commit hooks on all files in the CI. Other hooks will be enabled one by one in follow-up PRs, as enabling them will cause file changes.

Hooks that are not enabled in this PR:

trailing-whitespace
end-of-file-fixer
check-json
python-no-log-warn
shellcheck
pretty-format-java

Once all hooks are enabled, they will be replaced by a single command, `pre-commit run --all-files --show-diff-on-failure`, ensuring that all future pre-commit hooks are run in CI.

---------

Signed-off-by: Chi-Sheng Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
The Ray Data autoscaler isn't aware of operators' resource budgets, so
it often makes suboptimal scheduling decisions. To address this issue,
this PR exposes the budgets with the `get_budget` method.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
…ailure (ray-project#49884)

The `read_images` benchmark sometimes fails with this issue:
ray-project#49883. To work around this,
this PR explicitly specifies the mode in the `read_images` call.
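
A sketch of the workaround (the path is illustrative; `mode` follows PIL's image-mode names):

```python
import ray

# Pinning the mode up front avoids shape mismatches when the dataset mixes
# e.g. RGB and grayscale files.
ds = ray.data.read_images("s3://bucket/images/", mode="RGB")
```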

---------

Signed-off-by: Balaji Veeramani <[email protected]>
…t#49878)

ReportHead used to subscribe to DataSource.agents changes to maintain a connection to every single one of the O(#nodes) agents, just to make gRPC calls when needed. This PR removes that; now it only makes on-demand connections when a profiling request comes in.

Changes:
- No longer listens to DataSource.agents.
- On a profiling request, makes one RPC to InternalKV to get the IP and agent port, then creates a stub on the fly and uses it to send outbound requests.
- Changed InternalKV:
  - key: `DASHBOARD_AGENT_PORT_PREFIX` -> `DASHBOARD_AGENT_ADDR_PREFIX`
  - value: `json.dumps([http_port, grpc_port])` -> `json.dumps([ip, http_port, grpc_port])`
- Changed all profiling requests from taking the param `ip` to taking the param `node_id`.
- Added a NodeID filter to the GcsClient API.
- Moved updates of `DataSource.node_physical_stats` to NodeHead.

After this PR, ReportHead no longer needs DataSource.

Smoke tested UI locally.
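
The key/value schema change, spelled out as a sketch (the concrete values are illustrative):

```python
import json

ip, http_port, grpc_port = "10.0.0.5", 52365, 52366

# Before: keyed by DASHBOARD_AGENT_PORT_PREFIX, ports only.
old_value = json.dumps([http_port, grpc_port])

# After: keyed by DASHBOARD_AGENT_ADDR_PREFIX, now including the IP, so a
# gRPC stub can be created on the fly from a node_id lookup.
new_value = json.dumps([ip, http_port, grpc_port])
```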

---------

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
A variable that is defined but not used is likely a mistake and should be removed to avoid confusion.

ref: https://docs.astral.sh/ruff/rules/unused-variable/
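
For example, ruff's unused-variable rule (F841) flags code like this:

```python
def total(prices: list[float]) -> float:
    tax = 0.07  # F841: `tax` is assigned but never used.
    return sum(prices)
```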

---------

Signed-off-by: fscnick <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…monsays1980/ray into offline-rl-enable-gpu-training-simple