[RLlib; Offline RL] - Enable single-learner/multi-learner GPU training. #50034
base: master
Conversation
…earner-mode. Also took care of the case in which a local learner uses a GPU without Ray Tune, which did not work before on any algorithm. Made all Offline RL algorithms GPU-trainable. Modified 'Learner.update_from_iter' such that exhausted iterators are run again until the desired 'dataset_num_iter_per_learner' is reached (see the sketch after this commit list). Also added metrics for the latter parameter to check whether the desired number of iterations was indeed run by the learner. Signed-off-by: simonsays1980 <[email protected]>
…lib iteration and issues a warning when using Ray Tune. Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
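The iterator re-run behavior described in the first commit message above can be pictured with a short, self-contained sketch (illustrative only; `make_iterator`, `update_on_batch`, and `num_iters` are hypothetical names, not RLlib's actual signatures): whenever the data iterator is exhausted before the configured number of update steps has been reached, a fresh iterator is created and iteration continues until the target count is hit.

from typing import Any, Callable, Iterator


def update_until_num_iters(
    make_iterator: Callable[[], Iterator[Any]],
    update_on_batch: Callable[[Any], None],
    num_iters: int,
) -> int:
    # Run exactly `num_iters` updates, re-creating the data iterator
    # whenever it is exhausted before the target count is reached.
    iters_done = 0
    iterator = make_iterator()
    while iters_done < num_iters:
        try:
            batch = next(iterator)
        except StopIteration:
            # Iterator exhausted early: start over with a fresh iterator.
            iterator = make_iterator()
            continue
        update_on_batch(batch)
        iters_done += 1
    return iters_done

A metric comparing `num_iters` with the returned count then reveals whether the desired number of iterations was actually run.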
rllib/core/learner/learner.py
Outdated
@@ -1180,6 +1184,18 @@ def _finalize_fn(batch: Dict[str, numpy.ndarray]) -> Dict[str, Any]:
    value=loss,
    window=1,
)
# Record the number of batches pulled from the dataset in this RLlib iteration.
self.metrics.log_value(
    DATASET_NUM_ITERS_PER_LEARNER_TRAINED,
Ah, yes, this explains my question above. So this will be merged on the algo side, which means that this metric is NOT per Learner, but across all Learners.
-> DATASET_NUM_ITERS_TRAINED
and DATASET_NUM_ITERS_TRAINED_LIFETIME
Got ya now. Yes this will be aggregated then.
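To illustrate the point made above (a plain-Python sketch, not RLlib's actual metrics code; the dictionary key is only meant to mirror the discussed constant): if each Learner logs how many dataset iterations it ran, the algorithm-side reduction sums those per-Learner values, so the reported metric counts iterations across all Learners.

# Hypothetical per-Learner results, e.g., from two Learner workers.
learner_results = [
    {"dataset_num_iters_trained": 5},  # Learner 0
    {"dataset_num_iters_trained": 5},  # Learner 1
]

# Algorithm-side reduction: summing yields the metric across all Learners,
# not a per-Learner value.
dataset_num_iters_trained = sum(
    result["dataset_num_iters_trained"] for result in learner_results
)
assert dataset_num_iters_trained == 10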
@@ -414,9 +414,15 @@ def _learner_update(
    " local mode! Try setting `config.num_learners > 0`."
)

if isinstance(batch, list) and isinstance(batch[0], ray.ObjectRef):
if isinstance(batch, list):
    # Ensure we are not in a multi-learner setting.
Cool.
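For readers skimming the hunk above, here is a condensed, hypothetical sketch of the dispatch idea it touches (the helper callables are placeholders, not the real `LearnerGroup` internals): a `list` batch may carry `ray.ObjectRef`s in the multi-learner case, or a single data iterator in the local-learner case.

from typing import Any, Callable, List, Union

import ray


def dispatch_update(
    batch: Union[List[Any], Any],
    update_remote: Callable[[List["ray.ObjectRef"]], Any],
    update_local_from_iterator: Callable[[Any], Any],
    update_local: Callable[[Any], Any],
) -> Any:
    # Illustrative routing only.
    if isinstance(batch, list):
        if batch and isinstance(batch[0], ray.ObjectRef):
            # Multi-learner case: one ObjectRef per remote Learner.
            return update_remote(batch)
        # Single local Learner: the list holds one data iterator to consume.
        return update_local_from_iterator(batch[0])
    # A plain batch: update the local Learner directly.
    return update_local(batch)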
Signed-off-by: Sven Mika <[email protected]>
Awesome progress with this PR. Thanks so much @simonsays1980 for the hard work. A few suggestions before we can merge.
Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
…ay-project#49904) Signed-off-by: Rui Qiao <[email protected]>
…default) (ray-project#49971) The current Ray metrics implementation adds nontrivial overhead per call to `metric._record`. This is a workaround to avoid calling it multiple times for every request. The default is set to 100ms but can be adjusted using the `RAY_SERVE_METRICS_EXPORT_INTERVAL_MS` environment variable (set to `0` reverts to the previous behavior). --------- Signed-off-by: Edward Oakes <[email protected]>
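A small sketch of how the knob described above could be set (an assumption on my part that the variable must be present in the environment of the Ray Serve processes before they start; the value here is arbitrary):

import os

# Export Serve request metrics every 500 ms instead of on every call;
# per the commit message, "0" reverts to the previous per-request behavior.
os.environ.setdefault("RAY_SERVE_METRICS_EXPORT_INTERVAL_MS", "500")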
Signed-off-by: wa101 <[email protected]>
The current documentation of `BayesOptSearch` is incomplete/wrong. - It doesn't specify the version of the package that should be installed (`1.4.3` as far as I can tell: link). - There is a link to an example notebook, but it is not version pinned, instead pointing to `master`. - References to the repo still go to @fmfn's personal repository and not to the organization repository. Signed-off-by: till-m <[email protected]>
so that it does not pull from anaconda public repos by default. --------- Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: pcmoritz <[email protected]> Co-authored-by: pcmoritz <[email protected]>
…-project#49906) Signed-off-by: dentiny <[email protected]>
This PR contains an architecture to support defining DashboardModules as subprocesses. It also contains an example TestModule in a unit test. When such a module is created, we have:
- a pair of `multiprocessing.Queue`s: child_bound_queue, parent_bound_queue
- serializable message types for both directions
- child process: a module instance that is a subclass of SubprocessModule
- child process: an event loop
- child process: a dedicated thread that listens on the child_bound_queue and dispatches messages to the SubprocessModule
- parent process: an instance of SubprocessModuleHandle
- parent process: a dedicated thread that listens on the parent_bound_queue and dispatches messages to the SubprocessModuleHandle
- a SubprocessRouteTable that provides decorators to define HTTP endpoints.

To accept subprocess modules:
- http_dashboard_head.py needs to create a SubprocessModuleHandle, one for each class type
- http_dashboard_head.py binds each SubprocessModuleHandle to the SubprocessRouteTable
- http_dashboard_head.py adds the SubprocessRouteTable to its aiohttp.web.Application

To define a module (see the sketch below):
- create a new file (or modify an existing one)
- define MyModule as a subclass of SubprocessModule
- define handlers as `async def my_method(self, request_body: bytes) -> aiohttp.web.Response:`
- wrap handlers with `@SubprocessRouteTable.get("/my_method")`
- get it registered by http_dashboard_head.py

On the user end: exactly the same. --------- Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Co-authored-by: Chi-Sheng Liu <[email protected]>
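Based only on the API named in the commit message above, a sketch of what defining such a module could look like (the import paths are assumptions and may not match the actual file layout):

import aiohttp.web

# Assumed import locations; the commit message only names the classes.
from ray.dashboard.subprocesses.module import SubprocessModule
from ray.dashboard.subprocesses.routes import SubprocessRouteTable


class MyModule(SubprocessModule):
    """Example dashboard module that runs in its own subprocess."""

    @SubprocessRouteTable.get("/my_method")
    async def my_method(self, request_body: bytes) -> aiohttp.web.Response:
        # Handlers receive the raw request body and return an aiohttp response.
        return aiohttp.web.Response(text="ok")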
Fix broken build on windows platform. --------- Signed-off-by: dentiny <[email protected]>
so that it is being built in the next ray release Signed-off-by: Lonnie Liu <[email protected]>
so that the version is being pinned and tracked. The dependency was added in ray-project#49732 Signed-off-by: Lonnie Liu <[email protected]>
…specific source types (ray-project#49541) Signed-off-by: Nikita Vemuri <[email protected]>
no longer being used anywhere; all release tests are byod based now. Signed-off-by: Lonnie Liu <[email protected]>
This change runs the existing pre-commit hooks on all files in the CI. Other hooks will be enabled one by one in follow-up PRs, as enabling them will cause file changes. Other hooks that are not enabled in this PR: trailing-whitespace, end-of-file-fixer, check-json, python-no-log-warn, shellcheck, pretty-format-java. Once all hooks are enabled, they will be replaced by a single command, `pre-commit run --all-files --show-diff-on-failure`, ensuring that all future pre-commit hooks are run in CI. --------- Signed-off-by: Chi-Sheng Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
…ect#49753) Signed-off-by: Rui Qiao <[email protected]>
The Ray Data autoscaler isn't aware of operators' resource budgets, so it often makes suboptimal scheduling decisions. To address this issue, this PR exposes the budgets with the `get_budget` method. --------- Signed-off-by: Balaji Veeramani <[email protected]>
…ailure (ray-project#49884) The `read_images` benchmark sometimes fails with this issue: ray-project#49883. To work around this, this PR explicitly specifies the mode in the `read_images` call. --------- Signed-off-by: Balaji Veeramani <[email protected]>
…t#49878) ReportHead used to subscribe to DataSource.agents changes to maintain a connection to every one of the O(#nodes) agents, just to make gRPC calls when needed. This PR removes that; now it only makes on-demand connections when a profiling request comes in. Changes: - No longer listens to DataSource.agents. - On a profiling request, makes one RPC to InternalKV to get the IP and agent port, then creates a stub on the fly and uses it to send outbound requests. - Changed InternalKV: - key: `DASHBOARD_AGENT_PORT_PREFIX` -> `DASHBOARD_AGENT_ADDR_PREFIX` - value: `json.dumps([http_port, grpc_port])` -> `json.dumps([ip, http_port, grpc_port])` - Changed all profiling requests from taking param `ip` to taking param `node_id`. - Added a NodeID filter to the GcsClient API. - Moved updates of `DataSource.node_physical_stats` to NodeHead. After this PR, ReportHead no longer needs DataSource. Smoke tested the UI locally. --------- Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]> Co-authored-by: Jiajun Yao <[email protected]>
A variable that is defined but not used is likely a mistake, and should be removed to avoid confusion. ref: https://docs.astral.sh/ruff/rules/unused-variable/ --------- Signed-off-by: fscnick <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
…monsays1980/ray into offline-rl-enable-gpu-training-simple
Signed-off-by: simonsays1980 <[email protected]>
…a method. Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
Why are these changes needed?
GPU training in Offline RL was not implemented yet. This PR proposes multiple changes to enable GPU training in single- and multi-learner mode:

- … the `NumpyToTensor` connector in the learner connector pipeline in case of `num_gpus_per_learner > 0`.
- Fixes the device(s) the `TorchLearner` manages via `ray.rllib.utils.framework.get_devices` by adding a test for the `WORKER_MODE` - a local learner is not in `WORKER_MODE` and therefore will not get a GPU otherwise (note, this holds true for ALL RLlib algos).
- Modifies `Learner.update_from_batch_or_episodes` to test for `MultiAgentBatch`es that are not holding tensors yet - this is the offline local-learner GPU case.
- Modifies `Learner.update_from_iterator` to use:
  - An outer loop that takes care of iterator exhaustion in case the desired number of updates per RLlib iteration has not been run yet.
  - Metrics for `dataset_num_iters_per_learner` to keep an eye on desired vs. realized updates per RLlib iteration.
- Removes the `learner` parameter from the `OfflinePreLearner` ctor as it is no longer needed.
- Adds `module_spec` and `module_state` to the `OfflinePreLearner` ctor and adds corresponding logic to `Algorithm.build` to provide these two arguments during initialization.
- Uses `Learner.update_from_iterator`, which enables faster updates because each iterator is run without interruption. This includes changes to `LearnerGroup._update` to handle cases with a `list` batch that contains a single `DataIterator`.
- Modifies the `OfflineData.sample` calls in all Offline RL algos to use an iterator for updates as long as `dataset_num_iters_per_learner != 1`.
- … `BUILD`.

A minimal configuration sketch showing how these settings fit together follows the checklist at the end.

Related issue number
Closes #50053

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
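As referenced in the description above, a minimal configuration sketch for GPU training of an Offline RL algorithm (the data path and hyperparameter values are placeholders, and the config method signatures are assumptions based on the public `AlgorithmConfig` API rather than on this PR's diff):

from ray.rllib.algorithms.bc import BCConfig

config = (
    BCConfig()
    .environment(env="CartPole-v1")
    .offline_data(
        # Placeholder path to previously recorded offline data.
        input_="local:///tmp/cartpole/offline_data",
        # Updates each Learner should run per RLlib iteration; the new
        # metric reports desired vs. realized iterations.
        dataset_num_iters_per_learner=5,
    )
    .learners(
        # 0 = single local Learner; >0 switches to the multi-learner setup.
        num_learners=0,
        # The change in this PR is what makes this effective for Offline RL.
        num_gpus_per_learner=1,
    )
)

algo = config.build()
print(algo.train())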