
Commit cb1d04a

d4l3k authored and facebook-github-bot committed
ray_scheduler: workspace + fixed no role logging (#492)
Summary: This updates Ray to have proper workspace support. * `-c working_dir=...` is deprecated in favor of `torchx run --workspace=...` * `-c requirements=...` is optional and requirements.txt will be automatically read from the workspace if present * `torchx log ray://foo/bar` works without requiring `/ray/0` Pull Request resolved: #492 Test Plan: (torchx) tristanr@tristanr-arch2 ~/D/t/e/ray (ray)> torchx run -s ray --wait --log dist.ddp --env LOGLEVEL=INFO -j 2x1 -m scripts.compute_world_size torchx 2022-05-18 16:55:31 INFO Checking for changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray`... torchx 2022-05-18 16:55:31 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically. torchx 2022-05-18 16:55:31 INFO Built new image `/tmp/torchx_workspacebe6331jv` based on original image `ghcr.io/pytorch/torchx:0.2.0dev0` and changes in workspace `file:///home/tristanr/Developer/torch rec/examples/ray` for role[0]=compute_world_size. torchx 2022-05-18 16:55:31 WARNING The Ray scheduler does not support port mapping. torchx 2022-05-18 16:55:31 INFO Uploading package gcs://_ray_pkg_63a39f7096dfa0bd.zip. torchx 2022-05-18 16:55:31 INFO Creating a file package for local directory '/tmp/torchx_workspacebe6331jv'. ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td torchx 2022-05-18 16:55:31 INFO Launched app: ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td torchx 2022-05-18 16:55:31 INFO AppStatus: msg: PENDING num_restarts: -1 roles: - replicas: - hostname: <NONE> id: 0 role: ray state: !!python/object/apply:torchx.specs.api.AppState - 2 structured_error_msg: <NONE> role: ray state: PENDING (2) structured_error_msg: <NONE> ui_url: null torchx 2022-05-18 16:55:31 INFO Job URL: None torchx 2022-05-18 16:55:31 INFO Waiting for the app to finish... torchx 2022-05-18 16:55:31 INFO Waiting for app to start before logging... torchx 2022-05-18 16:55:43 INFO Job finished: SUCCEEDED (torchx) tristanr@tristanr-arch2 ~/D/t/e/ray (ray)> torchx log ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td ray/0 Waiting for placement group to start. 
ray/0 running ray.wait on [ObjectRef(8f2664c081ffc268e1c4275021ead9801a8d33861a00000001000000), ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)] ray/0 (CommandActor pid=494377) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: ray/0 (CommandActor pid=494377) entrypoint : scripts.compute_world_size ray/0 (CommandActor pid=494377) min_nodes : 2 ray/0 (CommandActor pid=494377) max_nodes : 2 ray/0 (CommandActor pid=494377) nproc_per_node : 1 ray/0 (CommandActor pid=494377) run_id : compute_world_size-mpr03nzqvvg3td ray/0 (CommandActor pid=494377) rdzv_backend : c10d ray/0 (CommandActor pid=494377) rdzv_endpoint : localhost:29500 ray/0 (CommandActor pid=494377) rdzv_configs : {'timeout': 900} ray/0 (CommandActor pid=494377) max_restarts : 0 ray/0 (CommandActor pid=494377) monitor_interval : 5 ray/0 (CommandActor pid=494377) log_dir : None ray/0 (CommandActor pid=494377) metrics_cfg : {} ray/0 (CommandActor pid=494377) ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group ray/0 (CommandActor pid=494406) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs: ray/0 (CommandActor pid=494406) entrypoint : scripts.compute_world_size ray/0 (CommandActor pid=494406) min_nodes : 2 ray/0 (CommandActor pid=494406) max_nodes : 2 ray/0 (CommandActor pid=494406) nproc_per_node : 1 ray/0 (CommandActor pid=494406) run_id : compute_world_size-mpr03nzqvvg3td ray/0 (CommandActor pid=494406) rdzv_backend : c10d ray/0 (CommandActor pid=494406) rdzv_endpoint : 172.26.20.254:29500 ray/0 (CommandActor pid=494406) rdzv_configs : {'timeout': 900} ray/0 (CommandActor pid=494406) max_restarts : 0 ray/0 (CommandActor pid=494406) monitor_interval : 5 ray/0 (CommandActor pid=494406) log_dir : None ray/0 (CommandActor pid=494406) metrics_cfg : {} ray/0 (CommandActor pid=494406) ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. 
Result: ray/0 (CommandActor pid=494377) restart_count=0 ray/0 (CommandActor pid=494377) master_addr=tristanr-arch2 ray/0 (CommandActor pid=494377) master_port=48089 ray/0 (CommandActor pid=494377) group_rank=1 ray/0 (CommandActor pid=494377) group_world_size=2 ray/0 (CommandActor pid=494377) local_ranks=[0] ray/0 (CommandActor pid=494377) role_ranks=[1] ray/0 (CommandActor pid=494377) global_ranks=[1] ray/0 (CommandActor pid=494377) role_world_sizes=[2] ray/0 (CommandActor pid=494377) global_world_sizes=[2] ray/0 (CommandActor pid=494377) ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t/attempt_0/0/error.json ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result: ray/0 (CommandActor pid=494406) restart_count=0 ray/0 (CommandActor pid=494406) master_addr=tristanr-arch2 ray/0 (CommandActor pid=494406) master_port=48089 ray/0 (CommandActor pid=494406) group_rank=0 ray/0 (CommandActor pid=494406) group_world_size=2 ray/0 (CommandActor pid=494406) local_ranks=[0] ray/0 (CommandActor pid=494406) role_ranks=[0] ray/0 (CommandActor pid=494406) global_ranks=[0] ray/0 (CommandActor pid=494406) role_world_sizes=[2] ray/0 (CommandActor pid=494406) global_world_sizes=[2] ray/0 (CommandActor pid=494406) ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p/attempt_0/0/error.json ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish. ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.000942230224609375 seconds ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish. ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0013003349304199219 seconds ray/0 (CommandActor pid=494377) [0]:initializing `gloo` process group ray/0 (CommandActor pid=494377) [0]:successfully initialized process group ray/0 (CommandActor pid=494377) [0]:rank: 1, actual world_size: 2, computed world_size: 2 ray/0 (CommandActor pid=494406) [0]:initializing `gloo` process group ray/0 (CommandActor pid=494406) [0]:successfully initialized process group ray/0 (CommandActor pid=494406) [0]:rank: 0, actual world_size: 2, computed world_size: 2 ray/0 running ray.wait on [ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)] Reviewed By: kiukchung, msaroufim Differential Revision: D36500237 Pulled By: d4l3k fbshipit-source-id: 9ecf85b7860a7220262f0146890012cc88630cd2
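For reference, here is a minimal programmatic sketch of the flow the test plan exercises from the CLI. The `ddp` component arguments and the `get_runner()`/`run()` calls below follow the public TorchX runner API as commonly documented and are illustrative, not part of this commit:

import os

from torchx.components.dist import ddp
from torchx.runner import get_runner

# Roughly the same component the CLI builds with:
#   dist.ddp --env LOGLEVEL=INFO -j 2x1 -m scripts.compute_world_size
app = ddp(m="scripts.compute_world_size", j="2x1", env={"LOGLEVEL": "INFO"})

runner = get_runner()
# The workspace replaces the deprecated `-c working_dir=...` option; a
# requirements.txt at the workspace root is picked up automatically.
app_handle = runner.run(
    app,
    scheduler="ray",
    cfg={"dashboard_address": "127.0.0.1:8265"},
    workspace=f"file://{os.getcwd()}",
)
print(app_handle)  # e.g. ray://torchx/127.0.0.1:8265-<app_id>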
1 parent d0392f1 · commit cb1d04a

File tree

8 files changed (+153, -51 lines)


.torchxignore

+1

@@ -0,0 +1 @@
+.dockerignore

scripts/component_integration_tests.py

+3 -1

@@ -101,8 +101,9 @@ def main() -> None:
             ],
             "image": torchx_image,
             "cfg": {
-                "working_dir": ".",
+                "requirements": "",
             },
+            "workspace": f"file://{os.getcwd()}",
         },
     }

@@ -115,6 +116,7 @@ def main() -> None:
         image=params["image"],
         cfg=params["cfg"],
         dryrun=dryrun,
+        workspace=params.get("workspace"),
     )


torchx/components/integration_tests/integ_tests.py

+5 -2

@@ -10,7 +10,7 @@
 from dataclasses import asdict
 from json import dumps
 from types import ModuleType
-from typing import Callable, cast, Dict, List, Type
+from typing import Callable, cast, Dict, List, Optional, Type

 from pyre_extensions import none_throws
 from torchx.cli.cmd_log import get_logs
@@ -45,6 +45,7 @@ def run_components(
     scheduler: str,
     cfg: Dict[str, CfgVal],
     dryrun: bool = False,
+    workspace: Optional[str] = None,
 ) -> None:
     component_providers = [
         cast(Callable[..., ComponentProvider], cls)(scheduler, image)
@@ -68,7 +69,9 @@ def run_components(
         log.info(f"Submitting AppDef... (dryrun={dryrun})")
         # get the dryrun info to log the scheduler request
         # then use the schedule (intead of the run API) for job submission
-        dryrun_info = runner.dryrun(app_def, scheduler, cfg=cfg)
+        dryrun_info = runner.dryrun(
+            app_def, scheduler, cfg=cfg, workspace=workspace
+        )
         log.info(f"\nAppDef:\n{dumps(asdict(app_def), indent=4)}")
         log.info(f"\nScheduler Request:\n{dryrun_info}")
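The integration-test change above just threads an optional workspace through to the runner's dryrun call. A minimal sketch of that call path, assuming an already-built `AppDef` and the public `Runner.dryrun()`/`Runner.schedule()` API; the helper name here is illustrative:

from typing import Optional

from torchx.runner import get_runner
from torchx.specs import AppDef


def submit_with_workspace(app_def: AppDef, scheduler: str, workspace: Optional[str] = None) -> str:
    runner = get_runner()
    # dryrun materializes the scheduler request (patching the image from the
    # workspace when one is given) without submitting, so it can be logged first.
    dryrun_info = runner.dryrun(app_def, scheduler, cfg={}, workspace=workspace)
    print(dryrun_info)
    # schedule() then submits the already-materialized request.
    return runner.schedule(dryrun_info)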

torchx/schedulers/ray_scheduler.py

+62 -40

@@ -8,24 +8,26 @@
 import json
 import logging
 import os
+import tempfile
 import time
 from dataclasses import dataclass, field
 from datetime import datetime
-from shutil import copy2, copytree, rmtree
-from tempfile import mkdtemp
-from typing import Any, cast, Dict, List, Mapping, Optional, Set, Type  # noqa
+from shutil import copy2, rmtree
+from typing import Any, cast, Dict, Iterable, List, Mapping, Optional, Set, Type  # noqa

 from torchx.schedulers.api import (
     AppDryRunInfo,
     AppState,
     DescribeAppResponse,
+    filter_regex,
     Scheduler,
     split_lines,
     Stream,
 )
 from torchx.schedulers.ids import make_unique
 from torchx.schedulers.ray.ray_common import RayActor, TORCHX_RANK0_HOST
 from torchx.specs import AppDef, macros, NONE, ReplicaStatus, Role, RoleStatus, runopts
+from torchx.workspace.dir_workspace import TmpDirWorkspace
 from typing_extensions import TypedDict


@@ -92,7 +94,7 @@ class RayJob:
         dashboard_address:
             The existing dashboard IP address to connect to
         working_dir:
-            The working directory to copy to the cluster
+            The working directory to copy to the cluster
         requirements:
             The libraries to install on the cluster per requirements.txt
         actors:
@@ -102,15 +104,24 @@ class RayJob:
     """

     app_id: str
+    working_dir: str
     cluster_config_file: Optional[str] = None
     cluster_name: Optional[str] = None
     dashboard_address: Optional[str] = None
-    working_dir: Optional[str] = None
     requirements: Optional[str] = None
     actors: List[RayActor] = field(default_factory=list)

-class RayScheduler(Scheduler[RayOpts]):
+class RayScheduler(Scheduler[RayOpts], TmpDirWorkspace):
     """
+    RayScheduler is a TorchX scheduling interface to Ray. The job def
+    workers will be launched as Ray actors
+
+    The job environment is specified by the TorchX workspace. Any files in
+    the workspace will be present in the Ray job unless specified in
+    ``.torchxignore``. Python dependencies will be read from the
+    ``requirements.txt`` file located at the root of the workspace unless
+    it's overridden via ``-c ...,requirements=foo/requirements.txt``.
+
     **Config Options**

     .. runopts::
@@ -122,12 +133,15 @@ class RayScheduler(Scheduler[RayOpts]):
         type: scheduler
         features:
             cancel: true
-            logs: true
+            logs: |
+                Partial support. Ray only supports a single log stream so
+                only a dummy "ray/0" combined log role is supported.
+                Tailing and time seeking are not supported.
             distributed: true
             describe: |
                 Partial support. RayScheduler will return job status but
                 does not provide the complete original AppSpec.
-            workspaces: false
+            workspaces: true
             mounts: false

     """
@@ -156,11 +170,6 @@ def run_opts(self) -> runopts:
             default="127.0.0.1:8265",
             help="Use ray status to get the dashboard address you will submit jobs against",
         )
-        opts.add(
-            "working_dir",
-            type_=str,
-            help="Copy the the working directory containing the Python scripts to the cluster.",
-        )
         opts.add("requirements", type_=str, help="Path to requirements.txt")
         return opts

@@ -169,7 +178,7 @@ def schedule(self, dryrun_info: AppDryRunInfo[RayJob]) -> str:

         # Create serialized actors for ray_driver.py
         actors = cfg.actors
-        dirpath = mkdtemp()
+        dirpath = cfg.working_dir
         serialize(actors, dirpath)

         job_submission_addr: str = ""
@@ -189,41 +198,46 @@ def schedule(self, dryrun_info: AppDryRunInfo[RayJob]) -> str:
             f"http://{job_submission_addr}"
         )

-        # 1. Copy working directory
-        if cfg.working_dir:
-            copytree(cfg.working_dir, dirpath, dirs_exist_ok=True)
-
-        # 2. Copy Ray driver utilities
+        # 1. Copy Ray driver utilities
         current_directory = os.path.dirname(os.path.abspath(__file__))
         copy2(os.path.join(current_directory, "ray", "ray_driver.py"), dirpath)
         copy2(os.path.join(current_directory, "ray", "ray_common.py"), dirpath)

-        # 3. Parse requirements.txt
-        reqs: List[str] = []
-        if cfg.requirements:  # pragma: no cover
-            with open(cfg.requirements) as f:
-                for line in f:
-                    reqs.append(line.strip())
+        runtime_env = {"working_dir": dirpath}
+        if cfg.requirements:
+            runtime_env["pip"] = cfg.requirements

-        # 4. Submit Job via the Ray Job Submission API
+        # 1. Submit Job via the Ray Job Submission API
         try:
             job_id: str = client.submit_job(
                 job_id=cfg.app_id,
                 # we will pack, hash, zip, upload, register working_dir in GCS of ray cluster
                 # and use it to configure your job execution.
                 entrypoint="python3 ray_driver.py",
-                runtime_env={"working_dir": dirpath, "pip": reqs},
+                runtime_env=runtime_env,
             )

         finally:
-            rmtree(dirpath)
+            if dirpath.startswith(tempfile.gettempdir()):
+                rmtree(dirpath)

         # Encode job submission client in job_id
         return f"{job_submission_addr}-{job_id}"

     def _submit_dryrun(self, app: AppDef, cfg: RayOpts) -> AppDryRunInfo[RayJob]:
         app_id = make_unique(app.name)
-        requirements = cfg.get("requirements")
+
+        working_dir = app.roles[0].image
+        if not os.path.exists(working_dir):
+            raise RuntimeError(
+                f"Role image must be a valid directory, got: {working_dir} "
+            )
+
+        requirements: Optional[str] = cfg.get("requirements")
+        if requirements is None:
+            workspace_reqs = os.path.join(working_dir, "requirements.txt")
+            if os.path.exists(workspace_reqs):
+                requirements = workspace_reqs

         cluster_cfg = cfg.get("cluster_config_file")
         if cluster_cfg:
@@ -234,8 +248,9 @@ def _submit_dryrun(self, app: AppDef, cfg: RayOpts) -> AppDryRunInfo[RayJob]:

             job: RayJob = RayJob(
                 app_id,
-                cluster_cfg,
+                cluster_config_file=cluster_cfg,
                 requirements=requirements,
+                working_dir=working_dir,
             )

         else:  # pragma: no cover
@@ -244,9 +259,9 @@ def _submit_dryrun(self, app: AppDef, cfg: RayOpts) -> AppDryRunInfo[RayJob]:
                 app_id=app_id,
                 dashboard_address=dashboard_address,
                 requirements=requirements,
+                working_dir=working_dir,
             )
         job.cluster_name = cfg.get("cluster_name")
-        job.working_dir = cfg.get("working_dir")

         for role in app.roles:
             for replica_id in range(role.num_replicas):
@@ -298,12 +313,10 @@ def wait_until_finish(self, app_id: str, timeout: int = 30) -> None:
         with a given timeout. This is intended for testing. Programmatic
         usage should use the runner wait method instead.
         """
-        addr, _, app_id = app_id.partition("-")

-        client = JobSubmissionClient(f"http://{addr}")
         start = time.time()
         while time.time() - start <= timeout:
-            status_info = client.get_job_status(app_id)
+            status_info = self._get_job_status(app_id)
             status = status_info
             if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
                 break
@@ -314,12 +327,18 @@ def _cancel_existing(self, app_id: str) -> None:  # pragma: no cover
         client = JobSubmissionClient(f"http://{addr}")
         client.stop_job(app_id)

-    def describe(self, app_id: str) -> Optional[DescribeAppResponse]:
+    def _get_job_status(self, app_id: str) -> JobStatus:
         addr, _, app_id = app_id.partition("-")
         client = JobSubmissionClient(f"http://{addr}")
-        job_status_info = client.get_job_status(app_id)
+        status = client.get_job_status(app_id)
+        if isinstance(status, str):
+            return cast(JobStatus, status)
+        return status.status
+
+    def describe(self, app_id: str) -> Optional[DescribeAppResponse]:
+        job_status_info = self._get_job_status(app_id)
         state = _ray_status_to_torchx_appstate[job_status_info]
-        roles = [Role(name="ray", num_replicas=-1, image="<N/A>")]
+        roles = [Role(name="ray", num_replicas=1, image="<N/A>")]

         # get ip_address and put it in hostname

@@ -354,12 +373,15 @@ def log_iter(
         until: Optional[datetime] = None,
         should_tail: bool = False,
         streams: Optional[Stream] = None,
-    ) -> List[str]:
-        # TODO: support regex, tailing, streams etc..
+    ) -> Iterable[str]:
+        # TODO: support tailing, streams etc..
         addr, _, app_id = app_id.partition("-")
         client: JobSubmissionClient = JobSubmissionClient(f"http://{addr}")
         logs: str = client.get_job_logs(app_id)
-        return split_lines(logs)
+        iterator = split_lines(logs)
+        if regex:
+            return filter_regex(regex, iterator)
+        return iterator

 def create_scheduler(session_name: str, **kwargs: Any) -> RayScheduler:
     if not has_ray():  # pragma: no cover
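The net effect of the scheduler changes above: the workspace directory (the role image) becomes Ray's `working_dir`, and `requirements.txt` is resolved from it when `-c requirements=...` is not given. A standalone sketch of that resolution, simplified from the diff; the helper name is illustrative:

import os
from typing import Any, Dict, Optional


def build_ray_runtime_env(working_dir: str, requirements: Optional[str] = None) -> Dict[str, Any]:
    # The role image must be a local directory (the built workspace).
    if not os.path.exists(working_dir):
        raise RuntimeError(f"Role image must be a valid directory, got: {working_dir}")

    # Fall back to a requirements.txt at the root of the workspace.
    if requirements is None:
        workspace_reqs = os.path.join(working_dir, "requirements.txt")
        if os.path.exists(workspace_reqs):
            requirements = workspace_reqs

    runtime_env: Dict[str, Any] = {"working_dir": working_dir}
    if requirements:
        runtime_env["pip"] = requirements
    return runtime_env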

torchx/schedulers/test/.gitignore

+2

@@ -1 +1,3 @@
 actors.json
+ray_common.py
+ray_driver.py
