-
Notifications
You must be signed in to change notification settings - Fork 20
Update skypilot, use async api #81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking on this update!
This may be an issue with the new skypilot version (or an unrelated update), but when you launch a cluster through the SkyPilotAPI, do you see messages like these?
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: [Errno 64] Host is down
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: [Errno 64] Host is down
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: [Errno 64] Host is down
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: [Errno 64] Host is down
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
) | ||
if ( | ||
len(cluster_status) == 0 | ||
or cluster_status[0]["status"] != sky.ClusterStatus.UP | ||
): | ||
await self._launch_cluster(resources, art_version, env_path) | ||
launch_job_id = await self._launch_cluster(resources, art_version, env_path) | ||
await asyncio.to_thread( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to use the to_thread_typed
helper for all our sync -> async conversions in this file. I realize that in the case of tail_logs
, to_thread
already properly returns the type, but I actually prefer the to_thread_typed
syntax personally.
@@ -34,7 +36,7 @@ async def initialize_cluster( | |||
) -> None: | |||
self = cls.__new__(cls) | |||
self._cluster_name = cluster_name | |||
|
|||
self._job_id = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rename this to _art_server_job_id
? We may have more tasks in the future, and I can imagine myself forgetting whether this id refers to the task of launching the cluster as a whole or the art server in particular.
@@ -163,9 +186,13 @@ async def _launch_cluster( | |||
print(task) | |||
|
|||
try: | |||
await to_thread_typed( | |||
lambda: sky.launch(task=task, cluster_name=self._cluster_name) | |||
job_id, handler = await to_thread_typed( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we use handler anywhere?
) | ||
|
||
return job_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of returning job_id here, would it make sense to start tailing logs within _launch_cluster? That would reduce the amount of code in the main initialize_cluster function, and seems like it should be part of the launching process.
uv.lock
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to run uv sync
on a linux machine. We've seen weird bugs from syncing on macbooks then running the backend on a linux machine.
Skypilot switched to a new client-server architecture where all client methods are async. Now we have a SkyPilot client and a SkyPilot server running on the same machine. The client just starts jobs, and the server executes them in the background. If we want to see logs, we need to capture them manually. The old version does not support new features, so we have to update.
Also, we have updated the SkyPilot catalog format, which lists available GPUs and regions on RunPod. Since it is hosted as a separate package, it may not be compatible with SkyPilot v0.8.0. At least I have this issue on my computer.