Skip to content

Update skypilot, use async api #81

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from
Draft

Conversation

Kovbo
Copy link

@Kovbo Kovbo commented Apr 29, 2025

Skypilot switched to a new client-server architecture where all client methods are async. Now we have a SkyPilot client and a SkyPilot server running on the same machine. The client just starts jobs, and the server executes them in the background. If we want to see logs, we need to capture them manually. The old version does not support new features, so we have to update.

Also, we have updated the SkyPilot catalog format, which lists available GPUs and regions on RunPod. Since it is hosted as a separate package, it may not be compatible with SkyPilot v0.8.0. At least I have this issue on my computer.

@Kovbo Kovbo marked this pull request as draft April 29, 2025 03:33
@Kovbo Kovbo changed the title Update skypilot Update skypilot, use async api Apr 30, 2025
@Kovbo Kovbo marked this pull request as ready for review April 30, 2025 00:57
Copy link
Contributor

@arcticfly arcticfly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking on this update!

This may be an issue with the new skypilot version (or an unrelated update), but when you launch a cluster through the SkyPilotAPI, do you see messages like these?

WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: [Errno 64] Host is down
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: [Errno 64] Host is down
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: [Errno 64] Host is down
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: [Errno 64] Host is down
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.

)
if (
len(cluster_status) == 0
or cluster_status[0]["status"] != sky.ClusterStatus.UP
):
await self._launch_cluster(resources, art_version, env_path)
launch_job_id = await self._launch_cluster(resources, art_version, env_path)
await asyncio.to_thread(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to use the to_thread_typed helper for all our sync -> async conversions in this file. I realize that in the case of tail_logs, to_thread already properly returns the type, but I actually prefer the to_thread_typed syntax personally.

@@ -34,7 +36,7 @@ async def initialize_cluster(
) -> None:
self = cls.__new__(cls)
self._cluster_name = cluster_name

self._job_id = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename this to _art_server_job_id? We may have more tasks in the future, and I can imagine myself forgetting whether this id refers to the task of launching the cluster as a whole or the art server in particular.

@@ -163,9 +186,13 @@ async def _launch_cluster(
print(task)

try:
await to_thread_typed(
lambda: sky.launch(task=task, cluster_name=self._cluster_name)
job_id, handler = await to_thread_typed(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use handler anywhere?

)

return job_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning job_id here, would it make sense to start tailing logs within _launch_cluster? That would reduce the amount of code in the main initialize_cluster function, and seems like it should be part of the launching process.

uv.lock Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to run uv sync on a linux machine. We've seen weird bugs from syncing on macbooks then running the backend on a linux machine.

@Kovbo Kovbo marked this pull request as draft May 1, 2025 02:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants