Avoid duplicated log expansion for sky serve logs
#8002
/smoke-test --serve
/quicktest-core
Michaelvll
left a comment
Great catch @zpoint! Can we add a smoke/unit test for this? Wondering how the regression was caused.
/smoke-test --serve --kubernetes -k test_skyserve_log_expansion_no_duplicates
When we run `sky serve logs xxx`, we expand `_SKYPILOT_PROVISION_LOG_CMD_PATTERN` with the actual provision log file in the replica launch log. But when a Kubernetes cluster needs to pull a Docker image, the replica launch log can contain hundreds of lines that match the pattern, because the matching line keeps being written while we wait for the image pull. This causes the log expansion to happen hundreds of times.
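The duplication described above can be illustrated with a minimal sketch that inlines the provision log only the first time a matching line is seen. This is not the code from this PR: `PROVISION_LOG_CMD_PATTERN` is a hypothetical stand-in for `_SKYPILOT_PROVISION_LOG_CMD_PATTERN`, and `expand_provision_log` and `read_provision_log` are illustrative names. Dropping the repeated hint lines entirely is also an assumption of this sketch.

```python
import re
from typing import Callable, Iterable, Iterator

# Hypothetical stand-in for _SKYPILOT_PROVISION_LOG_CMD_PATTERN; the real
# regex used by SkyPilot may differ.
PROVISION_LOG_CMD_PATTERN = re.compile(r'sky logs --provision (\S+)')


def expand_provision_log(
    launch_log_lines: Iterable[str],
    read_provision_log: Callable[[str], Iterable[str]],
) -> Iterator[str]:
    """Yield launch-log lines, inlining each provision log at most once."""
    already_expanded = set()
    for line in launch_log_lines:
        match = PROVISION_LOG_CMD_PATTERN.search(line)
        if match is None:
            # Ordinary line: pass it through unchanged.
            yield line
            continue
        cluster_name = match.group(1)
        if cluster_name in already_expanded:
            # Same hint line seen again (for example while waiting for an
            # image pull): skip it instead of expanding a second time.
            continue
        already_expanded.add(cluster_name)
        yield line
        yield from read_provision_log(cluster_name)


# Usage: 735 repeated hint lines produce a single expansion.
launch_log = (['Launching replica'] +
              ['hint: sky logs --provision my-replica'] * 735 + ['Done'])
expanded = list(
    expand_provision_log(launch_log,
                         lambda name: [f'[provision log of {name}]']))
print(expanded.count('[provision log of my-replica]'))  # 1
```

Keying the "already expanded" set on the captured cluster name, rather than on a single boolean, keeps the sketch correct if one launch log refers to more than one provision log.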
Then `sky serve logs xxx` streams a huge file (30M+) with the provision logs repeated 100+ times, making users think it's stuck in an infinite loop.

Reproduce
1: `sky serve up` with an image that does not exist in your k8s cluster

    cat config.yaml
    service:
      readiness_probe:
        path: /health
        initial_delay_seconds: 3600
      replicas: 2
    resources:
      image_id: docker:lmsysorg/sglang:v0.5.4
      # accelerators: H200:1
      infra: k8s
      ports: 8000
      # disk_size: 500
    run: |
      set -euo pipefail
      python3 -m sglang.launch_server \
        --model-path Qwen/Qwen3-0.6B \
        --host 0.0.0.0 \
        --port 8000 \
        2>&1 | tee ~/sglang.log &

    sky serve up config.yaml

    (sky) zpoint:~/codes/skypilot (master)$ sky status
    (sky) zpoint:~/codes/skypilot (master)$ sky status
    Enabled Infra: kubernetes/kind-skypilot, aws, azure

    Clusters
    NAME                           INFRA                       RESOURCES               STATUS  AUTOSTOP    LAUNCHED
    sky-serve-controller-838c6a5f  Kubernetes (kind-skypilot)  1x(cpus=4, mem=4, ...)  UP      10m (down)  15 mins ago

    Managed jobs
    No in-progress managed jobs. (See: sky jobs -h)

    Services
    NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
    sky-service-4528  -        -       NO_REPLICA  0/2       http://localhost:30090/skypilot/default/sky-serve-controller-838c6a5f-838c6a5f/30001

    Service Replicas
    SERVICE_NAME      ID  VERSION  ENDPOINT                                                                  LAUNCHED     INFRA                       RESOURCES               STATUS
    sky-service-4528  1   1        http://localhost:30090/skypilot/default/sky-service-4528-1-838c6a5f/8000  48 secs ago  Kubernetes (kind-skypilot)  1x(cpus=2, mem=2, ...)  STARTING
    sky-service-4528  2   1        http://localhost:30090/skypilot/default/sky-service-4528-2-838c6a5f/8000  45 secs ago  Kubernetes (kind-skypilot)  1x(cpus=2, mem=2, ...)  STARTING

    * To see detailed service status: sky serve status -v
    * 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh
    (sky) zpoint:~/codes/skypilot (master)$
    (sky) zpoint:~/codes/skypilot (master)$

2: ssh into the controller and see the replica launch logs
We found 735 repeated `sky logs --provision` lines here; on a slow network, this could be 1000+ lines.

3: Repeated sky serve log
Whether you run `sky serve logs sky-service-4528 1 --sync-down`, `sky serve logs sky-service-4528 1`, or `sky serve logs sky-service-4528 1 --no-follow`, you will see the provision logs repeated N times; in my case N is 735.

So the problem is that `sky serve logs` expands the `sky logs --provision` line with the provision log file, and `rich_util` makes this line appear on multiple lines while waiting for the docker pull, causing the provision log to be inlined N times in the final result.

Test
Manually switched to this branch and ran the steps again: the output contains only 1 provision log rather than 700+.
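The manual check can be scripted roughly as follows. This is a sketch under assumptions, not the smoke test added in this PR: `fetch_replica_logs` and `count_marker` are illustrative helpers, and `marker` must be a string you know appears exactly once in a single provision log (no such string is given in this description).

```python
import subprocess
from typing import List


def fetch_replica_logs(service_name: str, replica_id: int) -> List[str]:
    """Capture the output of `sky serve logs <service> <id> --no-follow`."""
    result = subprocess.run(
        ['sky', 'serve', 'logs', service_name, str(replica_id),
         '--no-follow'],
        capture_output=True,
        text=True,
        check=False)
    return result.stdout.splitlines()


def count_marker(lines: List[str], marker: str) -> int:
    """Count the lines that contain the given marker string."""
    return sum(1 for line in lines if marker in line)
```

With a suitable marker, `count_marker(fetch_replica_logs('sky-service-4528', 1), marker)` would be expected to return 1 on this branch and a much larger number (735 in the reproduction above) before the fix.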
Tested (run the relevant ones):
`bash format.sh`
`/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
`/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
`/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)