Skip to content

Conversation

@zpoint
Copy link
Collaborator

@zpoint zpoint commented Nov 18, 2025

When we run sky serve logs xxx, we expand _SKYPILOT_PROVISION_LOG_CMD_PATTERN with the actual provision log file in the replica launch log. But when a Kubernetes cluster needs to pull a Docker image, there could be hundreds of lines like:

<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>

that match the pattern because we're waiting for the image pull. This causes the log expansion to happen hundreds of times.

Then sky serve logs xxx streams a huge file (30M+) with the provision logs repeated 100+ times, making users think it's stuck in an infinite loop.

Reproduce

1: sky serve up with an image thats not exists in your k8s clusters

cat config.yaml
service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 3600
  replicas: 2

resources:
  image_id: docker:lmsysorg/sglang:v0.5.4
  # accelerators: H200:1
  infra: k8s
  ports: 8000
  # disk_size: 500

run: |
  set -euo pipefail
  python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-0.6B \
    --host 0.0.0.0 \
    --port 8000 \
    2>&1 | tee ~/sglang.log &

sky serve up config.yaml
(sky) zpoint:~/codes/skypilot (master)$ sky status
(sky) zpoint:~/codes/skypilot (master)$ sky status
Enabled Infra: kubernetes/kind-skypilot, aws, azure

Clusters
NAME                           INFRA                       RESOURCES               STATUS  AUTOSTOP    LAUNCHED     
sky-serve-controller-838c6a5f  Kubernetes (kind-skypilot)  1x(cpus=4, mem=4, ...)  UP      10m (down)  15 mins ago  

Managed jobs
No in-progress managed jobs. (See: sky jobs -h)

Services
NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT                                                                              
sky-service-4528  -        -       NO_REPLICA  0/2       http://localhost:30090/skypilot/default/sky-serve-controller-838c6a5f-838c6a5f/30001  

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                                                                  LAUNCHED     INFRA                       RESOURCES               STATUS    
sky-service-4528  1   1        http://localhost:30090/skypilot/default/sky-service-4528-1-838c6a5f/8000  48 secs ago  Kubernetes (kind-skypilot)  1x(cpus=2, mem=2, ...)  STARTING  
sky-service-4528  2   1        http://localhost:30090/skypilot/default/sky-service-4528-2-838c6a5f/8000  45 secs ago  Kubernetes (kind-skypilot)  1x(cpus=2, mem=2, ...)  STARTING  

* To see detailed service status: sky serve status -v
* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh
(sky) zpoint:~/codes/skypilot (master)$ 
(sky) zpoint:~/codes/skypilot (master)$ 

2: ssh into the controller and see the replica launch logs

ssh sky-serve-controller-838c6a5f
(base) sky@sky-serve-controller-838c6a5f-838c6a5f-head:~$ head -30 ~/.sky/serve/sky_service_4528/replica_1_launch.log
I 11-18 10:23:02 replica_managers.py:92] Launching replica (id: 1) cluster sky-service-4528-1 with resources: {Kubernetes(image_id=docker:lmsysorg/sglang:v0.5.4, ports=['8000'])}
<sky-payload>"<rich_init>[bold cyan]Optimizing[/]</rich_init>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Optimizing[/]</rich_update>"</sky-payload>
<sky-payload>"<rich_start>[bold cyan]Optimizing[/]</rich_start>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Checking infra choices...[/]</rich_update>"</sky-payload>
<sky-payload>"<rich_start>[bold cyan]Optimizing[/]</rich_start>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Optimizing[/]</rich_update>"</sky-payload>
I 11-18 10:23:15 optimizer.py:770] Estimated cost: $0.0 / hour
I 11-18 10:23:15 optimizer.py:770] 
I 11-18 10:23:15 optimizer.py:918] Considered resources (1 node):
I 11-18 10:23:15 optimizer.py:986] --------------------------------------------------------------------------------------
I 11-18 10:23:15 optimizer.py:986]  INFRA                        INSTANCE   vCPUs   Mem(GB)   GPUS   COST ($)   CHOSEN   
I 11-18 10:23:15 optimizer.py:986] --------------------------------------------------------------------------------------
I 11-18 10:23:15 optimizer.py:986]  Kubernetes (kind-skypilot)   -          2       2         -      0.00          ✔     
I 11-18 10:23:15 optimizer.py:986] --------------------------------------------------------------------------------------
<sky-payload>"<rich_exit></rich_exit>"</sky-payload>
<sky-payload>"<rich_init>[bold cyan]Launching[/]</rich_init>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching[/]</rich_update>"</sky-payload>
<sky-payload>"<rich_start>[bold cyan]Launching[/]</rich_start>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching[/]</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
I 11-18 10:23:18 cloud_vm_ray_backend.py:1745] ⚙︎ Launching on Kubernetes.
<sky-payload>"<rich_update>[bold cyan]Launching[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
<sky-payload>"<rich_update>[bold cyan]Launching (1 pod(s) pending due to Pulling)[/]  \u001b[2mView logs: \u001b[1msky logs --provision sky-service-4528-1\u001b[0m\u001b[0m</rich_update>"</sky-payload>
...

(base) sky@sky-serve-controller-838c6a5f-838c6a5f-head:~$ grep -c "sky logs --provision" ~/.sky/serve/sky_service_4528/replica_1_launch.log
735

We found 735 repeated sky logs --provision here, in slow network, this could be 1000+ lines

3: Repeated sky serve log

No matter you run sky serve logs sky-service-4528 1 --sync-down / sky serve logs sky-service-4528 1 / sky serve logs sky-service-4528 1 --no-follow, you will see the provision logs repeats N times, in my case the N is 735

So the problem is that sky serve logs expands the sky logs --provision line with the provision log file, and the rich_util makes this line appear on multiple lines when waiting for docker pull, causing the provision to be redirected N times in the final result.

sky serve logs sky-service-4528 1 --sync-down

(sky) zpoint:~/codes/skypilot (master)$ grep -c "==================== Provisioning ====================" /home/zpoint/sky_logs/service/sky-service-4528_sky-2025-11-18-18-40-12-312695/replica-1.log 
735

Test

Manually switch to this branch and run again, only 1 provision log rather than 700+

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@zpoint zpoint marked this pull request as draft November 18, 2025 10:09
@zpoint
Copy link
Collaborator Author

zpoint commented Nov 18, 2025

/smoke-test --serve
/smoke-test --serve --kubernetes

@zpoint zpoint marked this pull request as ready for review November 18, 2025 10:49
@zpoint
Copy link
Collaborator Author

zpoint commented Nov 18, 2025

/smoke-test --serve
/smoke-test --serve --kubernetes

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 18, 2025

/quicktest-core

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch @zpoint! Can we add a smoke/unit test for this? Wondering how the regression was caused.

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 19, 2025

/smoke-test --serve --kubernetes -k test_skyserve_log_expansion_no_duplicates

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 19, 2025

/smoke-test --serve --kubernetes -k test_skyserve_log_expansion_no_duplicates

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 19, 2025

/smoke-test --serve --kubernetes -k test_skyserve_log_expansion_no_duplicates
/smoke-test --serve --kubernetes

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 19, 2025

/smoke-test --serve --kubernetes -k test_skyserve_log_expansion_no_duplicates

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 19, 2025

/smoke-test --serve --kubernetes -k test_skyserve_log_expansion_no_duplicates

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 19, 2025

/smoke-test --serve --kubernetes -k test_skyserve_log_expansion_no_duplicates

@zpoint zpoint merged commit fa4839c into skypilot-org:master Nov 19, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants