feat: add /ready endpoint for proper Kubernetes readiness checks#1810
feat: add /ready endpoint for proper Kubernetes readiness checks#1810
Conversation
The existing /alive and /health endpoints return 200 immediately when the server process starts, but the server may still be initializing services (VSCode, desktop, tool preload, etc.). This caused issues where Kubernetes marked pods as 'ready' before they were actually ready to serve requests, leading to health check failures during the initialization window. Changes: - Add /ready endpoint that returns 503 until initialization is complete - Add mark_initialization_complete() to signal when services are ready - Call mark_initialization_complete() after all services have started The /ready endpoint should be used by Kubernetes readiness probes: - Returns 503 during initialization (server not ready for traffic) - Returns 200 after all services have initialized This allows the existing /alive endpoint to be used for liveness probes (is the process running?) while /ready is used for readiness probes (is the server ready to serve requests?). Co-authored-by: openhands <openhands@all-hands.dev>
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
There was a problem hiding this comment.
This PR adds a useful /ready endpoint for Kubernetes readiness checks, but there is a critical bug where the server marks itself as ready even if service initialization fails. See inline comments for details.
openhands-agent-server/openhands/agent_server/server_details_router.py
Outdated
Show resolved
Hide resolved
openhands-agent-server/openhands/agent_server/server_details_router.py
Outdated
Show resolved
Hide resolved
|
@OpenHands see the all-hands-bot review comments, and my responses to them. Fix according to the directions in my responses. |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
I've addressed all the review comments from all-hands-bot according to neubig's directions. Here's a summary: Changes Made (already pushed to
|
66411b8 to
2de0407
Compare
- Check asyncio.gather results for exceptions before calling mark_initialization_complete() - Raise RuntimeError to prevent server from starting if service initialization fails with exceptions - Remove unused is_ready() function Related issues created: - #1825: Consider using asyncio.Event() for thread-safe initialization state - #1826: Standardize response format across health check endpoints Co-authored-by: openhands <openhands@all-hands.dev>
2de0407 to
e653488
Compare
all-hands-bot
left a comment
There was a problem hiding this comment.
The implementation looks solid and follows Kubernetes best practices for separating liveness and readiness probes. Just a couple minor type hint suggestions for consistency.
openhands-agent-server/openhands/agent_server/server_details_router.py
Outdated
Show resolved
Hide resolved
openhands-agent-server/openhands/agent_server/server_details_router.py
Outdated
Show resolved
Hide resolved
…outer.py Co-authored-by: OpenHands Bot <contact@all-hands.dev>
…outer.py Co-authored-by: OpenHands Bot <contact@all-hands.dev>
|
[Automatic Post]: This PR seems to be currently waiting for review. @tofarr @simonrosenberg, could you please take a look when you have a chance? |
tofarr
left a comment
There was a problem hiding this comment.
I think we should also deprecate the existing methods given that they do not perform correctly
Problem
The existing
/aliveand/healthendpoints return 200 immediately when the server process starts, but the server may still be initializing services (VSCode, desktop, tool preload, etc.).This causes issues where Kubernetes marks pods as "ready" before the application is actually ready to serve requests, leading to:
Solution
Add a
/readyendpoint that returns:This follows the Kubernetes best practice of separating:
/alive,/health) - Is the process running?/ready) - Is the server ready to receive traffic?Changes
server_details_router.py_initialization_completeflag (defaultFalse)mark_initialization_complete()function to set the flagis_ready()function to check the flag/readyendpoint that checks initialization statusapi.pymark_initialization_completeUsage
This endpoint should be used by Kubernetes readiness probes:
Related PRs
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:bfc0125-pythonRun
All tags pushed for this build
About Multi-Architecture Support
bfc0125-python) is a multi-arch manifest supporting both amd64 and arm64bfc0125-python-amd64) are also available if needed