feat: add /ready endpoint for proper Kubernetes readiness checks #1810

Merged: neubig merged 7 commits into main from fix-readiness-endpoint on Feb 6, 2026

@neubig neubig commented Jan 24, 2026

Problem

The existing /alive and /health endpoints return 200 immediately when the server process starts, but the server may still be initializing services (VSCode, desktop, tool preload, etc.).

This causes issues where Kubernetes marks pods as "ready" before the application is actually ready to serve requests, leading to:

  • 503 errors during initial startup
  • 503 errors during container restarts (while new container initializes)

Solution

Add a /ready endpoint that returns:

  • 503 Service Unavailable during initialization
  • 200 OK after all services have finished initializing

This follows the Kubernetes best practice of separating:

  • Liveness probes (/alive, /health) - Is the process running?
  • Readiness probes (/ready) - Is the server ready to receive traffic?

Changes

server_details_router.py

  • Add _initialization_complete flag (default False)
  • Add mark_initialization_complete() function to set the flag
  • Add is_ready() function to check the flag
  • Add /ready endpoint that checks initialization status

api.py

  • Import mark_initialization_complete
  • Call it after all async initialization tasks complete
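
The flag-and-endpoint pattern listed above can be sketched with the standard library alone. The names `_initialization_complete` and `mark_initialization_complete()` come from this PR; `ready_response()`, its return shape, and the payload strings are assumptions for illustration and are not the actual route handler in server_details_router.py.

```python
from http import HTTPStatus

# Module-level flag; stays False until startup work has finished.
_initialization_complete = False


def mark_initialization_complete() -> None:
    """Flip the flag once all startup tasks have finished."""
    global _initialization_complete
    _initialization_complete = True


def ready_response() -> tuple[int, dict[str, str]]:
    """Hypothetical handler body: returns (status code, JSON-like payload)."""
    if _initialization_complete:
        return HTTPStatus.OK.value, {"status": "ready"}
    # Not ready yet: a readiness probe should treat this as a failure.
    return HTTPStatus.SERVICE_UNAVAILABLE.value, {"status": "initializing"}
```

In a real router the same check would sit inside the /ready route, with the status code set on the HTTP response rather than returned as a tuple.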

Usage

This endpoint should be used by Kubernetes readiness probes:

readinessProbe:
  httpGet:
    path: /ready
    port: 60000
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
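
For scripts or tests that need to wait on the same endpoint outside Kubernetes, a polling loop along these lines may be useful. This is a sketch only: `http_ready()` and `wait_until_ready()` are hypothetical helpers, and the 5-second default period simply mirrors `periodSeconds` in the probe above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 1.0) -> bool:
    """Return True only if GET url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers 503 responses (raised as HTTPError) and connection failures.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    period_seconds: float = 5.0,
    max_attempts: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() every period_seconds until it passes or attempts run out."""
    for attempt in range(max_attempts):
        if check():
            return True
        if attempt < max_attempts - 1:
            sleep(period_seconds)
    return False
```

A call might look like `wait_until_ready(lambda: http_ready("http://localhost:60000/ready"))`, where the host is an assumption and the port is taken from the probe definition.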

Related PRs


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bfc0125-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-bfc0125-python \
  ghcr.io/openhands/agent-server:bfc0125-python

All tags pushed for this build

ghcr.io/openhands/agent-server:bfc0125-golang-amd64
ghcr.io/openhands/agent-server:bfc0125-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:bfc0125-golang-arm64
ghcr.io/openhands/agent-server:bfc0125-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:bfc0125-java-amd64
ghcr.io/openhands/agent-server:bfc0125-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:bfc0125-java-arm64
ghcr.io/openhands/agent-server:bfc0125-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:bfc0125-python-amd64
ghcr.io/openhands/agent-server:bfc0125-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:bfc0125-python-arm64
ghcr.io/openhands/agent-server:bfc0125-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:bfc0125-golang
ghcr.io/openhands/agent-server:bfc0125-java
ghcr.io/openhands/agent-server:bfc0125-python

About Multi-Architecture Support

  • Each variant tag (e.g., bfc0125-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., bfc0125-python-amd64) are also available if needed

The existing /alive and /health endpoints return 200 immediately when the
server process starts, but the server may still be initializing services
(VSCode, desktop, tool preload, etc.).

This caused issues where Kubernetes marked pods as 'ready' before they
were actually ready to serve requests, leading to health check failures
during the initialization window.

Changes:
- Add /ready endpoint that returns 503 until initialization is complete
- Add mark_initialization_complete() to signal when services are ready
- Call mark_initialization_complete() after all services have started

The /ready endpoint should be used by Kubernetes readiness probes:
- Returns 503 during initialization (server not ready for traffic)
- Returns 200 after all services have initialized

This allows the existing /alive endpoint to be used for liveness probes
(is the process running?) while /ready is used for readiness probes
(is the server ready to serve requests?).

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Jan 24, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1810 at branch `fix-readiness-endpoint`

Feel free to include any additional details that might help me get this PR into a better state.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review January 24, 2026 05:36
@neubig neubig requested a review from tofarr January 24, 2026 05:36

@all-hands-bot all-hands-bot left a comment

This PR adds a useful /ready endpoint for Kubernetes readiness checks, but there is a critical bug where the server marks itself as ready even if service initialization fails. See inline comments for details.

neubig commented Jan 26, 2026

@OpenHands see the all-hands-bot review comments, and my responses to them. Fix according to the directions in my responses.

openhands-ai bot commented Jan 26, 2026

I'm on it! neubig can track my progress at all-hands.dev

openhands-ai bot commented Jan 26, 2026

I've addressed all the review comments from all-hands-bot according to neubig's directions. Here's a summary:

Changes Made (already pushed to fix-readiness-endpoint)

1. ✅ Fixed critical bug in api.py (neubig said: "Fix")

  • Now captures results from asyncio.gather() and checks for exceptions before calling mark_initialization_complete()
  • If any exceptions occurred during service initialization, the server remains in "initializing" state and logs the errors
  • This prevents the server from being marked as ready when initialization actually failed

2. ✅ Removed unused is_ready() function from server_details_router.py (neubig said: "Fix")

  • Removed the dead code that was never used

3. ✅ Created issue #1825 for asyncio.Event() suggestion (neubig said: "don't fix here but open an issue. Label it as 'good-first-issue'")

4. ✅ Created issue #1826 for response format inconsistency (neubig said: "Don't fix in this PR, but open an issue")

5. ❌ No action needed for async def suggestion (neubig said: "Deeming unnecessary for now, closing")

All pre-commit checks pass and changes have been pushed to the PR branch.

@neubig neubig force-pushed the fix-readiness-endpoint branch from 66411b8 to 2de0407 January 26, 2026 13:21
- Check asyncio.gather results for exceptions before calling
  mark_initialization_complete()
- Raise RuntimeError to prevent server from starting if service
  initialization fails with exceptions
- Remove unused is_ready() function

Related issues created:
- #1825: Consider using asyncio.Event() for thread-safe initialization state
- #1826: Standardize response format across health check endpoints

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig force-pushed the fix-readiness-endpoint branch from 2de0407 to e653488 January 26, 2026 13:26
@neubig neubig marked this pull request as ready for review January 26, 2026 13:28

@all-hands-bot all-hands-bot left a comment

The implementation looks solid and follows Kubernetes best practices for separating liveness and readiness probes. Just a couple minor type hint suggestions for consistency.

neubig and others added 2 commits January 26, 2026 08:32
…outer.py

Co-authored-by: OpenHands Bot <contact@all-hands.dev>
…outer.py

Co-authored-by: OpenHands Bot <contact@all-hands.dev>
@neubig neubig requested a review from simonrosenberg January 28, 2026 15:48
@all-hands-bot

[Automatic Post]: This PR seems to be currently waiting for review. @tofarr @simonrosenberg, could you please take a look when you have a chance?

@juanmichelini juanmichelini self-requested a review February 6, 2026 03:20

@tofarr tofarr left a comment

I think we should also deprecate the existing methods given that they do not perform correctly

@neubig neubig merged commit 15d87e6 into main Feb 6, 2026
30 checks passed
@neubig neubig deleted the fix-readiness-endpoint branch February 6, 2026 12:01