feat: add /ready endpoint for proper Kubernetes readiness checks by neubig · Pull Request #1810 · OpenHands/software-agent-sdk

neubig · 2026-01-24T04:45:40Z

Problem

The existing /alive and /health endpoints return 200 immediately when the server process starts, but the server may still be initializing services (VSCode, desktop, tool preload, etc.).

This causes issues where Kubernetes marks pods as "ready" before the application is actually ready to serve requests, leading to:

- 503 errors during initial startup
- 503 errors during container restarts (while the new container initializes)

Solution

Add a /ready endpoint that returns:

- 503 Service Unavailable during initialization
- 200 OK after all services have finished initializing

This follows the Kubernetes best practice of separating:

- Liveness probes (/alive, /health) - Is the process running?
- Readiness probes (/ready) - Is the server ready to receive traffic?
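To make the split concrete, here is a minimal sketch using only the Python standard library. It is not the code from this PR: `ProbeHandler`, `initialized`, `status_of`, and the JSON payloads are assumptions made for the demo, and a `threading.Event` stands in for whatever state the real server tracks.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical shared state: set once startup work has finished.
initialized = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in ("/alive", "/health"):
            # Liveness: answering at all shows the process is running.
            self._reply(200, {"status": "ok"})
        elif self.path == "/ready":
            # Readiness: report 200 only after initialization has finished.
            if initialized.is_set():
                self._reply(200, {"status": "ready"})
            else:
                self._reply(503, {"status": "initializing"})
        else:
            self._reply(404, {"status": "not found"})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the demo quiet.


def status_of(port: int, path: str) -> int:
    """Return the HTTP status code for GET <path> on 127.0.0.1:<port>."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()
```

Serving `ProbeHandler` from a `ThreadingHTTPServer` in a background thread and calling `status_of(port, "/ready")` before and after `initialized.set()` shows the status code changing, while `/alive` answers 200 throughout.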

Changes

`server_details_router.py`

- Add _initialization_complete flag (default False)
- Add mark_initialization_complete() function to set the flag
- Add is_ready() function to check the flag
- Add /ready endpoint that checks initialization status
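The pieces listed above might fit together roughly as follows. This is a framework-agnostic sketch, not the contents of `server_details_router.py`: the `ready()` function, its return shape, and the payload strings are assumptions, and the real endpoint is registered on the server's router rather than returned as a tuple.

```python
# Module-level readiness flag; the server starts out "not ready".
_initialization_complete = False


def mark_initialization_complete() -> None:
    """Record that all startup services have finished initializing."""
    global _initialization_complete
    _initialization_complete = True


def is_ready() -> bool:
    """Report whether initialization has completed."""
    return _initialization_complete


def ready() -> tuple[int, dict[str, str]]:
    """Hypothetical handler body: (status code, JSON payload) for GET /ready."""
    if is_ready():
        return 200, {"status": "ready"}
    # Assumed payload; only the 503 status comes from the PR description.
    return 503, {"status": "initializing"}
```

A plain module-level boolean is enough here because the flag flips once, from False to True, and is never reset.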

`api.py`

- Import mark_initialization_complete
- Call it after all async initialization tasks complete

Usage

This endpoint should be used by Kubernetes readiness probes:

readinessProbe:
  httpGet:
    path: /ready
    port: 60000
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
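For local testing outside Kubernetes, a small polling helper can approximate what the probe does. This is only a sketch: `probe_ready` and `wait_until_ready` are hypothetical names, the defaults mirror `initialDelaySeconds` and `periodSeconds` from the snippet above, and the loop simply gives up after a fixed number of attempts, which is not how the kubelet applies `failureThreshold`.

```python
import http.client
import time
from typing import Callable


def probe_ready(host: str, port: int, path: str = "/ready", timeout: float = 2.0) -> bool:
    """Return True when GET <path> answers with HTTP 200, False otherwise."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timed out, or malformed reply: treat as not ready.
        return False
    finally:
        conn.close()


def wait_until_ready(
    check: Callable[[], bool],
    initial_delay: float = 5.0,
    period: float = 5.0,
    max_attempts: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it succeeds or max_attempts is exhausted."""
    sleep(initial_delay)
    for attempt in range(max_attempts):
        if check():
            return True
        if attempt < max_attempts - 1:
            sleep(period)  # Wait one period between attempts.
    return False
```

For example, `wait_until_ready(lambda: probe_ready("localhost", 60000))` blocks until the server reports ready or the attempts run out. The `sleep` parameter exists so the loop can be exercised without real delays.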

Related PRs

https://github.com/OpenHands/runtime-api/pull/382 - Adds readiness probe configuration to use this endpoint

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | `eclipse-temurin:17-jdk` | Link |
| python | amd64, arm64 | `nikolaik/python-nodejs:python3.12-nodejs22` | Link |
| golang | amd64, arm64 | `golang:1.21-bookworm` | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bfc0125-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-bfc0125-python \
  ghcr.io/openhands/agent-server:bfc0125-python

All tags pushed for this build

ghcr.io/openhands/agent-server:bfc0125-golang-amd64
ghcr.io/openhands/agent-server:bfc0125-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:bfc0125-golang-arm64
ghcr.io/openhands/agent-server:bfc0125-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:bfc0125-java-amd64
ghcr.io/openhands/agent-server:bfc0125-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:bfc0125-java-arm64
ghcr.io/openhands/agent-server:bfc0125-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:bfc0125-python-amd64
ghcr.io/openhands/agent-server:bfc0125-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:bfc0125-python-arm64
ghcr.io/openhands/agent-server:bfc0125-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:bfc0125-golang
ghcr.io/openhands/agent-server:bfc0125-java
ghcr.io/openhands/agent-server:bfc0125-python

About Multi-Architecture Support

- Each variant tag (e.g., bfc0125-python) is a multi-arch manifest supporting both amd64 and arm64
- Docker automatically pulls the correct architecture for your platform
- Individual architecture tags (e.g., bfc0125-python-amd64) are also available if needed

The existing /alive and /health endpoints return 200 immediately when the server process starts, but the server may still be initializing services (VSCode, desktop, tool preload, etc.). This caused issues where Kubernetes marked pods as 'ready' before they were actually ready to serve requests, leading to health check failures during the initialization window.

Changes:
- Add /ready endpoint that returns 503 until initialization is complete
- Add mark_initialization_complete() to signal when services are ready
- Call mark_initialization_complete() after all services have started

The /ready endpoint should be used by Kubernetes readiness probes:
- Returns 503 during initialization (server not ready for traffic)
- Returns 200 after all services have initialized

This allows the existing /alive endpoint to be used for liveness probes (is the process running?) while /ready is used for readiness probes (is the server ready to serve requests?).

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-01-24T04:51:16Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1810 at branch `fix-readiness-endpoint`

Feel free to include any additional details that might help me get this PR into a better state.

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

This PR adds a useful /ready endpoint for Kubernetes readiness checks, but there is a critical bug where the server marks itself as ready even if service initialization fails. See inline comments for details.

openhands-agent-server/openhands/agent_server/api.py

openhands-agent-server/openhands/agent_server/server_details_router.py

neubig · 2026-01-26T12:39:06Z

@OpenHands see the all-hands-bot review comments, and my responses to them. Fix according to the directions in my responses.

openhands-ai · 2026-01-26T12:39:15Z

I'm on it! neubig can track my progress at all-hands.dev

openhands-ai · 2026-01-26T12:42:44Z

I've addressed all the review comments from all-hands-bot according to neubig's directions. Here's a summary:

Changes Made (already pushed to `fix-readiness-endpoint`)

1. ✅ Fixed critical bug in `api.py` (neubig said: "Fix")

- Now captures results from asyncio.gather() and checks for exceptions before calling mark_initialization_complete()
- If any exceptions occurred during service initialization, the server remains in "initializing" state and logs the errors
- This prevents the server from being marked as ready when initialization actually failed

2. ✅ Removed unused `is_ready()` function from `server_details_router.py` (neubig said: "Fix")

- Removed the dead code that was never used

3. ✅ Created issue #1825 for asyncio.Event() suggestion (neubig said: "don't fix here but open an issue. Label it as 'good-first-issue'")

- Consider using asyncio.Event() for thread-safe initialization state #1825
- Labeled with "good first issue" and "enhancement"

4. ✅ Created issue #1826 for response format inconsistency (neubig said: "Don't fix in this PR, but open an issue")

- Standardize response format across health check endpoints #1826
- Labeled with "enhancement"

5. ❌ No action needed for async def suggestion (neubig said: "Deeming unnecessary for now, closing")

All pre-commit checks pass and changes have been pushed to the PR branch.

- Check asyncio.gather results for exceptions before calling mark_initialization_complete()
- Raise RuntimeError to prevent server from starting if service initialization fails with exceptions
- Remove unused is_ready() function

Related issues created:
- #1825: Consider using asyncio.Event() for thread-safe initialization state
- #1826: Standardize response format across health check endpoints

Co-authored-by: openhands <openhands@all-hands.dev>
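This commit message describes a stricter behavior than logging alone: startup fails with a RuntimeError. A hedged sketch of that variant, where `ensure_all_succeeded` and `start_services` are hypothetical names and the error message format is an assumption:

```python
import asyncio


def ensure_all_succeeded(results: list) -> None:
    """Raise RuntimeError if any gather() result is an exception."""
    failures = [r for r in results if isinstance(r, BaseException)]
    if failures:
        summary = "; ".join(repr(f) for f in failures)
        raise RuntimeError(f"Service initialization failed: {summary}")


async def start_services(*startups) -> None:
    """Run startup coroutines concurrently and fail loudly on any error."""
    results = await asyncio.gather(*startups, return_exceptions=True)
    # Raising here stops startup before the server is marked as ready.
    ensure_all_succeeded(results)
```

Collecting every failure before raising means the error reports all broken services at once, not just the first.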

all-hands-bot

The implementation looks solid and follows Kubernetes best practices for separating liveness and readiness probes. Just a couple minor type hint suggestions for consistency.

openhands-agent-server/openhands/agent_server/server_details_router.py

…outer.py Co-authored-by: OpenHands Bot <contact@all-hands.dev>

all-hands-bot · 2026-02-01T12:19:23Z

[Automatic Post]: This PR seems to be currently waiting for review. @tofarr @simonrosenberg, could you please take a look when you have a chance?

openhands-agent-server/openhands/agent_server/server_details_router.py

tofarr

I think we should also deprecate the existing methods, given that they do not perform correctly.

openhands-agent-server/openhands/agent_server/server_details_router.py

fix: apply ruff formatting fixes

8bfc68b

Co-authored-by: openhands <openhands@all-hands.dev>

neubig marked this pull request as ready for review January 24, 2026 05:36

neubig requested a review from tofarr January 24, 2026 05:36

all-hands-bot reviewed Jan 24, 2026

neubig marked this pull request as draft January 26, 2026 12:39

This was referenced Jan 26, 2026

Consider using asyncio.Event() for thread-safe initialization state #1825

Open

Standardize response format across health check endpoints #1826

Open
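For context on #1825, an `asyncio.Event`-based readiness state might look like the sketch below. `ReadinessState` and `wait_ready` are hypothetical names, not part of the issue or this PR. Note that `asyncio.Event` is documented as not thread-safe; it coordinates tasks within a single event loop, so cross-thread signalling would need `threading.Event` or `loop.call_soon_threadsafe()` instead.

```python
import asyncio


class ReadinessState:
    """Readiness flag backed by asyncio.Event (create it inside the running loop)."""

    def __init__(self) -> None:
        self._ready = asyncio.Event()

    def mark_initialization_complete(self) -> None:
        self._ready.set()

    def is_ready(self) -> bool:
        return self._ready.is_set()

    async def wait_ready(self, timeout: float) -> bool:
        """Wait up to <timeout> seconds for readiness; return whether it was reached."""
        try:
            await asyncio.wait_for(self._ready.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False
```

Compared with a plain boolean, the event lets other coroutines await readiness instead of polling the flag.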

neubig force-pushed the fix-readiness-endpoint branch from 66411b8 to 2de0407 Compare January 26, 2026 13:21

neubig force-pushed the fix-readiness-endpoint branch from 2de0407 to e653488 Compare January 26, 2026 13:26

neubig marked this pull request as ready for review January 26, 2026 13:28

Merge branch 'main' into fix-readiness-endpoint

ff1ef7e

all-hands-bot reviewed Jan 26, 2026

openhands-agent-server/openhands/agent_server/server_details_router.py (outdated)

openhands-agent-server/openhands/agent_server/server_details_router.py (outdated)

neubig and others added 2 commits January 26, 2026 08:32

Update openhands-agent-server/openhands/agent_server/server_details_r…

c291da1

…outer.py Co-authored-by: OpenHands Bot <contact@all-hands.dev>

Update openhands-agent-server/openhands/agent_server/server_details_r…

a7a5594

…outer.py Co-authored-by: OpenHands Bot <contact@all-hands.dev>

neubig requested a review from simonrosenberg January 28, 2026 15:48

Merge branch 'main' into fix-readiness-endpoint

1f6b2e6

juanmichelini self-requested a review February 6, 2026 03:20

tofarr reviewed Feb 6, 2026

openhands-agent-server/openhands/agent_server/server_details_router.py

tofarr requested changes Feb 6, 2026

openhands-agent-server/openhands/agent_server/server_details_router.py

tofarr approved these changes Feb 6, 2026

neubig merged commit 15d87e6 into main Feb 6, 2026
30 checks passed

neubig deleted the fix-readiness-endpoint branch February 6, 2026 12:01

neubig mentioned this pull request Feb 6, 2026

feat: add SANDBOX_STARTUP_GRACE_SECONDS env var for configurable startup timeout OpenHands/OpenHands#12741

Merged

5 participants
