refactor: remove redundant shutdown reconcile call by xingyaoww · Pull Request #1820 · OpenHands/software-agent-sdk

xingyaoww · 2026-01-24T23:16:29Z

Summary

This PR removes the reconcile() call that was added in PR #1791 after run completion. Testing shows this call is redundant because the WebSocket subscription handshake and the initial reconcile() call in __init__ are sufficient to ensure all events are captured.

Maybe this is unnecessary 🤔 We could remove it to be a bit conservative / preserve better performance on WS

Changes

Removed the following code from _wait_for_run_completion():

# Reconcile events to ensure we have all events that may have
# been emitted during the final moments of the run. This handles
# the race condition where events are published after the client
# detects "finished" status but before WebSocket delivers them.
self._state.events.reconcile()

Testing

All tests pass without this call:

test_events_not_lost_during_client_disconnection - 5 consecutive runs ✅
test_remote_conversation_over_real_server - 5 consecutive runs ✅
All 66 remote conversation tests ✅
All 1271 SDK tests ✅

Rationale

The original PR #1791 added two reconcile() calls:

In __init__ after wait_until_ready() - to catch events between initial REST sync and WebSocket subscription
In _wait_for_run_completion after detecting "finished" status - to catch events during shutdown

The second call appears to be unnecessary because:

The WebSocket subscription is already established and receiving events
The initial reconcile in __init__ ensures the client starts with a complete view
Events emitted during the run are delivered via WebSocket in real-time

Removing this call simplifies the code and reduces an unnecessary REST API call at the end of each run.

Related to #1791

@xingyaoww can click here to continue refining the PR

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:14cb068-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-14cb068-python \
  ghcr.io/openhands/agent-server:14cb068-python

All tags pushed for this build

ghcr.io/openhands/agent-server:14cb068-golang-amd64
ghcr.io/openhands/agent-server:14cb068-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:14cb068-golang-arm64
ghcr.io/openhands/agent-server:14cb068-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:14cb068-java-amd64
ghcr.io/openhands/agent-server:14cb068-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:14cb068-java-arm64
ghcr.io/openhands/agent-server:14cb068-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:14cb068-python-amd64
ghcr.io/openhands/agent-server:14cb068-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:14cb068-python-arm64
ghcr.io/openhands/agent-server:14cb068-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:14cb068-golang
ghcr.io/openhands/agent-server:14cb068-java
ghcr.io/openhands/agent-server:14cb068-python

About Multi-Architecture Support

Each variant tag (e.g., 14cb068-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 14cb068-python-amd64) are also available if needed

The reconcile() call after run completion was added to handle a potential race condition where events might be missed during client disconnection. However, testing shows that the WebSocket subscription handshake and the initial reconcile() call in __init__ are sufficient to ensure all events are captured. All tests pass without this call, including: - test_events_not_lost_during_client_disconnection (5 consecutive runs) - test_remote_conversation_over_real_server (5 consecutive runs) - All 66 remote conversation tests - All 1271 SDK tests Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

This PR removes defensive code that was specifically added to handle a documented race condition. While tests pass, this doesn't prove the race condition cannot occur—it may only manifest under specific timing conditions. See inline comments for details.

all-hands-bot · 2026-01-24T23:24:59Z

openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py

-                    # the race condition where events are published after the client
-                    # detects "finished" status but before WebSocket delivers them.
-                    self._state.events.reconcile()
                    logger.info(


🔴 Critical: The reconcile() call that was removed here contradicts the explicit documentation in test_events_not_lost_during_client_disconnection, which states:

"The fix: After run() completes, we call reconcile() to fetch any events that may have been missed via WebSocket. This ensures the client always has a complete view of all events."

The test describes a real race condition:

Server emits final events (ActionEvent, ObservationEvent)

Status polling via REST API detects "finished"

run() returns and client may start WebSocket disconnection

Events emitted during this window may not be delivered via WebSocket

Why the race exists:

REST API (status polling) and WebSocket (event delivery) are independent, asynchronous channels running in separate threads

There's no synchronization guarantee that WebSocket events arrive before status polling returns "finished"

Events could be in-flight, buffered, or delayed when the status changes

Why test passing isn't sufficient evidence:

Race conditions are timing-dependent and may only manifest under specific conditions (network latency, server load, concurrent operations, specific event timing)

5 test runs is not statistically significant for catching intermittent race conditions

Tests may not simulate the exact production conditions where this race occurs (especially under load or with network delays)

Recommendation:
Keep the reconcile() call unless you can demonstrate:

A technical reason why the race condition is impossible (e.g., synchronization mechanism between REST status and WebSocket events)

OR architectural changes that guarantee REST and WebSocket delivery ordering

OR extensive stress testing with deliberately introduced delays/latency to prove events are never lost

The cost of this REST API call (milliseconds) is minimal compared to the risk of intermittent event loss in production, which would be very difficult to debug.

enyst · 2026-01-24T23:37:11Z

@OpenHands WDYT about this PR?

Make sure you read 1791 and investigate the code throughfully. Do not guess. Make the project and try to repro a shutdown timing issue. You are allowed to explicitly set up conditions for a script like test_events_not_lost_during_client_disconnection for testing purposes.

openhands-ai · 2026-01-24T23:37:20Z

I'm on it! enyst can track my progress at all-hands.dev

openhands-ai · 2026-01-24T23:42:38Z

I checked out the PR branch openhands/remove-shutdown-reconcile (PR refactor: remove redundant shutdown reconcile call #1820) and reviewed the exact diff vs origin/main: it only removes the post-run self._state.events.reconcile() call in RemoteConversation._wait_for_run_completion().
I then investigated PR fix: ensure WebSocket is subscribed before operations #1791’s intent by reading the in-repo regression test tests/cross/test_remote_conversation_live_server.py::test_events_not_lost_during_client_disconnection, which explicitly documents the shutdown/disconnect timing race and states the fix is a reconcile after run() completes.
I set up the project environment (using uv and the repo’s .venv) and ran the regression test:
- It passed in a single run.
- When run repeatedly in a loop to induce timing variance, I reproduced an intermittent failure on this PR branch: REST contained the expected ActionEvent(finish)/ObservationEvent(finish) but the client’s websocket-cached events were missing at least the ActionEvent(finish) after conv.run() returned (status observed as idle). This matches the race described in fix: ensure WebSocket is subscribed before operations #1791.
Conclusion: removing the post-run reconcile() is not safe; it can reintroduce the shutdown timing/event-loss issue that fix: ensure WebSocket is subscribed before operations #1791 addressed.

Checklist against your instructions:

Read fix: ensure WebSocket is subscribed before operations #1791 context (via the referenced regression test and code paths).
Investigated code thoroughly (diff + relevant functions).
Built/ran the project tests in a real environment (created .venv via uv sync).
Tried to repro shutdown timing issue and succeeded by explicitly stressing conditions (repeat-run loop).
Did not guess; conclusions are based on observed failing behavior.
No code changes were made, so no push/PR update was performed.

View full conversation

smolpaws · 2026-01-24T23:53:26Z

I reproduced the shutdown/disconnect timing event-loss that PR #1791 was intended to fix, on this PR branch (where the post-run reconcile() is removed).

Setup

I synced deps into the repo-managed venv and ran tests with that interpreter:

cd /workspace/project/software-agent-sdk
uv sync --dev
./.venv/bin/python -m pytest -q tests/cross/test_remote_conversation_live_server.py::test_events_not_lost_during_client_disconnection -q

Single run

A single run often passes.

Repro: push timing variance via a loop

Running it repeatedly (to introduce scheduling/timing variance) produces intermittent failures:

cd /workspace/project/software-agent-sdk
for i in $(seq 1 20); do
  echo "RUN $i"
  ./.venv/bin/python -m pytest -q \
    tests/cross/test_remote_conversation_live_server.py::test_events_not_lost_during_client_disconnection -q \
    || break
done

Failure excerpt (representative)

On a failing iteration, the regression is exactly what #1791 describes: the REST API has the tail events persisted, but the client-side (WS-cached) event list is missing them after conv.run() returns.

FAILED tests/cross/test_remote_conversation_live_server.py::test_events_not_lost_during_client_disconnection - AssertionError: ActionEvent with finish tool not found in client events. REST API has 1 ActionEvent(s) but client has 0. Reconciliation should have fetched missing events. Client events: ['ConversationStateUpdateEvent(N/A)', 'SystemPromptEvent(N/A)', 'MessageEvent(N/A)']. REST events: ['SystemPromptEvent(N/A)', 'MessageEvent(N/A)', 'ConversationStateUpdateEvent(N/A)', 'ActionEvent(finish)', 'ObservationEvent(finish)', 'ConversationStateUpdateEvent(N/A)']

The captured logs also show the client observes completion immediately, then the websocket disconnects:

Run completed with status: idle (elapsed: 0.0s)
Event websocket disconnected: <conversation-id>

Conclusion

Removing the post-run self._state.events.reconcile() in RemoteConversation._wait_for_run_completion() is not safe: it reintroduces an intermittent race where conv.run() returns after polling sees a terminal status, but the websocket drops/doesn’t deliver tail events that are already persisted and visible via REST.

Keeping the post-run reconcile (or replacing it with an equivalent “drain tail events” mechanism) is necessary for correctness under timing variance.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-01-25T01:00:36Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/conversation/impl
remote_conversation.py	549	103	81%	127, 154, 167, 169–172, 182, 204–205, 210–213, 289, 299–301, 307, 348, 480–483, 485, 505–509, 514–517, 520, 651–652, 656–657, 668, 687–688, 707, 726–729, 731–732, 750, 756–757, 761, 766–767, 772–774, 777–781, 783–784, 788, 790–798, 800, 837, 969–970, 974, 979–983, 989–995, 1008–1009, 1085, 1092, 1098–1099, 1127, 1133–1134, 1141–1142
TOTAL	16430	4799	70%

enyst

I had the agent retry, including for 50x times => test_events_not_lost_during_client_disconnection never failed.

Technically it seems what the test is testing for, is not fully covered by the server side, but maybe that's a very rare occurrence. We could merge this PR, and see if it becomes a problem (or we still see issues) in reality.

enyst · 2026-01-26T16:12:10Z

Lol... GitHub action in PR #1829 decided to have fun 🫠

FAILED tests/cross/test_remote_conversation_live_server.py::test_events_not_lost_during_client_disconnection - AssertionError: ActionEvent with finish tool not found in client events. REST API has 1 ActionEvent(s) but client has 0. Reconciliation should have fetched missing events. Client events: ['ConversationStateUpdateEvent(N/A)', 'SystemPromptEvent(N/A)', 'MessageEvent(N/A)']. REST events: ['SystemPromptEvent(N/A)', 'MessageEvent(N/A)', 'ConversationStateUpdateEvent(N/A)', 'ActionEvent(finish)', 'ObservationEvent(finish)', 'ConversationStateUpdateEvent(N/A)']
assert False
============ 1 failed, 55 passed, 1 skipped, 15 warnings in 33.79s =============
Error: Process completed with exit code 1.

xingyaoww · 2026-01-26T16:32:43Z

@enyst 🫠 ugh, the action seems like duplicated events? @OpenHands can you look at https://github.com/OpenHands/software-agent-sdk/actions/runs/21364607784/job/61492749827?pr=1829#step:7:5709, understand why it fails, and help me reliably can reproduce it & send a new PR if you are able to successfully reproduce it

openhands-ai · 2026-01-26T16:32:53Z

I'm on it! xingyaoww can track my progress at all-hands.dev

The reconcile() call after run completion was removed in PR #1820, but this caused a race condition where events emitted during the final moments of the run could be lost if the WebSocket didn't deliver them before run() returned. This was observed in CI where test_events_not_lost_during_client_disconnection failed because the client only received 3 events while the REST API had 6 events - the ActionEvent(finish) and ObservationEvent(finish) were missing. The fix restores the reconcile() call in _wait_for_run_completion() to ensure all events are captured after run completion. This is safe because reconcile() is idempotent and will only add events that are missing from the client's cache. Fixes the flaky test failure in PR #1829. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-01-26T16:42:03Z

Summary

I analyzed the GitHub Actions failure in PR #1829 and identified the root cause. Here's what I found:

Problem Analysis

The test test_events_not_lost_during_client_disconnection failed with:

AssertionError: ActionEvent with finish tool not found in client events. 
REST API has 1 ActionEvent(s) but client has 0.
Client events: ['ConversationStateUpdateEvent(N/A)', 'SystemPromptEvent(N/A)', 'MessageEvent(N/A)']. 
REST events: ['SystemPromptEvent(N/A)', 'MessageEvent(N/A)', 'ConversationStateUpdateEvent(N/A)', 'ActionEvent(finish)', 'ObservationEvent(finish)', 'ConversationStateUpdateEvent(N/A)']

Root Cause: PR #1820 removed the reconcile() call in _wait_for_run_completion(), which was supposed to catch events that the WebSocket didn't deliver before run() returned. In the CI failure, the run completed so fast (0.0s elapsed) that the WebSocket didn't have time to deliver the ActionEvent(finish) and ObservationEvent(finish) events.

Note: The comment about "duplicated events" was incorrect - the issue is about missing events, not duplicated ones. The PR #1829 changes (StuckDetector cap) are unrelated to this failure.

Fix

I restored the reconcile() call in _wait_for_run_completion() to ensure all events are captured after run completion. This is safe because reconcile() is idempotent and only adds events missing from the client's cache.

Changes Made

File: openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py
Change: Added back the reconcile() call after detecting run completion (5 lines added)

PR Created

PR #1832: #1832

This fix addresses the flaky test failure observed in PR #1829 by ensuring the client always has a complete view of all events after run() returns.

View full conversation

The reconcile() call after run completion was removed in PR #1820, but this caused a race condition where events emitted during the final moments of the run could be lost if the WebSocket didn't deliver them before run() returned. This was observed in CI where test_events_not_lost_during_client_disconnection failed because the client only received 3-4 events while the REST API had 6 events - the ActionEvent(finish) and ObservationEvent(finish) were missing. Reproduction: - Inject a 3s delay in the WebSocket callback for finish events - Run the conversation with a finish tool call - Observe that without the reconcile() call, the client is missing events The fix restores the reconcile() call in _wait_for_run_completion() to ensure all events are captured after run completion. This is safe because reconcile() is idempotent and will only add events that are missing from the client's cache. Fixes the flaky test failure in PR #1829. Co-authored-by: openhands <openhands@all-hands.dev>

xingyaoww marked this pull request as ready for review January 24, 2026 23:20

xingyaoww requested review from enyst and jpshackelford January 24, 2026 23:20

all-hands-bot reviewed Jan 24, 2026

View reviewed changes

OpenHands deleted a comment from smolpaws Jan 24, 2026

test: add WS-callback-lag coverage for post-run reconcile

c7f44ef

Co-authored-by: openhands <openhands@all-hands.dev>

enyst approved these changes Jan 25, 2026

View reviewed changes

xingyaoww merged commit f40f2de into main Jan 26, 2026
25 checks passed

xingyaoww deleted the openhands/remove-shutdown-reconcile branch January 26, 2026 03:50

xingyaoww mentioned this pull request Jan 26, 2026

fix: wait for WebSocket terminal status to prevent event loss #1832

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: remove redundant shutdown reconcile call#1820

refactor: remove redundant shutdown reconcile call#1820
xingyaoww merged 2 commits intomainfrom
openhands/remove-shutdown-reconcile

xingyaoww commented Jan 24, 2026 •

edited by github-actions bot

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot Jan 24, 2026

Uh oh!

enyst commented Jan 24, 2026

Uh oh!

openhands-ai bot commented Jan 24, 2026

Uh oh!

openhands-ai bot commented Jan 24, 2026

Uh oh!

smolpaws commented Jan 24, 2026

Uh oh!

github-actions bot commented Jan 25, 2026

Uh oh!

enyst left a comment

Uh oh!

Uh oh!

enyst commented Jan 26, 2026

Uh oh!

xingyaoww commented Jan 26, 2026

Uh oh!

openhands-ai bot commented Jan 26, 2026

Uh oh!

openhands-ai bot commented Jan 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

xingyaoww commented Jan 24, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Rationale

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

all-hands-bot Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

enyst commented Jan 24, 2026

Uh oh!

openhands-ai bot commented Jan 24, 2026

Uh oh!

openhands-ai bot commented Jan 24, 2026

Uh oh!

smolpaws commented Jan 24, 2026

Setup

Single run

Repro: push timing variance via a loop

Failure excerpt (representative)

Conclusion

Uh oh!

github-actions bot commented Jan 25, 2026

Uh oh!

enyst left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

enyst commented Jan 26, 2026

Uh oh!

xingyaoww commented Jan 26, 2026

Uh oh!

openhands-ai bot commented Jan 26, 2026

Uh oh!

openhands-ai bot commented Jan 26, 2026

Summary

Problem Analysis

Fix

Changes Made

PR Created

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

xingyaoww commented Jan 24, 2026 •

edited by github-actions bot

Loading