DRAFT: Combined branch for swe-bench-multimodal evaluation by neubig · Pull Request #1818 · OpenHands/software-agent-sdk

neubig · 2026-01-24T15:31:01Z

Summary

This is a combined branch for running swe-bench-multimodal evaluation that includes:

PR fix: prevent duplicate events in bash polling via order__gt filtering #1816 - fix/polling-output-duplication-bug - Working branch for swe-bench-multimodal evaluation
PR fix: handle PS1 metadata corruption in command output #1817 - fix/ps1-corruption-test - Fix for PS1 metadata corruption from ASCII art in command output

Included Fixes

PS1 Corruption Fix (#1817)

Handles cases where programs like grunt output ASCII art that interrupts PS1 JSON blocks, causing "Expected at least one PS1 metadata block, but got 0" errors.

Polling Output Duplication Fix (#1816)

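Prevents stdout/stderr from being duplicated across bash polling iterations by filtering events server-side with order__gt. This fix is needed for the swe-bench-multimodal evaluation workflow.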

Usage

This branch is intended for evaluation runs only. Do not merge - instead merge the individual PRs (#1816 and #1817) separately after review.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:8bdb1a9-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-8bdb1a9-python \
  ghcr.io/openhands/agent-server:8bdb1a9-python

All tags pushed for this build

ghcr.io/openhands/agent-server:8bdb1a9-golang-amd64
ghcr.io/openhands/agent-server:8bdb1a9-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:8bdb1a9-golang-arm64
ghcr.io/openhands/agent-server:8bdb1a9-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:8bdb1a9-java-amd64
ghcr.io/openhands/agent-server:8bdb1a9-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:8bdb1a9-java-arm64
ghcr.io/openhands/agent-server:8bdb1a9-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:8bdb1a9-python-amd64
ghcr.io/openhands/agent-server:8bdb1a9-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:8bdb1a9-python-arm64
ghcr.io/openhands/agent-server:8bdb1a9-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:8bdb1a9-golang
ghcr.io/openhands/agent-server:8bdb1a9-java
ghcr.io/openhands/agent-server:8bdb1a9-python

About Multi-Architecture Support

Each variant tag (e.g., 8bdb1a9-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 8bdb1a9-python-amd64) are also available if needed

This adds tests that demonstrate a bug in RemoteWorkspaceMixin where the polling loop duplicates stdout/stderr output across multiple poll iterations.

The bug occurs because:
1. The polling loop fetches ALL events from the beginning on each iteration
2. Events are appended to stdout_parts without deduplication
3. This causes output like: A + B + A + B + C + A + B + C + D

This bug causes base64 decoding failures when capturing conversation trajectories, as the duplicated base64 output becomes invalid:
- 'Incorrect padding'
- 'Invalid base64-encoded string: number of data characters cannot be 1 more than a multiple of 4'

The tests document the bug and will help verify the fix.

Co-authored-by: openhands <openhands@all-hands.dev>
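The duplication pattern described above can be reproduced in a few lines. This is only a minimal sketch: `poll_without_dedup` and the list-of-snapshots input are hypothetical stand-ins for the real polling loop, which is not shown on this page.

```python
def poll_without_dedup(snapshots):
    """Append every event from every poll, as the buggy loop is described to do."""
    stdout_parts = []
    for events in snapshots:
        for event in events:
            stdout_parts.append(event)
    return "".join(stdout_parts)


# Assumption: each poll returns ALL events seen so far, not just the new ones.
snapshots = [["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]]
combined = poll_without_dedup(snapshots)
print(combined)  # ABABCABCD
```

The printed string matches the A + B + A + B + C + A + B + C + D pattern from the commit message.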

Remove unused variables flagged by ruff. Co-authored-by: openhands <openhands@all-hands.dev>

Tests now assert the CORRECT expected behavior:
- Output should be deduplicated across poll iterations
- Base64 output should decode correctly

These tests FAIL because the bug exists, demonstrating:
- Expected: 'CHUNK1CHUNK2CHUNK3'
- Actual: 'CHUNK1CHUNK1CHUNK2CHUNK1CHUNK2CHUNK3'

When the fix is implemented, these tests will pass.

Co-authored-by: openhands <openhands@all-hands.dev>

Added test_base64_decode_produces_incorrect_padding_error which:
1. Simulates the polling loop with duplicated events
2. Calls base64.b64decode() on the corrupted output
3. Fails with: binascii.Error: Incorrect padding

This reproduces the exact error seen in production logs during trajectory capture.

Co-authored-by: openhands <openhands@all-hands.dev>
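As a rough illustration of why duplicated chunks break decoding, the sketch below splits a base64 string into chunks and accumulates them with and without duplication. The payload, chunk size, and `accumulate` helper are all made up for this example and are not taken from the actual test. The exact error message depends on the corrupted length modulo 4, so the sketch only checks for `binascii.Error`.

```python
import base64
import binascii

payload = b"abcdef"
encoded = base64.b64encode(payload).decode()  # 8 characters, no '=' padding
chunks = [encoded[i:i + 3] for i in range(0, len(encoded), 3)]


def accumulate(chunks, dedup):
    """Simulate polls where poll n returns the first n chunks."""
    parts = []
    for n in range(1, len(chunks) + 1):
        snapshot = chunks[:n]
        # With dedup only the newest chunk is appended; without it, all of them.
        parts.extend(snapshot[-1:] if dedup else snapshot)
    return "".join(parts)


clean = accumulate(chunks, dedup=True)
corrupted = accumulate(chunks, dedup=False)

try:
    base64.b64decode(corrupted)
    decode_failed = False
except binascii.Error:
    decode_failed = True

print(len(clean), len(corrupted), decode_failed)
```

Here the clean string still decodes to the original payload, while the corrupted one has a length that is no longer a multiple of 4.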

The bash events search API returns ALL events from the beginning on each poll iteration. Without deduplication, output gets duplicated across polls:
- Poll 1: [A] → append A
- Poll 2: [A, B] → append A, B (A duplicated!)
- Poll 3: [A, B, C] → append A, B, C (A, B duplicated again!)

This caused base64 decoding failures in trajectory capture because the duplicated output length was no longer a multiple of 4:
- Original: 68 chars (valid)
- Duplicated: 119 chars (119 % 4 = 3 → 'Incorrect padding' error)

Fix: Track seen event IDs and skip duplicates. Events without an ID are processed without deduplication (backwards compatibility).

Co-authored-by: openhands <openhands@all-hands.dev>
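The seen-ID approach described in this commit could look roughly like the following. The event shape (a dict with `id` and `stdout` keys) and the `collect_output` name are assumptions made for illustration; the real client code may be structured differently.

```python
def collect_output(snapshots):
    """Join stdout across polls, skipping events whose ID was already seen."""
    seen_ids = set()
    parts = []
    for events in snapshots:
        for event in events:
            event_id = event.get("id")
            if event_id is not None:
                if event_id in seen_ids:
                    continue  # already appended on an earlier poll
                seen_ids.add(event_id)
            # Events without an ID fall through and are always appended.
            parts.append(event["stdout"])
    return "".join(parts)


a = {"id": "e1", "stdout": "CHUNK1"}
b = {"id": "e2", "stdout": "CHUNK2"}
c = {"id": "e3", "stdout": "CHUNK3"}
result = collect_output([[a], [a, b], [a, b, c]])
print(result)  # CHUNK1CHUNK2CHUNK3
```

Note that events without an ID are never deduplicated in this sketch, mirroring the backwards-compatibility behavior the commit describes.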

This test demonstrates a bug where PS1 metadata blocks get corrupted when command output (like grunt's ASCII cat art) is interleaved with the PS1 prompt output. The bug causes 'Expected at least one PS1 metadata block, but got 0' errors in production (seen in eval-21310432128-claude-son).

Root cause: The PS1 regex uses non-greedy matching which spans from the FIRST ###PS1JSON### to the ONLY ###PS1END###, even when ASCII art corrupts the first block. This creates one giant invalid match that fails JSON parsing.

Co-authored-by: openhands <openhands@all-hands.dev>
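The root cause can be illustrated with a small, self-contained sketch. Only the two marker strings come from the commit message; the regex, the sample output, and the JSON field name are assumptions, and the real pattern in metadata.py may differ.

```python
import json
import re

PS1_BEGIN = "###PS1JSON###"
PS1_END = "###PS1END###"
# Hypothetical non-greedy pattern between the two markers.
PS1_PATTERN = re.compile(
    re.escape(PS1_BEGIN) + r"(.*?)" + re.escape(PS1_END), re.DOTALL
)

output = (
    "###PS1JSON###\n"
    '{"exit_co'             # first block is cut off here...
    "\n /\\_/\\  meow\n"     # ...by ASCII art, so its end marker never appears
    "###PS1JSON###\n"
    '{"exit_code": 0}\n'
    "###PS1END###\n"
)

matches = PS1_PATTERN.findall(output)

valid_blocks = []
for body in matches:
    try:
        valid_blocks.append(json.loads(body.strip()))
    except json.JSONDecodeError:
        pass  # the single oversized match is not valid JSON

print(len(matches), len(valid_blocks))  # 1 0
```

A non-greedy match still starts at the first begin marker, so the only match swallows the corrupted block, the ASCII art, and the second begin marker, leaving zero parseable blocks.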

- Remove unused imports (re, patch, CMD_OUTPUT_PS1_BEGIN, CMD_OUTPUT_PS1_END)
- Fix line length issues in assertion messages and test data

Co-authored-by: openhands <openhands@all-hands.dev>

Instead of client-side deduplication, add server-side filtering:

API changes (bash_router.py, bash_service.py):
- Add order__gt query parameter to /api/bash/bash_events/search
- Filter BashOutput events where order > order__gt
- More efficient: reduces data transfer on each poll

Client changes (remote_workspace_mixin.py):
- Track last_order seen (starts at -1)
- Pass order__gt parameter on subsequent polls
- First poll gets all events, subsequent polls get only new ones

This is more efficient than client-side deduplication because:
- Less data transferred over the network
- Server does the filtering instead of client
- No need to track seen event IDs

Co-authored-by: openhands <openhands@all-hands.dev>
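The client and server halves of this change can be sketched together using an in-memory stand-in for the API. `search_events` and `poll_all` are hypothetical names, and the event dicts are an assumed shape; only the `order__gt` parameter and the `last_order = -1` starting value come from the commit message.

```python
def search_events(events, order__gt=None):
    """Stand-in for the server-side filter behind the search endpoint."""
    if order__gt is None:
        return list(events)
    return [e for e in events if e["order"] > order__gt]


def poll_all(timeline):
    """Poll each server state in turn, requesting only events newer than last_order."""
    last_order = -1
    parts = []
    for server_state in timeline:
        # First poll sends no filter; later polls pass the highest order seen.
        order_filter = None if last_order < 0 else last_order
        for event in search_events(server_state, order__gt=order_filter):
            parts.append(event["stdout"])
            last_order = max(last_order, event["order"])
    return "".join(parts)


events = [
    {"order": 0, "stdout": "CHUNK1"},
    {"order": 1, "stdout": "CHUNK2"},
    {"order": 2, "stdout": "CHUNK3"},
]
# The server holds a growing list of events across three polls.
output = poll_all([events[:1], events[:2], events[:3]])
print(output)  # CHUNK1CHUNK2CHUNK3
```

Because the filter is applied before events leave the server, the client keeps a single integer of state instead of a set of seen IDs.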

The API should prevent duplicates via order__gt filtering, but add a client-side assertion as a safety check. If the API ever returns a duplicate event, we fail fast with a clear error message rather than silently corrupting output.

Also adds test_assertion_fires_on_duplicate_events to verify this behavior.

Co-authored-by: openhands <openhands@all-hands.dev>
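A fail-fast check of this kind might be sketched as follows. The helper name, the error text, and the choice to key on `order` are assumptions for illustration, not the actual assertion in the client.

```python
def append_new_events(events, seen_orders, parts):
    """Append event output, failing fast if an event order repeats."""
    for event in events:
        order = event["order"]
        if order in seen_orders:
            raise AssertionError(
                f"Duplicate bash event received (order={order}); "
                "expected order__gt filtering to exclude it"
            )
        seen_orders.add(order)
        parts.append(event["stdout"])


seen_orders, parts = set(), []
append_new_events([{"order": 0, "stdout": "CHUNK1"}], seen_orders, parts)

try:
    # A misbehaving server resends order 0 alongside the new event.
    append_new_events(
        [{"order": 0, "stdout": "CHUNK1"}, {"order": 1, "stdout": "CHUNK2"}],
        seen_orders,
        parts,
    )
    duplicate_detected = False
except AssertionError:
    duplicate_detected = True

print(duplicate_detected, parts)
```

Raising on the first repeated event means no further output is appended once a duplicate is detected.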

When programs like grunt output ASCII art (e.g., their cat mascot), it can interrupt the PS1 JSON block being printed to the terminal. This causes the regex to match from the FIRST ###PS1JSON### to the ONLY ###PS1END###, creating one giant match with corrupted content that fails JSON parsing.

The fix:
- Detect when a regex match contains a nested ###PS1JSON### marker
- Extract the content after the LAST marker (which is the valid JSON)
- Use a _SyntheticMatch class to provide a match-like interface

This recovers valid PS1 blocks even when earlier blocks are corrupted, preventing 'Expected at least one PS1 metadata block, but got 0' errors.

Fixes eval job errors like eval-21310432128-claude-son where grunt's ASCII cat art was causing PS1 parsing failures.

Co-authored-by: openhands <openhands@all-hands.dev>
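The recovery step can be sketched as below. This is a simplified illustration under stated assumptions: the regex, the sample output, and `extract_ps1_blocks` are hypothetical, and the sketch returns parsed dicts directly rather than wrapping the result in a match-like object as the `_SyntheticMatch` class in the PR is described to do.

```python
import json
import re

PS1_BEGIN = "###PS1JSON###"
PS1_END = "###PS1END###"
# Hypothetical non-greedy pattern between the two markers.
PS1_PATTERN = re.compile(
    re.escape(PS1_BEGIN) + r"(.*?)" + re.escape(PS1_END), re.DOTALL
)


def extract_ps1_blocks(output):
    """Parse PS1 JSON blocks, recovering the last block inside an oversized match."""
    blocks = []
    for match in PS1_PATTERN.finditer(output):
        body = match.group(1)
        if PS1_BEGIN in body:
            # A nested begin marker means an earlier block was interrupted;
            # keep only the text after the LAST marker.
            body = body.rsplit(PS1_BEGIN, 1)[1]
        try:
            blocks.append(json.loads(body.strip()))
        except json.JSONDecodeError:
            continue
    return blocks


corrupted_output = (
    "###PS1JSON###\n"
    '{"exit_co'
    "\n /\\_/\\  meow\n"
    "###PS1JSON###\n"
    '{"exit_code": 0}\n'
    "###PS1END###\n"
)
recovered = extract_ps1_blocks(corrupted_output)
print(recovered)  # [{'exit_code': 0}]
```

Splitting on the last begin marker discards the truncated block and the ASCII art, so the final, intact JSON block is still parsed.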

github-actions · 2026-01-24T15:36:54Z

Coverage Report

File	Stmts	Miss	Cover	Missing
openhands-agent-server/openhands/agent_server
bash_router.py	43	12	72%	78–81, 90–91, 105–110
bash_service.py	188	22	88%	69–71, 140–141, 143–144, 174–176, 253, 258–259, 285–286, 313–314, 316, 323–324, 354–355
openhands-sdk/openhands/sdk/workspace/remote
remote_workspace_mixin.py	118	9	92%	308, 314–317, 335, 341–343
openhands-tools/openhands/tools/terminal
metadata.py	78	27	65%	33–34, 37, 39, 44, 46–47, 50, 53, 116, 120–121, 123–124, 126, 129–130, 132–133, 137–139, 141, 145, 163–164, 168
TOTAL	17455	4636	73%

openhands-ai · 2026-02-05T05:26:09Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Agent Server
There are merge conflicts

If you'd like me to help, just leave a comment, like

@OpenHands please fix the merge conflicts on PR #1818 at branch `eval/swe-bench-multimodal-with-ps1-fix`

or

@OpenHands please fix the failing actions on PR #1818 at branch `eval/swe-bench-multimodal-with-ps1-fix`

Feel free to include any additional details that might help me get this PR into a better state.

openhands-agent added 12 commits January 24, 2026 13:55

fix: resolve pre-commit lint errors

410eaf0

Remove unused variables flagged by ruff. Co-authored-by: openhands <openhands@all-hands.dev>

fix: address lint issues in PS1 corruption test

a6f2a66

- Remove unused imports (re, patch, CMD_OUTPUT_PS1_BEGIN, CMD_OUTPUT_PS1_END)
- Fix line length issues in assertion messages and test data

Co-authored-by: openhands <openhands@all-hands.dev>

style: apply ruff formatting to test file

2c65754

Co-authored-by: openhands <openhands@all-hands.dev>

Merge remote-tracking branch 'origin/fix/ps1-corruption-test' into ev…

c6592b7

…al/swe-bench-multimodal-with-ps1-fix

juanmichelini and others added 2 commits January 27, 2026 00:09

Merge branch 'main' into eval/swe-bench-multimodal-with-ps1-fix

706aff6

Merge branch 'main' into eval/swe-bench-multimodal-with-ps1-fix

1471d62

chore: re-trigger CI

4476c2a

Co-authored-by: openhands <openhands@all-hands.dev>

neubig closed this Feb 5, 2026

neubig reopened this Feb 5, 2026

neubig closed this Feb 6, 2026

neubig wants to merge 15 commits into main from eval/swe-bench-multimodal-with-ps1-fix


3 participants
