DRAFT: Combined branch for swe-bench-multimodal evaluation#1818

Closed
neubig wants to merge 15 commits into main from eval/swe-bench-multimodal-with-ps1-fix

@neubig neubig commented Jan 24, 2026

Summary

This is a combined branch for running swe-bench-multimodal evaluation that includes:

  1. PR #1816 (fix: prevent duplicate events in bash polling via order__gt filtering), branch fix/polling-output-duplication-bug: working branch for swe-bench-multimodal evaluation
  2. PR #1817 (fix: handle PS1 metadata corruption in command output), branch fix/ps1-corruption-test: fix for PS1 metadata corruption caused by ASCII art in command output

Included Fixes

PS1 Corruption Fix (#1817)

Handles cases where programs like grunt output ASCII art that interrupts PS1 JSON blocks, causing "Expected at least one PS1 metadata block, but got 0" errors.

Polling Output Duplication Fix (#1816)

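Stops the bash polling loop from duplicating stdout/stderr across poll iterations by filtering events server-side with order__gt. The duplication broke base64 decoding during trajectory capture in the swe-bench-multimodal evaluation workflow.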

Usage

This branch is intended for evaluation runs only. Do not merge it; instead, merge the individual PRs (#1816 and #1817) separately after review.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:8bdb1a9-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-8bdb1a9-python \
  ghcr.io/openhands/agent-server:8bdb1a9-python

All tags pushed for this build

ghcr.io/openhands/agent-server:8bdb1a9-golang-amd64
ghcr.io/openhands/agent-server:8bdb1a9-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:8bdb1a9-golang-arm64
ghcr.io/openhands/agent-server:8bdb1a9-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:8bdb1a9-java-amd64
ghcr.io/openhands/agent-server:8bdb1a9-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:8bdb1a9-java-arm64
ghcr.io/openhands/agent-server:8bdb1a9-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:8bdb1a9-python-amd64
ghcr.io/openhands/agent-server:8bdb1a9-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:8bdb1a9-python-arm64
ghcr.io/openhands/agent-server:8bdb1a9-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:8bdb1a9-golang
ghcr.io/openhands/agent-server:8bdb1a9-java
ghcr.io/openhands/agent-server:8bdb1a9-python

About Multi-Architecture Support

  • Each variant tag (e.g., 8bdb1a9-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 8bdb1a9-python-amd64) are also available if needed

This adds tests that demonstrate a bug in RemoteWorkspaceMixin where
the polling loop duplicates stdout/stderr output across multiple poll
iterations.

The bug occurs because:
1. The polling loop fetches ALL events from the beginning on each iteration
2. Events are appended to stdout_parts without deduplication
3. This causes output like: A + B + A + B + C + A + B + C + D

This bug causes base64 decoding failures when capturing conversation
trajectories, as the duplicated base64 output becomes invalid:
- 'Incorrect padding'
- 'Invalid base64-encoded string: number of data characters cannot be
   1 more than a multiple of 4'

The tests document the bug and will help verify the fix.

Co-authored-by: openhands <openhands@all-hands.dev>
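
The failure mode described in this commit can be reproduced with a small simulation. This is only a sketch, not the actual RemoteWorkspaceMixin code: `poll_without_dedup` and the list-of-lists `polls` input are hypothetical stand-ins for a polling loop whose search call returns every event seen so far.

```python
# Hypothetical simulation of the buggy loop: each poll returns ALL events
# so far, and the client appends everything it receives.
def poll_without_dedup(polls):
    stdout_parts = []
    for events in polls:
        for event in events:
            stdout_parts.append(event)  # no deduplication
    return "".join(stdout_parts)


# Three poll iterations, each one returning the full history again.
polls = [["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]]
duplicated = poll_without_dedup(polls)
print(duplicated)  # ABABCABCD, i.e. A + B + A + B + C + A + B + C + D
```
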
Remove unused variables flagged by ruff.

Co-authored-by: openhands <openhands@all-hands.dev>
Tests now assert the CORRECT expected behavior:
- Output should be deduplicated across poll iterations
- Base64 output should decode correctly

These tests FAIL because the bug exists, demonstrating:
- Expected: 'CHUNK1CHUNK2CHUNK3'
- Actual: 'CHUNK1CHUNK1CHUNK2CHUNK1CHUNK2CHUNK3'

When the fix is implemented, these tests will pass.

Co-authored-by: openhands <openhands@all-hands.dev>
Added test_base64_decode_produces_incorrect_padding_error which:
1. Simulates the polling loop with duplicated events
2. Calls base64.b64decode() on the corrupted output
3. Fails with: binascii.Error: Incorrect padding

This reproduces the exact error seen in production logs during
trajectory capture.

Co-authored-by: openhands <openhands@all-hands.dev>
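
A minimal sketch of how duplicated polling output breaks base64 decoding. The payload, the chunk boundaries, and the `try_decode` helper are illustrative assumptions and are not taken from the test in this PR; only `base64.b64decode` and `binascii.Error` are standard library names.

```python
import base64
import binascii

# 51 bytes encode to exactly 68 base64 characters with no '=' padding.
payload = bytes(range(51))
encoded = base64.b64encode(payload).decode("ascii")

# Assume the output arrived as three events (chunk sizes chosen arbitrarily).
chunks = [encoded[:17], encoded[17:34], encoded[34:]]

# Buggy polling: every poll re-appends all chunks seen so far.
corrupted = ""
for i in range(1, len(chunks) + 1):
    corrupted += "".join(chunks[:i])


def try_decode(text):
    """Return the decoded bytes, or the binascii.Error if decoding fails."""
    try:
        return base64.b64decode(text)
    except binascii.Error as exc:
        return exc


print(len(encoded), len(corrupted))  # 68 119
print(try_decode(encoded) == payload)
print(repr(try_decode(corrupted)))
```

The corrupted string is 17 + 34 + 68 = 119 characters long, which is not a multiple of 4, so decoding raises `binascii.Error` instead of returning the payload.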
The bash events search API returns ALL events from the beginning on each
poll iteration. Without deduplication, output gets duplicated across polls:
- Poll 1: [A] → append A
- Poll 2: [A, B] → append A, B (A duplicated!)
- Poll 3: [A, B, C] → append A, B, C (A, B duplicated again!)

This caused base64 decoding failures in trajectory capture because the
duplicated output length was no longer a multiple of 4:
- Original: 68 chars (valid)
- Duplicated: 119 chars (119 % 4 = 3 → 'Incorrect padding' error)

Fix: Track seen event IDs and skip duplicates. Events without an ID
are processed without deduplication (backwards compatibility).

Co-authored-by: openhands <openhands@all-hands.dev>
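
The client-side approach in this commit could look roughly like the sketch below. The event shape (dicts with "id" and "stdout" keys) and the `collect_output` name are assumptions for illustration; the real event model in the SDK may differ.

```python
def collect_output(polls):
    """Join stdout across polls, skipping events whose ID was already seen."""
    seen_ids = set()
    parts = []
    for events in polls:
        for event in events:
            event_id = event.get("id")
            if event_id is not None:
                if event_id in seen_ids:
                    continue  # already appended in an earlier poll
                seen_ids.add(event_id)
            # Events without an ID are appended without deduplication.
            parts.append(event.get("stdout") or "")
    return "".join(parts)


a = {"id": "1", "stdout": "A"}
b = {"id": "2", "stdout": "B"}
c = {"id": "3", "stdout": "C"}
print(collect_output([[a], [a, b], [a, b, c]]))  # ABC
```
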
This test demonstrates a bug where PS1 metadata blocks get corrupted
when command output (like grunt's ASCII cat art) is interleaved with
the PS1 prompt output.

The bug causes 'Expected at least one PS1 metadata block, but got 0'
errors in production (seen in eval-21310432128-claude-son).

Root cause: The PS1 regex uses non-greedy matching, but a non-greedy
match still starts at the leftmost ###PS1JSON###. When ASCII art
corrupts the first block, the match spans from the FIRST ###PS1JSON###
to the ONLY ###PS1END###. This creates one giant invalid match that
fails JSON parsing.

Co-authored-by: openhands <openhands@all-hands.dev>
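
The root cause can be illustrated with a small, self-contained example. The pattern, the sample terminal output, and the `is_valid_json` helper below are assumptions made for this sketch; the actual regex in metadata.py may be written differently. Only the two marker strings come from the commit message.

```python
import json
import re

# Hypothetical pattern: non-greedy body between the two markers.
PS1_PATTERN = re.compile(r"###PS1JSON###(.*?)###PS1END###", re.DOTALL)

# The first block loses its end marker because ASCII art interrupts it;
# only the second block is intact.
terminal_output = (
    '###PS1JSON###\n{"exit_code": 0, "wor'
    "\n /\\_/\\  (ASCII art)\n"
    '###PS1JSON###\n{"exit_code": 0}\n###PS1END###\n'
)

matches = PS1_PATTERN.findall(terminal_output)


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# One match, starting at the FIRST begin marker and swallowing the second.
print(len(matches))
print("###PS1JSON###" in matches[0])
print(is_valid_json(matches[0]))
```
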
- Remove unused imports (re, patch, CMD_OUTPUT_PS1_BEGIN, CMD_OUTPUT_PS1_END)
- Fix line length issues in assertion messages and test data

Co-authored-by: openhands <openhands@all-hands.dev>
Instead of client-side deduplication, add server-side filtering:

API changes (bash_router.py, bash_service.py):
- Add order__gt query parameter to /api/bash/bash_events/search
- Filter BashOutput events where order > order__gt
- More efficient: reduces data transfer on each poll

Client changes (remote_workspace_mixin.py):
- Track last_order seen (starts at -1)
- Pass order__gt parameter on subsequent polls
- First poll gets all events, subsequent polls get only new ones

This is more efficient than client-side deduplication because:
- Less data transferred over the network
- Server does the filtering instead of client
- No need to track seen event IDs

Co-authored-by: openhands <openhands@all-hands.dev>
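
A rough sketch of the order__gt idea, under stated assumptions: `BashOutput` here is a simplified dataclass, `search_events` is a hypothetical in-memory stand-in for the /api/bash/bash_events/search endpoint, and `poll_output` is an illustrative client loop rather than the code in remote_workspace_mixin.py.

```python
from dataclasses import dataclass


@dataclass
class BashOutput:
    order: int
    stdout: str


def search_events(all_events, order__gt=None):
    """Stand-in for the server: return only events with order > order__gt."""
    if order__gt is None:
        return list(all_events)
    return [e for e in all_events if e.order > order__gt]


def poll_output(snapshots):
    """Client loop: remember the last order seen and ask only for newer events."""
    last_order = -1
    parts = []
    for all_events in snapshots:
        # First poll sends no filter; later polls pass the last order seen.
        order_filter = None if last_order < 0 else last_order
        for event in search_events(all_events, order__gt=order_filter):
            parts.append(event.stdout)
            last_order = max(last_order, event.order)
    return "".join(parts)


a, b, c = BashOutput(0, "A"), BashOutput(1, "B"), BashOutput(2, "C")
print(poll_output([[a], [a, b], [a, b, c]]))  # ABC
```
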
The API should prevent duplicates via order__gt filtering, but add a
client-side assertion as a safety check. If the API ever returns a
duplicate event, we fail fast with a clear error message rather than
silently corrupting output.

Also adds test_assertion_fires_on_duplicate_events to verify this
behavior.

Co-authored-by: openhands <openhands@all-hands.dev>
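
The safety check could be sketched as follows. Keying duplicates on an integer order value, the tuple event shape, and the `append_new_events` name are all assumptions for illustration; the commit message does not say which field the real assertion inspects.

```python
def append_new_events(events, seen_orders, parts):
    """Append stdout from each event, failing fast if one arrives twice."""
    for order, stdout in events:
        # The server-side filter should make this impossible; if it happens
        # anyway, stop rather than silently corrupt the output.
        assert order not in seen_orders, f"Duplicate event received: order={order}"
        seen_orders.add(order)
        parts.append(stdout)


parts = []
seen_orders = set()
append_new_events([(0, "A"), (1, "B")], seen_orders, parts)
print("".join(parts))  # AB
```

Passing an event with an order that is already in `seen_orders` raises `AssertionError`. Note that Python strips `assert` statements when run with `-O`, so an explicit exception may be preferable if the check must always run.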
When programs like grunt output ASCII art (e.g., their cat mascot), it can
interrupt the PS1 JSON block being printed to the terminal. This causes
the regex to match from the FIRST ###PS1JSON### to the ONLY ###PS1END###,
creating one giant match with corrupted content that fails JSON parsing.

The fix:
- Detect when a regex match contains a nested ###PS1JSON### marker
- Extract the content after the LAST marker (which is the valid JSON)
- Use a _SyntheticMatch class to provide a match-like interface

This recovers valid PS1 blocks even when earlier blocks are corrupted,
preventing 'Expected at least one PS1 metadata block, but got 0' errors.

Fixes eval job errors like eval-21310432128-claude-son where grunt's
ASCII cat art was causing PS1 parsing failures.

Co-authored-by: openhands <openhands@all-hands.dev>
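
The recovery step can be sketched in a simplified form. This version returns parsed JSON objects directly and does not reproduce the `_SyntheticMatch` class from the commit; the pattern, the `PS1_BEGIN`/`PS1_END` constant names, and `extract_ps1_blocks` are hypothetical names chosen for this example.

```python
import json
import re

PS1_BEGIN = "###PS1JSON###"
PS1_END = "###PS1END###"
PS1_PATTERN = re.compile(
    re.escape(PS1_BEGIN) + r"(.*?)" + re.escape(PS1_END), re.DOTALL
)


def extract_ps1_blocks(output):
    """Parse PS1 JSON blocks, recovering the last block of a corrupted match."""
    blocks = []
    for match in PS1_PATTERN.finditer(output):
        content = match.group(1)
        if PS1_BEGIN in content:
            # A nested begin marker means an earlier block was interrupted:
            # keep only the text after the LAST marker.
            content = content.rsplit(PS1_BEGIN, 1)[1]
        try:
            blocks.append(json.loads(content.strip()))
        except json.JSONDecodeError:
            continue  # still unparseable, skip this match
    return blocks


corrupted_output = (
    '###PS1JSON###\n{"exit_code": 0, "wor'
    "\n /\\_/\\  (ASCII art)\n"
    '###PS1JSON###\n{"exit_code": 0}\n###PS1END###\n'
)
print(extract_ps1_blocks(corrupted_output))  # [{'exit_code': 0}]
```
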

github-actions bot commented Jan 24, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-agent-server/openhands/agent_server | | | | |
| bash_router.py | 43 | 12 | 72% | 78–81, 90–91, 105–110 |
| bash_service.py | 188 | 22 | 88% | 69–71, 140–141, 143–144, 174–176, 253, 258–259, 285–286, 313–314, 316, 323–324, 354–355 |
| openhands-sdk/openhands/sdk/workspace/remote | | | | |
| remote_workspace_mixin.py | 118 | 9 | 92% | 308, 314–317, 335, 341–343 |
| openhands-tools/openhands/tools/terminal | | | | |
| metadata.py | 78 | 27 | 65% | 33–34, 37, 39, 44, 46–47, 50, 53, 116, 120–121, 123–124, 126, 129–130, 132–133, 137–139, 141, 145, 163–164, 168 |
| TOTAL | 17455 | 4636 | 73% | |

openhands-ai bot commented Feb 5, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Agent Server
  • There are merge conflicts

If you'd like me to help, just leave a comment, like

@OpenHands please fix the merge conflicts on PR #1818 at branch `eval/swe-bench-multimodal-with-ps1-fix`

or

@OpenHands please fix the failing actions on PR #1818 at branch `eval/swe-bench-multimodal-with-ps1-fix`

Feel free to include any additional details that might help me get this PR into a better state.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig closed this Feb 5, 2026
@neubig neubig reopened this Feb 5, 2026
@neubig neubig closed this Feb 6, 2026