DRAFT: Combined branch for swe-bench-multimodal evaluation#1818
Closed
DRAFT: Combined branch for swe-bench-multimodal evaluation#1818
Conversation
This adds tests that demonstrate a bug in RemoteWorkspaceMixin where the polling loop duplicates stdout/stderr output across multiple poll iterations. The bug occurs because: 1. The polling loop fetches ALL events from the beginning on each iteration 2. Events are appended to stdout_parts without deduplication 3. This causes output like: A + B + A + B + C + A + B + C + D This bug causes base64 decoding failures when capturing conversation trajectories, as the duplicated base64 output becomes invalid: - 'Incorrect padding' - 'Invalid base64-encoded string: number of data characters cannot be 1 more than a multiple of 4' The tests document the bug and will help verify the fix. Co-authored-by: openhands <openhands@all-hands.dev>
Remove unused variables flagged by ruff. Co-authored-by: openhands <openhands@all-hands.dev>
Tests now assert the CORRECT expected behavior: - Output should be deduplicated across poll iterations - Base64 output should decode correctly These tests FAIL because the bug exists, demonstrating: - Expected: 'CHUNK1CHUNK2CHUNK3' - Actual: 'CHUNK1CHUNK1CHUNK2CHUNK1CHUNK2CHUNK3' When the fix is implemented, these tests will pass. Co-authored-by: openhands <openhands@all-hands.dev>
Added test_base64_decode_produces_incorrect_padding_error which: 1. Simulates the polling loop with duplicated events 2. Calls base64.b64decode() on the corrupted output 3. Fails with: binascii.Error: Incorrect padding This reproduces the exact error seen in production logs during trajectory capture. Co-authored-by: openhands <openhands@all-hands.dev>
The bash events search API returns ALL events from the beginning on each poll iteration. Without deduplication, output gets duplicated across polls: - Poll 1: [A] → append A - Poll 2: [A, B] → append A, B (A duplicated!) - Poll 3: [A, B, C] → append A, B, C (A, B duplicated again!) This caused base64 decoding failures in trajectory capture because the duplicated output length was no longer a multiple of 4: - Original: 68 chars (valid) - Duplicated: 119 chars (119 % 4 = 3 → 'Incorrect padding' error) Fix: Track seen event IDs and skip duplicates. Events without an ID are processed without deduplication (backwards compatibility). Co-authored-by: openhands <openhands@all-hands.dev>
This test demonstrates a bug where PS1 metadata blocks get corrupted when command output (like grunt's ASCII cat art) is interleaved with the PS1 prompt output. The bug causes 'Expected at least one PS1 metadata block, but got 0' errors in production (seen in eval-21310432128-claude-son). Root cause: The PS1 regex uses non-greedy matching which spans from the FIRST ###PS1JSON### to the ONLY ###PS1END###, even when ASCII art corrupts the first block. This creates one giant invalid match that fails JSON parsing. Co-authored-by: openhands <openhands@all-hands.dev>
- Remove unused imports (re, patch, CMD_OUTPUT_PS1_BEGIN, CMD_OUTPUT_PS1_END) - Fix line length issues in assertion messages and test data Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Instead of client-side deduplication, add server-side filtering: API changes (bash_router.py, bash_service.py): - Add order__gt query parameter to /api/bash/bash_events/search - Filter BashOutput events where order > order__gt - More efficient: reduces data transfer on each poll Client changes (remote_workspace_mixin.py): - Track last_order seen (starts at -1) - Pass order__gt parameter on subsequent polls - First poll gets all events, subsequent polls get only new ones This is more efficient than client-side deduplication because: - Less data transferred over the network - Server does the filtering instead of client - No need to track seen event IDs Co-authored-by: openhands <openhands@all-hands.dev>
The API should prevent duplicates via order__gt filtering, but add a client-side assertion as a safety check. If the API ever returns a duplicate event, we fail fast with a clear error message rather than silently corrupting output. Also adds test_assertion_fires_on_duplicate_events to verify this behavior. Co-authored-by: openhands <openhands@all-hands.dev>
When programs like grunt output ASCII art (e.g., their cat mascot), it can interrupt the PS1 JSON block being printed to the terminal. This causes the regex to match from the FIRST ###PS1JSON### to the ONLY ###PS1END###, creating one giant match with corrupted content that fails JSON parsing. The fix: - Detect when a regex match contains a nested ###PS1JSON### marker - Extract the content after the LAST marker (which is the valid JSON) - Use a _SyntheticMatch class to provide a match-like interface This recovers valid PS1 blocks even when earlier blocks are corrupted, preventing 'Expected at least one PS1 metadata block, but got 0' errors. Fixes eval job errors like eval-21310432128-claude-son where grunt's ASCII cat art was causing PS1 parsing failures. Co-authored-by: openhands <openhands@all-hands.dev>
…al/swe-bench-multimodal-with-ps1-fix
Contributor
Coverage Report •
|
|||||||||||||||||||||||||||||||||||||||||||||
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like or Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
Co-authored-by: openhands <openhands@all-hands.dev>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This is a combined branch for running swe-bench-multimodal evaluation that includes:
fix/polling-output-duplication-bug- Working branch for swe-bench-multimodal evaluationfix/ps1-corruption-test- Fix for PS1 metadata corruption from ASCII art in command outputIncluded Fixes
PS1 Corruption Fix (#1817)
Handles cases where programs like grunt output ASCII art that interrupts PS1 JSON blocks, causing "Expected at least one PS1 metadata block, but got 0" errors.
Polling Output Duplication Fix (#1816)
Fixes for the swe-bench-multimodal evaluation workflow.
Usage
This branch is intended for evaluation runs only. Do not merge - instead merge the individual PRs (#1816 and #1817) separately after review.
@neubig can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:8bdb1a9-pythonRun
All tags pushed for this build
About Multi-Architecture Support
8bdb1a9-python) is a multi-arch manifest supporting both amd64 and arm648bdb1a9-python-amd64) are also available if needed