-
Notifications
You must be signed in to change notification settings - Fork 117
fix: handle PS1 metadata corruption in command output #1817
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This test demonstrates a bug where PS1 metadata blocks get corrupted when command output (like grunt's ASCII cat art) is interleaved with the PS1 prompt output. The bug causes 'Expected at least one PS1 metadata block, but got 0' errors in production (seen in eval-21310432128-claude-son). Root cause: The PS1 regex uses non-greedy matching which spans from the FIRST ###PS1JSON### to the ONLY ###PS1END###, even when ASCII art corrupts the first block. This creates one giant invalid match that fails JSON parsing. Co-authored-by: openhands <openhands@all-hands.dev>
- Remove unused imports (re, patch, CMD_OUTPUT_PS1_BEGIN, CMD_OUTPUT_PS1_END) - Fix line length issues in assertion messages and test data Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
When programs like grunt output ASCII art (e.g., their cat mascot), it can interrupt the PS1 JSON block being printed to the terminal. This causes the regex to match from the FIRST ###PS1JSON### to the ONLY ###PS1END###, creating one giant match with corrupted content that fails JSON parsing. The fix: - Detect when a regex match contains a nested ###PS1JSON### marker - Extract the content after the LAST marker (which is the valid JSON) - Use a _SyntheticMatch class to provide a match-like interface This recovers valid PS1 blocks even when earlier blocks are corrupted, preventing 'Expected at least one PS1 metadata block, but got 0' errors. Fixes eval job errors like eval-21310432128-claude-son where grunt's ASCII cat art was causing PS1 parsing failures. Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands it is not very clear from the PR description what it means to have "PS1 blocks corrupted by ascii art". What is the full scope of things that can be corrupted and what is the underlying mechanism that caused this to happen? For evidence that this happened in the past, check previous PRs to OpenHands/openhands-index-results and download the results from SWE-Bench-Multimodal for claude-sonnet-4.5. The evidence is in one of those sets of results from last week (probably Thursday or Friday). Find more examples and also understand the causal mechanism, and update the PR description to be more precise. |
|
I'm on it! neubig can track my progress at all-hands.dev |
SummaryI investigated the PS1 corruption issue and updated the PR description (#1817) with more precise details about: What I Found
Changes Made
No code changes were made - only the PR description was updated to be more precise and include evidence from the SWE-Bench-Multimodal evaluation results. PR Link: #1817 |
all-hands-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found one critical issue with position handling and a few minor improvements needed. See inline comments for details.
| def start(self, group: int = 0) -> int: | ||
| return self._original_match.start(group) | ||
|
|
||
| def end(self, group: int = 0) -> int: | ||
| return self._original_match.end(group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 Critical: These methods return positions from the original corrupted match, not the actual position of the recovered content in the string. This is a critical bug because terminal_session.py uses these positions to slice terminal content:
# From terminal_session.py:
terminal_content[ps1_matches[0].start():ps1_matches[0].end()]
terminal_content[ps1_matches[i].end() + 1 : ps1_matches[i + 1].start()]With incorrect positions, this will extract the wrong content and break command output parsing.
Suggested fix: Calculate and store the actual positions when creating the synthetic match:
| def start(self, group: int = 0) -> int: | |
| return self._original_match.start(group) | |
| def end(self, group: int = 0) -> int: | |
| return self._original_match.end(group) | |
| def start(self, group: int = 0) -> int: | |
| # Return the actual position of recovered content in original string | |
| if group == 0: | |
| return self._actual_start | |
| elif group == 1: | |
| return self._actual_start + len(CMD_OUTPUT_PS1_BEGIN.strip()) + 1 | |
| raise IndexError(f"no such group: {group}") | |
| def end(self, group: int = 0) -> int: | |
| # Return the actual end position of recovered content | |
| if group == 0: | |
| return self._actual_end | |
| elif group == 1: | |
| return self._actual_end - len(CMD_OUTPUT_PS1_END.strip()) - 1 | |
| raise IndexError(f"no such group: {group}") |
You would need to modify the constructor to accept and store _actual_start and _actual_end positions, calculated from original_match.start(1) + last_marker_pos when creating the synthetic match on line 132.
| matches.append(_SyntheticMatch(last_block_content, match)) | ||
| logger.debug( | ||
| "Recovered valid PS1 block from corrupted " | ||
| f"output: {last_block_content[:80]}..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟢 Nit: The truncation adds "..." without checking if the content is actually longer than 80 characters.
| f"output: {last_block_content[:80]}..." | |
| f"output: {last_block_content[:80]}{'...' if len(last_block_content) > 80 else ''}" |
|
|
||
| logger.debug( | ||
| f"Failed to parse PS1 metadata - Skipping: [{match.group(1)}]" | ||
| f"Failed to parse PS1 metadata - Skipping: [{content[:200]}...]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟢 Nit: Same issue - adds "..." without checking length first.
| f"Failed to parse PS1 metadata - Skipping: [{content[:200]}...]" | |
| f"Failed to parse PS1 metadata - Skipping: [{content[:200]}{'...' if len(content) > 200 else ''}]" |
| """ | ||
| Test that grunt's ASCII cat art causes ZERO valid PS1 matches. | ||
| This is the ROOT CAUSE of the production error: | ||
| "Expected at least one PS1 metadata block, but got 0." | ||
| The regex matches one block but the content fails JSON parsing | ||
| because it contains ASCII art and a nested ###PS1JSON### marker. | ||
| This test FAILS because the current behavior is broken - we get | ||
| 0 matches when there IS a valid JSON block at the end of the output. | ||
| A fix should return at least 1 valid match. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 Suggestion: This docstring is confusing because it says "This test FAILS because the current behavior is broken" but the test actually asserts that the fix IS working (expects >= 1 matches).
Update the docstring to reflect that this test verifies the FIX works correctly:
| """ | |
| Test that grunt's ASCII cat art causes ZERO valid PS1 matches. | |
| This is the ROOT CAUSE of the production error: | |
| "Expected at least one PS1 metadata block, but got 0." | |
| The regex matches one block but the content fails JSON parsing | |
| because it contains ASCII art and a nested ###PS1JSON### marker. | |
| This test FAILS because the current behavior is broken - we get | |
| 0 matches when there IS a valid JSON block at the end of the output. | |
| A fix should return at least 1 valid match. | |
| """ | |
| """ | |
| Test that the fix correctly recovers valid PS1 blocks from corrupted output. | |
| When grunt's ASCII cat art corrupts the first PS1 block, the fix should | |
| detect the nested ###PS1JSON### marker and extract the valid second block. | |
| This is the ROOT CAUSE of the production error: | |
| "Expected at least one PS1 metadata block, but got 0." | |
| The fix handles this by finding the LAST ###PS1JSON### marker and extracting | |
| the valid JSON that follows it. | |
| """ |
|
@OpenHands /codereview-roasted can you checkout a new temporary branch from this PR (but do NOT push!), then try to remove all the implementations from this PR. Then tell me, can you reproduce the test failure mentioned in the PR description? |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
Summary: Test Failure Reproduction Confirmed ✅I followed your instructions to verify whether the test failure mentioned in the PR description can be reproduced: What I Did:
Result: YES, the test failure can be reproducedWhen the fix is removed, the test This matches exactly what the PR description claims:
Conclusion:The PR's fix is legitimate and addresses a real bug. The |
|
@OpenHands read the critical comment from all-hands-bot and write a test to reproduce the "critical bug" |
|
I'm on it! neubig can track my progress at all-hands.dev |
Add TestSyntheticMatchPositionBug class with tests that demonstrate the critical bug where _SyntheticMatch.start() and .end() return positions from the original corrupted match instead of the recovered content position. This causes terminal_session.py to extract wrong content when slicing terminal_content using match positions. See: #1817 (comment) Co-authored-by: openhands <openhands@all-hands.dev>
SummaryI read the critical comment from all-hands-bot on PR #1817 and wrote a test to reproduce the "critical bug" as requested. The Critical BugThe all-hands-bot identified that the # From terminal_session.py:
terminal_content[ps1_matches[0].start():ps1_matches[0].end()]
terminal_content[ps1_matches[i].end() + 1 : ps1_matches[i + 1].start()]With incorrect positions, this extracts the wrong content and breaks command output parsing. Changes MadeAdded a new test class
Test ResultsThe first test fails as expected, clearly demonstrating the critical bug with a detailed error message showing the mismatch between expected and actual content. CommitThe changes have been committed and pushed to the
The PR #1817 has been updated with this new test that reproduces the critical bug identified by all-hands-bot. |
Tests should fail when demonstrating undesirable behavior. Updated test_synthetic_match_start_returns_correct_position and test_synthetic_match_end_returns_correct_position to assert the CORRECT behavior, which will fail until the bug is fixed. Co-authored-by: openhands <openhands@all-hands.dev>
Renamed test_synthetic_match_positions_are_incorrect to test_slicing_with_match_positions_returns_group_zero to clearly describe the correct behavior we expect (which will pass when fixed). Co-authored-by: openhands <openhands@all-hands.dev>
Remove redundant tests. The single test_slicing_with_match_positions_returns_group_zero test covers both start() and end() by verifying the fundamental contract that terminal_content[match.start():match.end()] == match.group(0). Co-authored-by: openhands <openhands@all-hands.dev>
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
- Fix _SyntheticMatch.start() and .end() to return correct positions for the recovered content instead of the original corrupted match - Use original (unstripped) group(1) content to find nested marker position - Add test_synthetic_match_slicing_returns_group_zero to verify the fix - Consolidate test into a single function instead of a class The fix ensures that terminal_content[match.start():match.end()] equals match.group(0), which is the fundamental contract of a match object. Co-authored-by: openhands <openhands@all-hands.dev>
Summary
This PR fixes a bug where PS1 metadata blocks get corrupted when programs output content (like ASCII art) concurrently with the shell's PS1 prompt rendering, causing evaluation failures.
What is PS1 Metadata?
The terminal uses a custom PS1 prompt that outputs JSON metadata between special markers. This metadata includes:
exit_code: The exit code of the last commandworking_dir: Current working directoryusername: Current userhostname: Machine hostnamepy_interpreter_path: Path to Python interpreterExample of a valid PS1 block:
The actual markers are:
###PS1JSON###and###PS1END###The Corruption Mechanism
Root Cause: Terminal Output Race Condition
When a program outputs content to stdout while the shell is simultaneously rendering the PS1 prompt, the outputs get interleaved due to terminal buffering. This is a race condition between:
What Can Cause Corruption
Any program that outputs to stdout while the PS1 prompt is being rendered:
Evidence from Production
Found in SWE-Bench-Multimodal evaluation results for claude-sonnet-4.5 (Jan 24, 2026):
eval-21318221216-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_26-01-24-19-46.tar.gzprocessing__p5.js-3709,processing__p5.js-5771Example of corrupted output (grunt's ASCII cat interleaved with PS1 metadata):
The
8 /wat the beginning is part of the PS1 JSON ("exit_code": "0"and"working_dir": "/workspace/p5.js"), and the ASCII art is interleaved with it.How the Regex Fails
The PS1 regex uses non-greedy matching:
###PS1JSON###(.*?)###PS1END###When corruption occurs:
###PS1JSON###block starts###PS1JSON###block starts (from the next prompt)###PS1END###marker at the endThe non-greedy regex matches from the FIRST
###PS1JSON###to the ONLY###PS1END###, creating ONE giant match containing:###PS1JSON###markerThis combined content fails JSON parsing → 0 valid matches → AssertionError:
The Fix
Modified
CmdOutputMetadata.matches_ps1_metadata()inmetadata.pyto:###PS1JSON###marker_SyntheticMatchclass provides a match-like interface so existing code works transparentlyTest Coverage
Added comprehensive tests in
test_ps1_corruption.py:test_corrupted_ps1_regex_matches_wrong_content- Documents the regex behaviortest_corrupted_ps1_causes_zero_valid_matches- Now passes (was failing before the fix)test_handle_completed_command_fails_with_corrupted_output- Tests assertion behaviortest_pager_output_causes_zero_ps1_matches- Tests pager scenariosVerification
# All tests pass uv run pytest tests/tools/terminal/test_ps1_corruption.py tests/tools/terminal/test_terminal_ps1_metadata.py -vRelated
eval-21318221216-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_26-01-24-19-46.tar.gzprocessing__p5.js-3709,processing__p5.js-5771Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:5d191ba-pythonRun
All tags pushed for this build
About Multi-Architecture Support
5d191ba-python) is a multi-arch manifest supporting both amd64 and arm645d191ba-python-amd64) are also available if needed