@neubig
Contributor

neubig commented Jan 24, 2026

Summary

This PR fixes a bug where PS1 metadata blocks get corrupted when programs output content (like ASCII art) concurrently with the shell's PS1 prompt rendering, causing evaluation failures.

What is PS1 Metadata?

The terminal uses a custom PS1 prompt that outputs JSON metadata between special markers. This metadata includes:

  • exit_code: The exit code of the last command
  • working_dir: Current working directory
  • username: Current user
  • hostname: Machine hostname
  • py_interpreter_path: Path to Python interpreter

Example of a valid PS1 block:

<PS1JSON marker>
{
  "pid": "123",
  "exit_code": "0",
  "username": "openhands",
  "hostname": "runtime-abc123",
  "working_dir": "/workspace/p5.js",
  "py_interpreter_path": "/usr/bin/python"
}
<PS1END marker>

The actual markers are: ###PS1JSON### and ###PS1END###
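
A minimal sketch of how such a block might be extracted, assuming the two markers quoted above. The names `PS1_PATTERN` and `parse_ps1_blocks` are hypothetical, and the `re.DOTALL` flag is an assumption made so the pattern can span the multi-line JSON; this is not the code in metadata.py.

```python
import json
import re

PS1_BEGIN = "###PS1JSON###"
PS1_END = "###PS1END###"
# re.DOTALL lets "." span the newlines inside the JSON block (assumed flag).
PS1_PATTERN = re.compile(
    re.escape(PS1_BEGIN) + r"(.*?)" + re.escape(PS1_END), re.DOTALL
)


def parse_ps1_blocks(terminal_output: str) -> list[dict]:
    """Return the JSON payload of every well-formed PS1 block."""
    blocks = []
    for match in PS1_PATTERN.finditer(terminal_output):
        try:
            blocks.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip blocks whose payload is not valid JSON
    return blocks


sample = """###PS1JSON###
{
  "pid": "123",
  "exit_code": "0",
  "working_dir": "/workspace/p5.js"
}
###PS1END###"""
print(parse_ps1_blocks(sample)[0]["exit_code"])  # prints 0
```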

The Corruption Mechanism

Root Cause: Terminal Output Race Condition

When a program outputs content to stdout while the shell is simultaneously rendering the PS1 prompt, the outputs get interleaved due to terminal buffering. This is a race condition between:

  1. The shell writing the PS1 prompt (with JSON metadata)
  2. The program writing its output (ASCII art, progress bars, etc.)

What Can Cause Corruption

Any program that outputs to stdout while the PS1 prompt is being rendered:

  • ASCII art (e.g., grunt's cat mascot)
  • Progress bars and spinners
  • ANSI escape sequences
  • Any concurrent stdout output

Evidence from Production

Found in SWE-Bench-Multimodal evaluation results for claude-sonnet-4.5 (Jan 24, 2026):

  • Archive: eval-21318221216-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_26-01-24-19-46.tar.gz
  • Affected instances: processing__p5.js-3709, processing__p5.js-5771

Example of corrupted output (grunt's ASCII cat interleaved with PS1 metadata):

8 /w-_-_-_-_-_,------,npm test 2>&1 | tail -100
 0   -_-_-_-_-_|   /\_/\
 0   -_-_-_-_-^|__( ^ .^)
     -_-_-_-_-  ""  ""

  8 passing (12ms)

Done.

The 8 /w at the beginning is part of the PS1 JSON ("exit_code": "0" and "working_dir": "/workspace/p5.js"), and the ASCII art is interleaved with it.

How the Regex Fails

The PS1 regex uses non-greedy matching: ###PS1JSON###(.*?)###PS1END###

When corruption occurs:

  1. First ###PS1JSON### block starts
  2. JSON fields begin but get corrupted by interleaved output
  3. Second ###PS1JSON### block starts (from the next prompt)
  4. Complete valid JSON
  5. Single ###PS1END### marker at the end

The non-greedy regex matches from the FIRST ###PS1JSON### to the ONLY ###PS1END###, creating ONE giant match containing:

  • Corrupted first JSON block
  • ASCII art / command output
  • Nested ###PS1JSON### marker
  • Valid second JSON block

This combined content fails JSON parsing → 0 valid matches → AssertionError:

AssertionError: Expected at least one PS1 metadata block, but got 0.
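
The failure can be reproduced in isolation. The snippet below is a sketch using a hand-made corrupted string (not the actual evaluation log) and assumes the pattern is compiled with `re.DOTALL`: the single match spans both blocks, contains a nested marker, and its payload is rejected by the JSON parser.

```python
import json
import re

PATTERN = re.compile(r"###PS1JSON###(.*?)###PS1END###", re.DOTALL)

# Hypothetical corrupted output: the first block is cut off by ASCII art,
# so only the second block is followed by an end marker.
corrupted = (
    '###PS1JSON###\n{\n  "pid": "123",\n  "exit_code": "0'
    " -_-_-_-_-_|   /\\_/\\\n"
    '###PS1JSON###\n{"pid": "456", "exit_code": "0"}\n###PS1END###'
)

matches = list(PATTERN.finditer(corrupted))
span_has_nested_marker = "###PS1JSON###" in matches[0].group(1)

valid = []
for m in matches:
    try:
        valid.append(json.loads(m.group(1)))
    except json.JSONDecodeError:
        pass  # the combined span is not valid JSON

print(len(matches), span_has_nested_marker, len(valid))  # prints: 1 True 0
```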

The Fix

Modified CmdOutputMetadata.matches_ps1_metadata() in metadata.py to:

  1. Detect nested markers: When a regex match fails JSON parsing, check if it contains a nested ###PS1JSON### marker
  2. Extract the last valid block: Find the LAST occurrence of the marker and extract content after it
  3. Use a synthetic match: The _SyntheticMatch class provides a match-like interface so existing code works transparently
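
The first two steps can be sketched as follows. This is an illustration under stated assumptions, not the code from metadata.py: `recover_ps1_payloads` is a hypothetical helper that returns parsed payloads directly instead of match objects, and the input string is hand-made.

```python
import json
import re

BEGIN = "###PS1JSON###"
PATTERN = re.compile(re.escape(BEGIN) + r"(.*?)###PS1END###", re.DOTALL)


def recover_ps1_payloads(text: str) -> list[dict]:
    """Parse each match; on failure, retry with the text after the last nested marker."""
    payloads = []
    for match in PATTERN.finditer(text):
        content = match.group(1)
        try:
            payloads.append(json.loads(content))
            continue
        except json.JSONDecodeError:
            pass
        last_marker = content.rfind(BEGIN)
        if last_marker == -1:
            continue  # no nested marker, nothing to recover
        tail = content[last_marker + len(BEGIN):]
        try:
            payloads.append(json.loads(tail))
        except json.JSONDecodeError:
            continue  # the last block is corrupted too
    return payloads


corrupted = (
    '###PS1JSON###\n{\n  "pid": "123",\n  "exit_code": "0'
    " -_-_-_-_-_|   /\\_/\\\n"
    '###PS1JSON###\n{"pid": "456", "exit_code": "0"}\n###PS1END###'
)
print(recover_ps1_payloads(corrupted))
```

Searching from the right with `rfind` reflects the assumption that the block closest to the end marker is the one the shell finished writing.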

Test Coverage

Added comprehensive tests in test_ps1_corruption.py:

  • test_corrupted_ps1_regex_matches_wrong_content - Documents the regex behavior
  • test_corrupted_ps1_causes_zero_valid_matches - Now passes (was failing before the fix)
  • test_handle_completed_command_fails_with_corrupted_output - Tests assertion behavior
  • test_pager_output_causes_zero_ps1_matches - Tests pager scenarios
  • Several other edge case tests

Verification

# All tests pass
uv run pytest tests/tools/terminal/test_ps1_corruption.py tests/tools/terminal/test_terminal_ps1_metadata.py -v

Related

  • SWE-Bench-Multimodal results: eval-21318221216-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_26-01-24-19-46.tar.gz
  • Affected instances: processing__p5.js-3709, processing__p5.js-5771

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image                                   Docs / Tags
java      amd64, arm64    eclipse-temurin:17-jdk                       Link
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22   Link
golang    amd64, arm64    golang:1.21-bookworm                         Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:5d191ba-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-5d191ba-python \
  ghcr.io/openhands/agent-server:5d191ba-python

All tags pushed for this build

ghcr.io/openhands/agent-server:5d191ba-golang-amd64
ghcr.io/openhands/agent-server:5d191ba-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:5d191ba-golang-arm64
ghcr.io/openhands/agent-server:5d191ba-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:5d191ba-java-amd64
ghcr.io/openhands/agent-server:5d191ba-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:5d191ba-java-arm64
ghcr.io/openhands/agent-server:5d191ba-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:5d191ba-python-amd64
ghcr.io/openhands/agent-server:5d191ba-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:5d191ba-python-arm64
ghcr.io/openhands/agent-server:5d191ba-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:5d191ba-golang
ghcr.io/openhands/agent-server:5d191ba-java
ghcr.io/openhands/agent-server:5d191ba-python

About Multi-Architecture Support

  • Each variant tag (e.g., 5d191ba-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 5d191ba-python-amd64) are also available if needed

This test demonstrates a bug where PS1 metadata blocks get corrupted
when command output (like grunt's ASCII cat art) is interleaved with
the PS1 prompt output.

The bug causes 'Expected at least one PS1 metadata block, but got 0'
errors in production (seen in eval-21310432128-claude-son).

Root cause: The PS1 regex uses non-greedy matching which spans from
the FIRST ###PS1JSON### to the ONLY ###PS1END###, even when ASCII art
corrupts the first block. This creates one giant invalid match that
fails JSON parsing.

Co-authored-by: openhands <openhands@all-hands.dev>
- Remove unused imports (re, patch, CMD_OUTPUT_PS1_BEGIN, CMD_OUTPUT_PS1_END)
- Fix line length issues in assertion messages and test data

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
When programs like grunt output ASCII art (e.g., their cat mascot), it can
interrupt the PS1 JSON block being printed to the terminal. This causes
the regex to match from the FIRST ###PS1JSON### to the ONLY ###PS1END###,
creating one giant match with corrupted content that fails JSON parsing.

The fix:
- Detect when a regex match contains a nested ###PS1JSON### marker
- Extract the content after the LAST marker (which is the valid JSON)
- Use a _SyntheticMatch class to provide a match-like interface

This recovers valid PS1 blocks even when earlier blocks are corrupted,
preventing 'Expected at least one PS1 metadata block, but got 0' errors.

Fixes eval job errors like eval-21310432128-claude-son where grunt's
ASCII cat art was causing PS1 parsing failures.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig changed the title from "test: add failing test for PS1 metadata corruption bug" to "fix: handle PS1 metadata corruption from ASCII art in command output" on Jan 24, 2026
@github-actions
Contributor

github-actions bot commented Jan 24, 2026

Coverage

Coverage Report

File                                        Stmts   Miss  Cover  Missing
openhands-tools/openhands/tools/terminal
   metadata.py                                 89     38    57%  35, 39, 41, 43, 46, 48, 53, 55–56, 59–61, 63–64, 67–69, 71–72, 135, 139, 141–142, 144–145, 147, 150–151, 154, 159, 163–165, 167, 171, 189–190, 194
TOTAL                                       17022   8524    49%

@neubig
Contributor Author

neubig commented Jan 26, 2026

@OpenHands it is not very clear from the PR description what it means to have "PS1 blocks corrupted by ascii art". What is the full scope of things that can be corrupted and what is the underlying mechanism that caused this to happen? For evidence that this happened in the past, check previous PRs to OpenHands/openhands-index-results and download the results from SWE-Bench-Multimodal for claude-sonnet-4.5. The evidence is in one of those sets of results from last week (probably Thursday or Friday). Find more examples and also understand the causal mechanism, and update the PR description to be more precise.

@openhands-ai

openhands-ai bot commented Jan 26, 2026

I'm on it! neubig can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Jan 26, 2026

Summary

I investigated the PS1 corruption issue and updated the PR description (#1817) with more precise details, summarized below.

What I Found

  1. Evidence Location: Downloaded and analyzed the SWE-Bench-Multimodal evaluation results from eval-21318221216-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_26-01-24-19-46.tar.gz (Jan 24, 2026).

  2. Affected Instances: Found corruption in processing__p5.js-3709 and processing__p5.js-5771.

  3. The Corruption Mechanism:

    • Root Cause: A terminal output race condition between the shell rendering the PS1 prompt (with JSON metadata) and programs outputting to stdout (like grunt's ASCII cat art).
    • When these outputs happen concurrently, they get interleaved due to terminal buffering.
    • Example of corrupted output:
      8 /w-_-_-_-_-_,------,npm test 2>&1 | tail -100
       0   -_-_-_-_-_|   /\_/\
       0   -_-_-_-_-^|__( ^ .^)
      
    • The 8 /w is part of the PS1 JSON ("exit_code": "0" and "working_dir": "/workspace/p5.js"), interleaved with grunt's ASCII cat art.
  4. Full Scope of What Can Be Corrupted:

    • ASCII art (e.g., grunt's cat mascot)
    • Progress bars and spinners
    • ANSI escape sequences
    • Any concurrent stdout output
  5. How the Regex Fails: The non-greedy regex ###PS1JSON###(.*?)###PS1END### matches from the FIRST marker to the ONLY end marker, creating one giant match with corrupted JSON that fails parsing → 0 valid matches → AssertionError.

Changes Made

  • Updated the description of PR #1817 (fix: handle PS1 metadata corruption in command output) via the GitHub API with:
    • Clear explanation of what PS1 metadata is
    • Detailed description of the race condition mechanism
    • Real evidence from production (archive URL and affected instances)
    • Example of corrupted output from the evaluation logs
    • Explanation of how the regex fails

No code changes were made - only the PR description was updated to be more precise and include evidence from the SWE-Bench-Multimodal evaluation results.

PR Link: #1817

@neubig changed the title from "fix: handle PS1 metadata corruption from ASCII art in command output" to "fix: handle PS1 metadata corruption in command output" on Jan 26, 2026
@neubig marked this pull request as ready for review January 26, 2026 13:07
@enyst requested a review from xingyaoww January 26, 2026 13:11
Collaborator

@all-hands-bot all-hands-bot left a comment

Found one critical issue with position handling and a few minor improvements needed. See inline comments for details.

Comment on lines 49 to 53
    def start(self, group: int = 0) -> int:
        return self._original_match.start(group)

    def end(self, group: int = 0) -> int:
        return self._original_match.end(group)
Collaborator

🔴 Critical: These methods return positions from the original corrupted match, not the actual position of the recovered content in the string. This is a critical bug because terminal_session.py uses these positions to slice terminal content:

# From terminal_session.py:
terminal_content[ps1_matches[0].start():ps1_matches[0].end()]
terminal_content[ps1_matches[i].end() + 1 : ps1_matches[i + 1].start()]

With incorrect positions, this will extract the wrong content and break command output parsing.

Suggested fix: Calculate and store the actual positions when creating the synthetic match:

Suggested change
-    def start(self, group: int = 0) -> int:
-        return self._original_match.start(group)
-    def end(self, group: int = 0) -> int:
-        return self._original_match.end(group)
+    def start(self, group: int = 0) -> int:
+        # Return the actual position of recovered content in original string
+        if group == 0:
+            return self._actual_start
+        elif group == 1:
+            return self._actual_start + len(CMD_OUTPUT_PS1_BEGIN.strip()) + 1
+        raise IndexError(f"no such group: {group}")
+    def end(self, group: int = 0) -> int:
+        # Return the actual end position of recovered content
+        if group == 0:
+            return self._actual_end
+        elif group == 1:
+            return self._actual_end - len(CMD_OUTPUT_PS1_END.strip()) - 1
+        raise IndexError(f"no such group: {group}")

You would need to modify the constructor to accept and store _actual_start and _actual_end positions, calculated from original_match.start(1) + last_marker_pos when creating the synthetic match on line 132.
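
One way the position bookkeeping could look is sketched below. `RecoveredMatch` and `narrow_to_last_block` are hypothetical names, not the PR's `_SyntheticMatch`, and the sketch assumes markers without surrounding newlines. The idea is to translate the nested marker's offset inside `group(1)` into an absolute index in the searched string.

```python
import re
from dataclasses import dataclass

BEGIN = "###PS1JSON###"
END = "###PS1END###"
PATTERN = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)


@dataclass(frozen=True)
class RecoveredMatch:
    """Hypothetical stand-in for a match narrowed to the last nested block."""

    text: str         # the full string that was searched
    block_start: int  # index of the last BEGIN marker
    block_end: int    # index just past the END marker

    def start(self, group: int = 0) -> int:
        if group == 0:
            return self.block_start
        if group == 1:
            return self.block_start + len(BEGIN)
        raise IndexError(f"no such group: {group}")

    def end(self, group: int = 0) -> int:
        if group == 0:
            return self.block_end
        if group == 1:
            return self.block_end - len(END)
        raise IndexError(f"no such group: {group}")

    def group(self, group: int = 0) -> str:
        return self.text[self.start(group):self.end(group)]


def narrow_to_last_block(match: re.Match) -> RecoveredMatch:
    """Map the nested marker's offset inside group(1) back to the full string."""
    offset = match.group(1).rfind(BEGIN)
    if offset == -1:
        raise ValueError("match has no nested marker")
    return RecoveredMatch(match.string, match.start(1) + offset, match.end(0))


text = 'noise ###PS1JSON###{"a": 1 ~~art~~ ###PS1JSON###{"b": 2}###PS1END### tail'
recovered = narrow_to_last_block(PATTERN.search(text))
print(recovered.group(1))  # prints {"b": 2}
```

Because `group()` is derived from the stored positions, slicing the searched string with `start()` and `end()` always agrees with `group(0)`.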

matches.append(_SyntheticMatch(last_block_content, match))
logger.debug(
"Recovered valid PS1 block from corrupted "
f"output: {last_block_content[:80]}..."
Collaborator

🟢 Nit: The truncation adds "..." without checking if the content is actually longer than 80 characters.

Suggested change
- f"output: {last_block_content[:80]}..."
+ f"output: {last_block_content[:80]}{'...' if len(last_block_content) > 80 else ''}"


logger.debug(
- f"Failed to parse PS1 metadata - Skipping: [{match.group(1)}]"
+ f"Failed to parse PS1 metadata - Skipping: [{content[:200]}...]"
Collaborator

🟢 Nit: Same issue - adds "..." without checking length first.

Suggested change
- f"Failed to parse PS1 metadata - Skipping: [{content[:200]}...]"
+ f"Failed to parse PS1 metadata - Skipping: [{content[:200]}{'...' if len(content) > 200 else ''}]"
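
Since both log messages need the same conditional ellipsis, a small helper is one possible alternative to repeating the inline expression. `truncate` is a hypothetical name and is not part of the PR.

```python
def truncate(text: str, limit: int) -> str:
    """Return text unchanged if it fits, else its first `limit` characters plus '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


print(truncate("short", 80))    # prints short
print(truncate("x" * 100, 8))   # prints xxxxxxxx...
```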

Comment on lines +174 to +186
"""
Test that grunt's ASCII cat art causes ZERO valid PS1 matches.
This is the ROOT CAUSE of the production error:
"Expected at least one PS1 metadata block, but got 0."
The regex matches one block but the content fails JSON parsing
because it contains ASCII art and a nested ###PS1JSON### marker.
This test FAILS because the current behavior is broken - we get
0 matches when there IS a valid JSON block at the end of the output.
A fix should return at least 1 valid match.
"""
Collaborator

🟡 Suggestion: This docstring is confusing because it says "This test FAILS because the current behavior is broken" but the test actually asserts that the fix IS working (expects >= 1 matches).

Update the docstring to reflect that this test verifies the FIX works correctly:

Suggested change
- """
- Test that grunt's ASCII cat art causes ZERO valid PS1 matches.
- This is the ROOT CAUSE of the production error:
- "Expected at least one PS1 metadata block, but got 0."
- The regex matches one block but the content fails JSON parsing
- because it contains ASCII art and a nested ###PS1JSON### marker.
- This test FAILS because the current behavior is broken - we get
- 0 matches when there IS a valid JSON block at the end of the output.
- A fix should return at least 1 valid match.
- """
+ """
+ Test that the fix correctly recovers valid PS1 blocks from corrupted output.
+ When grunt's ASCII cat art corrupts the first PS1 block, the fix should
+ detect the nested ###PS1JSON### marker and extract the valid second block.
+ This is the ROOT CAUSE of the production error:
+ "Expected at least one PS1 metadata block, but got 0."
+ The fix handles this by finding the LAST ###PS1JSON### marker and extracting
+ the valid JSON that follows it.
+ """

@xingyaoww
Collaborator

@OpenHands /codereview-roasted can you checkout a new temporary branch from this PR (but do NOT push!), then try to remove all the implementations from this PR. Then tell me, can you reproduce the test failure mentioned in the PR description?

@openhands-ai

openhands-ai bot commented Jan 26, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Jan 26, 2026

Summary: Test Failure Reproduction Confirmed ✅

I followed your instructions to verify whether the test failure mentioned in the PR description can be reproduced:

What I Did:

  1. Created a temporary branch (temp-test-revert-fix) from the PR branch fix/ps1-corruption-test
  2. Reverted the implementation by restoring metadata.py to its original state from main (removing the _SyntheticMatch class and the nested marker detection logic)
  3. Ran the tests to check for failures
  4. Cleaned up by deleting the temporary branch (did NOT push as instructed)

Result: YES, the test failure can be reproduced

When the fix is removed, the test test_corrupted_ps1_causes_zero_valid_matches FAILS with:

AssertionError: BUG: Expected at least 1 valid PS1 match, got 0. 
The output contains a VALID PS1 block at the end, but the regex/parser 
fails to find it. This bug causes 'Expected at least one PS1 metadata 
block, but got 0' errors.

This matches exactly what the PR description claims:

  • Without the fix: 0 valid matches (test fails)
  • With the fix: 1 valid match (test passes)

Conclusion:

The PR's fix is legitimate and addresses a real bug. The _SyntheticMatch class and nested marker detection logic successfully recover valid PS1 metadata from corrupted output where ASCII art (like grunt's cat mascot) gets interleaved with the PS1 prompt.

@neubig marked this pull request as draft January 26, 2026 14:49
@neubig
Contributor Author

neubig commented Jan 26, 2026

@OpenHands read the critical comment from all-hands-bot and write a test to reproduce the "critical bug"

@openhands-ai

openhands-ai bot commented Jan 26, 2026

I'm on it! neubig can track my progress at all-hands.dev

Add TestSyntheticMatchPositionBug class with tests that demonstrate the
critical bug where _SyntheticMatch.start() and .end() return positions
from the original corrupted match instead of the recovered content position.

This causes terminal_session.py to extract wrong content when slicing
terminal_content using match positions.

See: #1817 (comment)

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Jan 26, 2026

Summary

I read the critical comment from all-hands-bot on PR #1817 and wrote a test to reproduce the "critical bug" as requested.

The Critical Bug

The all-hands-bot identified that the _SyntheticMatch class's start() and end() methods return positions from the original corrupted match, not the actual position of the recovered content. This is a critical bug because terminal_session.py uses these positions to slice terminal content:

# From terminal_session.py:
terminal_content[ps1_matches[0].start():ps1_matches[0].end()]
terminal_content[ps1_matches[i].end() + 1 : ps1_matches[i + 1].start()]

With incorrect positions, this extracts the wrong content and breaks command output parsing.
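
To illustrate why the positions matter, the sketch below applies the same slice shape to ordinary `re` matches over a hand-made `terminal_content`. The string and the `re.DOTALL` flag are assumptions for illustration; the slice only yields the command output because `end()` and `start()` point at the real block boundaries.

```python
import re

PATTERN = re.compile(r"###PS1JSON###(.*?)###PS1END###", re.DOTALL)

# Hypothetical terminal buffer: prompt block, command output, prompt block.
terminal_content = (
    '###PS1JSON###\n{"exit_code": "0"}\n###PS1END###\n'
    "8 passing (12ms)\n"
    '###PS1JSON###\n{"exit_code": "0"}\n###PS1END###'
)
ps1_matches = list(PATTERN.finditer(terminal_content))

# Same slice shape as the second line quoted from terminal_session.py.
between = terminal_content[ps1_matches[0].end() + 1 : ps1_matches[1].start()]
print(repr(between))  # prints '8 passing (12ms)\n'
```

If a match object reported the boundaries of a different span, the same slice would silently include or drop text.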

Changes Made

Added a new test class TestSyntheticMatchPositionBug in tests/tools/terminal/test_ps1_corruption.py with three tests:

  1. test_synthetic_match_positions_are_incorrect - The main failing test that demonstrates the bug. It shows that slicing the original string with match.start():match.end() gives different content than match.group(0). The sliced content includes the corrupted first block and ASCII art, instead of just the recovered valid PS1 block.

  2. test_content_after_synthetic_match_is_wrong - Tests that content after a synthetic match is extracted correctly.

  3. test_content_before_synthetic_match_includes_corrupted_data - Tests that content before a synthetic match includes the expected data and demonstrates that there's corrupted data between match.start() and the actual recovered content position.

Test Results

FAILED tests/tools/terminal/test_ps1_corruption.py::TestSyntheticMatchPositionBug::test_synthetic_match_positions_are_incorrect
PASSED tests/tools/terminal/test_ps1_corruption.py::TestSyntheticMatchPositionBug::test_content_after_synthetic_match_is_wrong
PASSED tests/tools/terminal/test_ps1_corruption.py::TestSyntheticMatchPositionBug::test_content_before_synthetic_match_includes_corrupted_data

The first test fails as expected, clearly demonstrating the critical bug with a detailed error message showing the mismatch between expected and actual content.

Commit

The changes have been committed and pushed to the fix/ps1-corruption-test branch:

  • Commit: cbe72037 - "test: add failing test for _SyntheticMatch position bug"

The PR #1817 has been updated with this new test that reproduces the critical bug identified by all-hands-bot.

Tests should fail when demonstrating undesirable behavior. Updated
test_synthetic_match_start_returns_correct_position and
test_synthetic_match_end_returns_correct_position to assert the
CORRECT behavior, which will fail until the bug is fixed.

Co-authored-by: openhands <openhands@all-hands.dev>
Renamed test_synthetic_match_positions_are_incorrect to
test_slicing_with_match_positions_returns_group_zero to clearly
describe the correct behavior we expect (which will pass when fixed).

Co-authored-by: openhands <openhands@all-hands.dev>
Remove redundant tests. The single test_slicing_with_match_positions_returns_group_zero
test covers both start() and end() by verifying the fundamental contract that
terminal_content[match.start():match.end()] == match.group(0).

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Jan 26, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1817 at branch `fix/ps1-corruption-test`

Feel free to include any additional details that might help me get this PR into a better state.

- Fix _SyntheticMatch.start() and .end() to return correct positions
  for the recovered content instead of the original corrupted match
- Use original (unstripped) group(1) content to find nested marker position
- Add test_synthetic_match_slicing_returns_group_zero to verify the fix
- Consolidate test into a single function instead of a class

The fix ensures that terminal_content[match.start():match.end()] equals
match.group(0), which is the fundamental contract of a match object.

Co-authored-by: openhands <openhands@all-hands.dev>