feat(benchmarks): Support local Claude UI benchmark suites by cameroncooke · Pull Request #429 · getsentry/XcodeBuildMCP

cameroncooke · 2026-05-26T01:28:51Z

Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP.

This keeps the committed suite focused on first-party XcodeBuildMCP coverage while allowing private/local benchmark suites to live under a generic ignored benchmarks/claude-ui/local/ tree. The harness records observed benchmark data rather than treating metric deltas as regression failures, writes per-run artifacts, and reports task completion separately from measured tool usage and timing.

The committed baselines were refreshed from three canonical full-suite runs. Local/private suite results are intentionally not committed.

This PR is stacked on #427.

cameroncooke · 2026-05-26T01:29:06Z

This stack of pull requests is managed by Graphite.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Old sequence config key silently ignored instead of rejected
- Added 'sequence' key to rejectRemovedConfigKeys with migration message 'removed; sequence checking is now observational only'

Preview (8dbf5edd5a)

diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -247,6 +247,7 @@
     allowedVariance: 'removed; baselines are observed data only',
     expectedFailures: 'removed; benchmark stumbles are observed data',
     expectedToolSequence: 'renamed to baselineToolSequence',
+    sequence: 'removed; sequence checking is now observational only',
   };
   for (const [key, message] of Object.entries(removedKeys)) {
     if (raw[key] !== undefined) throw new Error(`${source}.${key}: ${message}`);

Reviewed by Cursor Bugbot for commit ee9adf9.

Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP and optional local tool surfaces. The harness now records observed baselines, writes per-run artifacts, supports private local suites, and reports completion separately from benchmark metrics. Keep vendor/private suites out of tracked source by discovering ignored local benchmark suites from a generic local directory. Refresh the committed first-party baselines from the latest canonical benchmark runs.
Co-Authored-By: OpenAI Codex <noreply@openai.com>

Treat configured failure pattern matches as incomplete benchmark runs so CI exits non-zero for explicitly declared failure conditions. Reject activateSkill configs without skillDirs during suite parsing to avoid late failures after expensive setup.
Co-Authored-By: OpenAI Codex <noreply@openai.com>
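The two rules in this commit message can be illustrated with a small sketch. Only the names `activateSkill` and `skillDirs` come from the PR; the config shape, the function names, and the way completion is mapped to an exit code are assumptions made for illustration and may differ from the harness.

```typescript
// Hypothetical shapes: only `activateSkill` and `skillDirs` are names from the PR.
interface RawSkillConfig {
  activateSkill?: unknown;
  skillDirs?: unknown;
}

// Fail during parsing, before any expensive setup has been paid for.
function validateSkillConfig(raw: RawSkillConfig, source: string): void {
  if (raw.activateSkill === undefined) return;
  if (!Array.isArray(raw.skillDirs) || raw.skillDirs.length === 0) {
    throw new Error(`${source}.activateSkill: requires a non-empty skillDirs list`);
  }
}

// A run counts as complete only if the task finished and no declared
// failure pattern matched. Patterns are assumed to be created without the
// `g` flag, so `test` stays stateless across calls.
function isRunComplete(
  taskFinished: boolean,
  output: string,
  failurePatterns: RegExp[],
): boolean {
  return taskFinished && !failurePatterns.some((pattern) => pattern.test(output));
}

// Map per-run completion to a process exit code for CI.
function exitCodeFor(completedRuns: boolean[]): number {
  return completedRuns.every(Boolean) ? 0 : 1;
}
```

Under these assumptions, a single run whose output matches a declared failure pattern is enough to make `exitCodeFor` return 1, while metric deltas play no part in the exit code.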

Validate activated benchmark skills before setup, preserve authoritative Claude stream results only for harness-terminated runs, and make transcript failure accounting robust for missing or duplicate Bash tool results.
Co-Authored-By: OpenAI Codex <noreply@openai.com>

cameroncooke · 2026-05-26T18:44:25Z

For the Bugbot summary about the removed sequence key: this is already addressed in b8e9353. The config loader now rejects sequence with a migration message instead of silently ignoring it.
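The rejection behaviour described here can be read as a lookup table of retired keys. A minimal sketch, assuming the loader receives the parsed suite as a plain object plus a `source` label for error messages: the `RawSuiteConfig` type and the function signature are assumptions, and only the key names, the messages, and the loop body are taken from the diff preview.

```typescript
// Assumed shape of a parsed, not yet validated suite config.
type RawSuiteConfig = Record<string, unknown>;

// Keys and messages as they appear in the diff preview.
const removedKeys: Record<string, string> = {
  allowedVariance: 'removed; baselines are observed data only',
  expectedFailures: 'removed; benchmark stumbles are observed data',
  expectedToolSequence: 'renamed to baselineToolSequence',
  sequence: 'removed; sequence checking is now observational only',
};

// Throw on the first retired key so a migrated suite cannot silently
// lose a check its author still believes is active.
function rejectRemovedConfigKeys(raw: RawSuiteConfig, source: string): void {
  for (const [key, message] of Object.entries(removedKeys)) {
    if (raw[key] !== undefined) throw new Error(`${source}.${key}: ${message}`);
  }
}
```

With this sketch, a suite that still contains `sequence` fails at load time with a message naming the key, instead of running with the check quietly dropped.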

pkg-pr-new · 2026-05-26T18:45:29Z

npm i https://pkg.pr.new/xcodebuildmcp@429

commit: 1a06b6d

cameroncooke · 2026-05-26T20:00:24Z

Addressed the additional benchmark hardening audit in d1625f5: ignored failure patterns no longer suppress unrelated real failures in the same result, claude.maxClaudeSeconds now rejects non-finite/non-positive values, and aggregate artifact roots now use path-aware containment instead of string-prefix comparison.
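Two of these fixes lend themselves to a short illustration. The sketch below is not the harness code: `isPathInside` and `parseMaxClaudeSeconds` are hypothetical helper names, and the error wording is invented. It shows why a string-prefix comparison misclassifies sibling directories and what rejecting non-finite and non-positive timeouts can look like.

```typescript
import * as path from "node:path";

// Path-aware containment: compare resolved paths through path.relative
// instead of checking whether one string starts with the other.
function isPathInside(root: string, candidate: string): boolean {
  const relative = path.relative(path.resolve(root), path.resolve(candidate));
  if (relative === "") return true; // candidate is the root itself
  if (relative === ".." || relative.startsWith(`..${path.sep}`)) return false;
  return !path.isAbsolute(relative); // absolute means a different drive on Windows
}

// Hypothetical validator for the claude.maxClaudeSeconds setting.
function parseMaxClaudeSeconds(value: unknown, source: string): number {
  if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
    throw new Error(`${source}.claude.maxClaudeSeconds: must be a finite positive number`);
  }
  return value;
}
```

For example, `/tmp/artifacts-old/run` starts with the string `/tmp/artifacts`, so a prefix check would accept it as contained, while `isPathInside("/tmp/artifacts", "/tmp/artifacts-old/run")` returns `false`.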

cameroncooke mentioned this pull request May 26, 2026

feat(ui-automation): Add rs/1 runtime automation parity #416

Merged

cameroncooke mentioned this pull request May 26, 2026

feat(benchmarks): Add Claude UI benchmark harness #427

Merged

github-actions Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

github-actions Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

cursor Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/config.ts

sentry Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

Comment thread src/benchmarks/claude-ui/harness.ts

github-actions Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/__tests__/claude-ui-benchmark.test.ts

github-actions Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/harness.ts Outdated

Comment thread src/benchmarks/claude-ui/transcript.ts

github-actions Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/transcript.ts

github-actions Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/simulator-deletion.ts Outdated

sentry Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

Comment thread src/benchmarks/claude-ui/transcript.ts

github-actions Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/preflight-commands.ts Outdated

cameroncooke changed the title ~~feat(benchmarks): Add Claude UI benchmark harness~~ feat(benchmarks): Support local Claude UI benchmark suites May 26, 2026

cameroncooke changed the base branch from cam/feat/claude-ui-benchmark-harness to graphite-base/429 May 26, 2026 11:23

cameroncooke and others added 4 commits May 26, 2026 11:23

fix(benchmarks): Reject removed sequence config key

b8e9353

Reject the old sequence suite config key with an explicit migration message so migrated benchmarks do not silently drop sequence checks.
Co-Authored-By: OpenAI Codex <noreply@openai.com>

cameroncooke force-pushed the graphite-base/429 branch from 989ab76 to 3eaed16 Compare May 26, 2026 11:23

cameroncooke force-pushed the cam/feat/configurable-claude-ui-benchmark-tools branch from 9c53757 to 2082db2 Compare May 26, 2026 11:23

graphite-app Bot changed the base branch from graphite-base/429 to main May 26, 2026 11:24

fix(benchmarks): Harden Claude UI process cleanup

cd53c71

Detach timed Claude commands before process-group termination and fix RocketSim preflight launch detection for direct app/path commands.
Co-Authored-By: OpenAI Codex <noreply@openai.com>
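One way to read "detach before process-group termination" is sketched below for POSIX systems. It assumes the timed command is spawned with `detached: true`, which gives the child its own process group, so that a timeout can signal the whole group without touching the harness process. The function name, the result shape, and the choice of `SIGKILL` are assumptions and may not match the harness.

```typescript
import { spawn } from "node:child_process";

interface TimedResult {
  code: number | null;
  timedOut: boolean;
}

// Run a command in its own process group and kill the group on timeout.
function runWithTimeout(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<TimedResult> {
  return new Promise((resolve, reject) => {
    // detached: true makes the child a process group leader on POSIX.
    const child = spawn(command, args, { detached: true, stdio: "ignore" });
    let timedOut = false;

    const timer = setTimeout(() => {
      timedOut = true;
      try {
        // A negative pid targets the whole group, including grandchildren.
        if (child.pid !== undefined) process.kill(-child.pid, "SIGKILL");
      } catch {
        // The group may already have exited; nothing left to clean up.
      }
    }, timeoutMs);

    child.on("error", (error) => {
      clearTimeout(timer);
      reject(error);
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      resolve({ code, timedOut });
    });
  });
}
```

Without the detached spawn, the child would share the harness's process group, and signalling that group would also terminate the harness itself.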

cameroncooke force-pushed the cam/feat/configurable-claude-ui-benchmark-tools branch from 2082db2 to cd53c71 Compare May 26, 2026 11:24

sentry Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

Comment thread src/benchmarks/claude-ui/transcript.ts

fix(benchmarks): Handle Claude command stdin pipe errors

efffe69

Handle stdin stream errors from benchmark child processes so an early child exit does not crash the harness with an unhandled EPIPE.
Co-Authored-By: Codex <noreply@openai.com>

github-actions Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/__tests__/claude-ui-benchmark.test.ts

fix(benchmarks): Isolate local suite discovery tests

1a06b6d

Allow Claude UI suite discovery helpers to receive suite directories so tests can exercise local suite lookup without writing fake files into the real repository tree.
Co-Authored-By: Codex <noreply@openai.com>
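The testing benefit of passing directories in can be shown with a small sketch. The helper name `discoverSuiteFiles`, the `.json` suite extension, and the flat directory layout are assumptions for illustration; the PR does not show the real helper's signature or the suite file format.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical discovery helper: the caller decides which directories to scan.
function discoverSuiteFiles(suiteDirs: string[]): string[] {
  const found: string[] = [];
  for (const dir of suiteDirs) {
    if (!fs.existsSync(dir)) continue; // an ignored local tree may be absent
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      if (entry.isFile() && entry.name.endsWith(".json")) {
        found.push(path.join(dir, entry.name));
      }
    }
  }
  return found.sort();
}

// Test-style usage: build a throwaway tree instead of touching the repository.
function demoDiscovery(): string[] {
  const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "suite-discovery-"));
  try {
    fs.writeFileSync(path.join(tmp, "b.json"), "{}");
    fs.writeFileSync(path.join(tmp, "a.json"), "{}");
    fs.writeFileSync(path.join(tmp, "notes.txt"), "");
    const found = discoverSuiteFiles([tmp, path.join(tmp, "missing")]);
    return found.map((file) => path.basename(file));
  } finally {
    fs.rmSync(tmp, { recursive: true, force: true });
  }
}
```

Because the directories are parameters, a production caller can pass the committed and local suite roots while a test passes only a temporary directory.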

sentry Bot reviewed May 26, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

cameroncooke and others added 2 commits May 26, 2026 20:46

fix(benchmarks): Terminate Claude command on stdin errors

2889338

Stop the child process when benchmark command stdin fails with a non-EPIPE error, and ignore late stdout/stderr data after the command has settled.
Co-Authored-By: Codex <noreply@openai.com>
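The stdin handling described in this commit and the earlier EPIPE fix can be sketched together. This is an illustration under assumptions, not the harness implementation: `runCommand`, `CommandResult`, and the `settle` helper are hypothetical names, and only stdout is captured to keep the example short.

```typescript
import { spawn } from "node:child_process";

interface CommandResult {
  code: number | null;
  stdout: string;
}

function runCommand(command: string, args: string[], input: string): Promise<CommandResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let settled = false;
    let stdout = "";

    // Resolve or reject at most once; later events become no-ops.
    const settle = (finish: () => void): void => {
      if (settled) return;
      settled = true;
      finish();
    };

    child.stdout.on("data", (chunk: Buffer) => {
      // Ignore data that arrives after the command has settled.
      if (!settled) stdout += chunk.toString("utf8");
    });
    child.stderr.resume(); // drain stderr so the child cannot block on it

    child.stdin.on("error", (error: Error) => {
      // EPIPE only means the child closed stdin early; its exit status
      // still arrives through the "close" event.
      if ((error as NodeJS.ErrnoException).code === "EPIPE") return;
      child.kill();
      settle(() => reject(error));
    });

    child.on("error", (error) => settle(() => reject(error)));
    child.on("close", (code) => settle(() => resolve({ code, stdout })));
    child.stdin.end(input);
  });
}
```

Attaching an `error` listener to `child.stdin` is what prevents the unhandled stream error from crashing the process; the `settled` flag keeps a late rejection or late output from altering a result that has already been reported.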

fix(benchmarks): Harden Claude UI benchmark validation

d1625f5

Tighten transcript failure suppression, validate Claude timeout config, and make aggregate artifact roots path-aware.
Co-Authored-By: Codex <noreply@openai.com>

cameroncooke merged commit fe64572 into main May 26, 2026
28 checks passed

cameroncooke deleted the cam/feat/configurable-claude-ui-benchmark-tools branch May 26, 2026 20:01
