Commit fe03f96

cameroncooke and codex committed
feat(benchmarks): Add Claude UI benchmark harness
Add a local Claude UI benchmark harness for running deterministic app tasks against the development MCP server. The harness creates temporary simulators, uses isolated MCP config, records tool-call and timing metrics, and reports sequence drift with readable terminal output.

Stabilize post-action UI snapshots so mutating UI actions return settled refs before the next agent step.

Add benchmark and UI automation tests covering the new harness behavior and snapshot polling.

Co-Authored-By: Codex <noreply@openai.com>
1 parent 8da9745 commit fe03f96

31 files changed

Lines changed: 4347 additions & 41 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,7 @@
 
 ### Added
 
+- Added `--from-result` to the Claude UI benchmark harness so existing `result.json` artifacts can be rendered as text or JSON without rerunning Claude.
 - Added `nextSteps` hint lines to MCP `structuredContent` and CLI `--output json` envelopes so agents can consume follow-up actions without scraping text. CLI JSON renders shell command lines; MCP structured content renders MCP tool-call hints. Structured result schemas that include `nextSteps` now use schema version 2; existing version 1 schema files remain available for current validators.
 - Added `snapshot_ui sinceScreenHash` / CLI `--since-screen-hash` so callers can skip full runtime snapshot output when the screen hash is unchanged.
 - Added `batch` for executing multiple AXe UI automation steps in one simulator session.
@@ -14,11 +15,14 @@
 
 ### Changed
 
+- Changed Claude UI benchmark suite runs to create a temporary simulator by default and delete only that harness-created simulator after the suite finishes.
+- Changed Claude UI benchmark exact tool sequence drift to warn by default, with `sequence.mode: fail` available for strict suites.
 - Successful mutating UI automation calls now always attempt to refresh the runtime snapshot after the action instead of preserving or patching cached switch state.
 - Runtime snapshot guidance no longer advertises synthetic sheet swipe targets for foreground sheets. Agents should use real sheet grabber expansion and real descendant scroll/list targets with `drag` instead of inferred app/window-root sheet swipes.
 
 ### Fixed
 
+- Fixed Claude UI benchmark suite runs so temporary simulators are applied through an isolated per-run MCP config instead of being overridden by repo or example-project config defaults.
 - Fixed simulator launch failures before simulator-name resolution so they are not reported as macOS launch failures.
 - Fixed CLI JSON output so simulator-name resolution failures return the structured error envelope instead of plain stderr.
 - Fixed accessibility hierarchy tips so UI automation guidance prefers runtime element refs over raw coordinate guessing.

benchmarks/claude-ui/README.md

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
# Claude UI benchmark harness

Local/manual harness for running Claude Code against the development XcodeBuildMCP MCP server and auditing UI automation behavior.

The harness:

- reads a suite YAML file from `benchmarks/claude-ui/suites/`
- reads the referenced prompt Markdown file from disk and feeds it to `claude -p`
- creates, boots, waits for, and opens a fresh temporary simulator before Claude launches for each suite run by default
- writes an isolated per-run MCP workspace config with the suite defaults and temporary `simulatorId`
- generates a Claude MCP config pointing at `node build/cli.js mcp` with `XCODEBUILDMCP_CWD` set to that isolated workspace
- optionally preflights configured first-run prompts before Claude launches, outside the measured run
- deletes the temporary simulator at the end of the suite, best effort, using only the ID created by the harness
- writes artifacts under `out.nosync/claude-benchmarks/<suite>/<timestamp>/`
- runs `/Volumes/Developer/parse_claude_conversation.py` against Claude's stream JSONL
- audits tool counts, MCP calls, UI automation calls, wall clock, failures/stumbles, and expected tool sequence drift
- prints a structured per-suite report and (for `--all`) an aggregate summary
- optionally prints machine-readable JSON with `--json`
- can render an existing `result.json` or artifact directory with `--from-result` without rerunning Claude

This is intentionally not part of the normal test suite because it launches Claude and drives local simulators/apps.

## Commands

Build first, then run a suite:

```bash
npm run build
npx tsx benchmarks/claude-ui/run.ts --suite weather
```

Shortcut:

```bash
npm run bench:claude-ui -- --suite weather
```

Run every suite YAML:

```bash
npm run bench:claude-ui -- --all
```

Print machine-readable output from a new run:

```bash
npm run bench:claude-ui -- --suite reminders --json
```

Render an existing result without rerunning Claude:

```bash
npm run bench:claude-ui -- --from-result out.nosync/claude-benchmarks/reminders/20260522T130926Z
npm run bench:claude-ui -- --from-result out.nosync/claude-benchmarks/reminders/20260522T130926Z/result.json --json
```

## Suite YAML shape

```yaml
name: weather
prompt: ../prompts/weather.md
workingDirectory: example_projects/Weather
sessionDefaults:
  projectPath: Weather.xcodeproj
  scheme: Weather
  simulatorName: iPhone 17 Pro Max
temporarySimulator: true
firstRunPromptDismissals:
  labels:
    - Continue
    - Not Now
  timeoutSeconds: 12
baseline:
  totalToolCalls: 19
  mcpToolCalls: 18
  uiAutomationCalls: 16
  wallClockSeconds: 125
  tools:
    snapshot_ui: 1
    tap: 9
allowedVariance:
  totalToolCalls: 2
  mcpToolCalls: 2
  uiAutomationCalls: 2
  wallClockSeconds: 45
  toolCalls: 2
expectedToolSequence:
  - session_show_defaults
  - build_run_sim
  - snapshot_ui
sequence:
  mode: warn
failurePatterns:
  - STALE_ELEMENT_REF
  - SNAPSHOT_MISSING
  - WAIT_TIMEOUT
```

Variance is an upper bound: lower tool counts or faster runs are accepted, while values above `baseline + allowedVariance` fail. Defaults are `totalToolCalls: 0`, `mcpToolCalls: 0`, `uiAutomationCalls: 0`, `toolCalls: 0`, and `wallClockSeconds: 30`.
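
The upper-bound rule can be sketched as a single comparison. This is only an illustration: `checkMetric` and `MetricStatus` are hypothetical names, not taken from the harness source, and the numbers come from the weather example in this README.

```typescript
// Hypothetical helper; names are illustrative, not the harness's actual API.
type MetricStatus = 'PASS' | 'FAIL';

function checkMetric(actual: number, baseline: number, allowedVariance: number): MetricStatus {
  // Variance is an upper bound only: anything at or below
  // baseline + allowedVariance passes, including values far below baseline.
  return actual <= baseline + allowedVariance ? 'PASS' : 'FAIL';
}

console.log(checkMetric(13, 19, 2)); // PASS: below baseline
console.log(checkMetric(21, 19, 2)); // PASS: exactly at the limit
console.log(checkMetric(22, 19, 2)); // FAIL: above the limit
```

Under this reading, a run that needs fewer calls than the baseline never fails on that metric; only regressions beyond the allowance do.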

Tool sequence drift is warning-only by default (`sequence.mode: warn`) because real Claude runs can choose equally valid UI paths. Use `sequence.mode: fail` only for suites where exact MCP call order is part of the contract.

`sessionDefaults` are written to a harness-owned config at `<run>/mcp-workspace/.xcodebuildmcp/config.yaml`. The generated Claude MCP config sets `XCODEBUILDMCP_CWD` to `<run>/mcp-workspace`, so the dev MCP server reads only the benchmark config instead of any repo or example-project `.xcodebuildmcp/config.yaml`. Unknown keys fail fast. Relative path defaults such as `projectPath`, `workspacePath`, and `derivedDataPath` are resolved against the suite `workingDirectory` before being written because the MCP server cwd is the isolated workspace.
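
A rough sketch of the path-resolution step, under stated assumptions: `resolvePathDefaults` and `PATH_KEYS` are hypothetical names, the key list covers only the three keys named above (the real list may be longer), and `/repo` is a placeholder checkout location.

```typescript
import * as path from 'node:path';

// Assumed list of path-like keys; the README names these three as examples only.
const PATH_KEYS = ['projectPath', 'workspacePath', 'derivedDataPath'] as const;

// Hypothetical helper: resolves relative path defaults against the suite
// workingDirectory so they stay valid when the server cwd is the isolated workspace.
function resolvePathDefaults(
  defaults: Record<string, unknown>,
  workingDirectory: string,
): Record<string, unknown> {
  const resolved: Record<string, unknown> = { ...defaults };
  for (const key of PATH_KEYS) {
    const value = resolved[key];
    if (typeof value === 'string' && !path.isAbsolute(value)) {
      resolved[key] = path.resolve(workingDirectory, value);
    }
  }
  return resolved;
}

// '/repo' is a placeholder for wherever the repository is checked out.
const effective = resolvePathDefaults(
  { projectPath: 'Weather.xcodeproj', scheme: 'Weather' },
  '/repo/example_projects/Weather',
);
console.log(effective.projectPath); // on macOS: /repo/example_projects/Weather/Weather.xcodeproj
```

Non-path keys such as `scheme` pass through unchanged, and already-absolute paths are left alone.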

## Temporary simulator lifecycle

By default, each suite creates a fresh simulator before Claude launches. The harness uses `sessionDefaults.simulatorName` as the `simctl create` device type name, captures the returned simulator ID, boots that simulator, waits for `simctl bootstatus <id> -b`, opens Simulator.app to that device, applies a short UI-readiness delay, and writes the simulator ID as `sessionDefaults.simulatorId` in the isolated MCP workspace config. This makes Claude and the dev MCP server target a visible, booted, isolated simulator instead of reusing a previous run's state or spending benchmark calls on simulator boot/open setup.

Simulator setup is deliberately outside the benchmark measurement boundary. The measured `wallClockSeconds` starts when the harness spawns Claude and stops when Claude exits. Tool-call counts are parsed only from Claude's JSONL transcript. The result JSON still records temporary simulator `setupDurationSeconds` under `run.temporarySimulator` so setup cost is visible without being compared against Claude task-efficiency baselines.

Config contract:

- Omit `temporarySimulator` for the default behavior: create and later delete a temporary simulator.
- Set `temporarySimulator: false` to opt out and use the suite/project defaults as-is.
- Set `sessionDefaults.simulatorId` to use an existing simulator. In this case the harness does not create or delete a simulator.
- Do not set both `temporarySimulator: true` and `sessionDefaults.simulatorId`; the harness fails fast because deleting a user-provided simulator would be unsafe.
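
The config contract above can be read as a small decision function. The sketch below is an assumption about how those four rules combine; `SuiteConfig`, `SimulatorPlan`, and `planSimulator` are hypothetical names, not the harness's real types.

```typescript
// Hypothetical shapes; only the two fields relevant to the contract are modeled.
interface SuiteConfig {
  temporarySimulator?: boolean;
  sessionDefaults?: { simulatorId?: string };
}

type SimulatorPlan = 'create-temporary' | 'use-existing' | 'use-defaults';

function planSimulator(suite: SuiteConfig): SimulatorPlan {
  const explicitId = suite.sessionDefaults?.simulatorId;
  // Fail fast: the harness must never delete a simulator it did not create.
  if (suite.temporarySimulator === true && explicitId !== undefined) {
    throw new Error('temporarySimulator: true conflicts with sessionDefaults.simulatorId');
  }
  if (explicitId !== undefined) return 'use-existing'; // no create, no delete
  if (suite.temporarySimulator === false) return 'use-defaults'; // explicit opt-out
  return 'create-temporary'; // default when temporarySimulator is omitted
}

console.log(planSimulator({})); // create-temporary
console.log(planSimulator({ temporarySimulator: false })); // use-defaults
console.log(planSimulator({ sessionDefaults: { simulatorId: 'ABC-123' } })); // use-existing
```

`ABC-123` is a placeholder ID. Checking the conflict first keeps the unsafe combination from ever reaching the create or delete paths.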

Temporary simulator setup is required when enabled. If creation, boot, bootstatus, or Simulator.app opening fails, the suite fails loudly before Claude starts. Deletion is best effort in a `finally` block: failures are logged but do not mask the benchmark result or original error.

`firstRunPromptDismissals` is an optional suite-level preflight for fresh simulator noise such as Apple first-run sheets. When configured, the harness launches `sessionDefaults.bundleId` before Claude starts, retries through transient UI-inspection failures, looks for any listed button labels, taps matching labels with AXe, then terminates the app. If the prompt state cannot be inspected or dismissed before `timeoutSeconds`, the suite fails before Claude starts. These preflight interactions are logged in `simulator-lifecycle.log`, but they are outside Claude's wall-clock measurement and do not appear in tool-call counts. Keep the labels generic and non-destructive, for example `Continue`, `Not Now`, or `OK`; do not configure sign-in, sync enablement, Settings, destructive, or data-deletion actions.

Lifecycle details are written to `simulator-lifecycle.log`, including the `create`, `boot`, `bootstatus`, `open`, readiness delay, optional first-run prompt preflight, and deletion steps. `claude-command.log` also records the simulator ID used for the run. The terminal report shows the temporary simulator ID plus setup duration as `setup ... before Claude` when a temporary simulator is used.

## Terminal report

Each suite renders as a structured report with a status banner, aligned metric and tool tables, a failures/stumbles section (only when non-zero), and a sequence diff. When run with `--all`, an aggregate summary follows the per-suite reports.

### Single suite

```text
────────────────────────────────────────────────────────────────────────
PASS  weather  1m 38.6s
suite      benchmarks/claude-ui/suites/weather.yml
artifacts  out.nosync/claude-benchmarks/weather/20260522T214044Z
exit       claude=0 parser=0

Metrics
METRIC             ACTUAL  BASELINE  VARIANCE  DELTA   STATUS
totalToolCalls     13      19        +2        −6      PASS
mcpToolCalls       12      18        +2        −6      PASS
uiAutomationCalls  10      16        +2        −6      PASS
wallClockSeconds   98.62   125.00    +45.00    −26.38  PASS

Tool calls (baseline-tracked)
TOOL                   ACTUAL  BASELINE  DELTA  STATUS
session_show_defaults  1       1         0      PASS
build_run_sim          1       1         0      PASS
snapshot_ui            1       1         0      PASS
tap                    6       9         −3     PASS
batch                  1       1         0      PASS

PASS failures/stumbles: 0
```

### Sequence drift

When the tool sequence drifts, the report includes unified-diff style hunks with expected/actual index columns. Drift is warning-only by default, so the overall status stays `WARN` rather than `FAIL`:

```text
WARN tool sequence (warn): drift: 4 missing, 0 additional
@@ expected[8..15] actual[8..11] @@
 8   8    tap
 9   9    tap
10      − tap
11  10    swipe
12  11    tap
13      − swipe
14      − tap
15      − tap
```

`−` lines are expected calls Claude skipped; `+` lines are calls Claude made that were not expected. Dim lines are surrounding context.
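
One way to arrive at counts like `4 missing, 0 additional` is a longest-common-subsequence (LCS) comparison of the two call lists. The sketch below assumes that approach; `countDrift` is a hypothetical helper, and the harness's actual diff algorithm is not shown in this README. The input is the slice from the hunk above.

```typescript
// Hypothetical helper: counts drift via a classic LCS table.
// Assumption: the real harness may compute its diff differently.
function countDrift(
  expected: string[],
  actual: string[],
): { missing: number; additional: number } {
  const rows = expected.length + 1;
  const cols = actual.length + 1;
  // lcs[i][j] = LCS length of expected[0..i) and actual[0..j)
  const lcs: number[][] = Array.from({ length: rows }, () => new Array<number>(cols).fill(0));
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      lcs[i][j] =
        expected[i - 1] === actual[j - 1]
          ? lcs[i - 1][j - 1] + 1
          : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
    }
  }
  const common = lcs[expected.length][actual.length];
  // Expected calls not matched are "missing"; actual calls not matched are "additional".
  return { missing: expected.length - common, additional: actual.length - common };
}

const expectedSlice = ['tap', 'tap', 'tap', 'swipe', 'tap', 'swipe', 'tap', 'tap'];
const actualSlice = ['tap', 'tap', 'swipe', 'tap'];
console.log(countDrift(expectedSlice, actualSlice)); // 4 missing, 0 additional
```

Because the comparison is order-aware, a run that makes the right calls in a different order still shows up as drift, which is why warn mode is the default.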

### Failures and inspect hints

When `failures/stumbles` is non-zero the report lists the first few tool failures and pattern matches, and surfaces an `Inspect` block with the relevant artifact paths:

```text
FAIL failures/stumbles: 1
  • tool failures: 1
    boot_sim @ line 9: Boot failed: device not found

Inspect
  result.json  out.nosync/claude-benchmarks/reminders/20260522T213905Z/result.json
  transcript   out.nosync/claude-benchmarks/reminders/20260522T213905Z/claude.jsonl
  stderr       out.nosync/claude-benchmarks/reminders/20260522T213905Z/claude.stderr
  run dir      out.nosync/claude-benchmarks/reminders/20260522T213905Z
```

### Aggregate summary

After `--all` (or multi-result `--from-result`) the harness appends:

```text
════════════════════════════════════════════════════════════════════════
Claude UI Benchmarks · Summary
════════════════════════════════════════════════════════════════════════
Suites:    3 total · 2 passed · 1 failed · 2 sequence warnings
Duration:  total 4m 49.8s · slowest reminders (1m 39.8s)
Artifacts: out.nosync/claude-benchmarks/

! WARN  weather    1m 38.6s  sequence warn: 4m/0a
✗ FAIL  reminders  1m 39.8s  1 stumble · sequence warn: 7m/4a
! WARN  contacts   1m 31.4s  sequence warn: 2m/2a
════════════════════════════════════════════════════════════════════════
```

`Nm/Ka` denotes "N missing / K additional" calls vs. `expectedToolSequence`.

The renderer auto-detects TTY and adds ANSI color when stdout is a terminal and `NO_COLOR` is unset. Plain-text output (e.g. when piping to a file or under `NO_COLOR=1`) carries the same information without color codes.
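
The color decision described above amounts to two checks. A minimal sketch, assuming a hypothetical `shouldUseColor` helper that takes the TTY flag and environment as arguments, and treating "unset" literally as the variable being absent:

```typescript
// Hypothetical helper; the renderer's real function name and signature are not shown here.
function shouldUseColor(isTTY: boolean, env: Record<string, string | undefined>): boolean {
  // Color only when stdout is a terminal AND NO_COLOR is absent from the environment.
  return isTTY && env.NO_COLOR === undefined;
}

// In a Node process this would typically be called as:
//   shouldUseColor(Boolean(process.stdout.isTTY), process.env)
console.log(shouldUseColor(true, {})); // true
console.log(shouldUseColor(true, { NO_COLOR: '1' })); // false
console.log(shouldUseColor(false, {})); // false
```

Passing the inputs in rather than reading globals keeps the rule easy to test against piped output and `NO_COLOR=1` without spawning a terminal.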

`--json` output is unchanged by this renderer: the JSON payload remains a single `BenchmarkResult` for `--suite` / single-result `--from-result`, and an array for `--all` / multi-result `--from-result`.

## Artifacts

Each run writes:

- `prompt.md` — exact suite prompt fed to Claude
- `mcp-config.json` — generated Claude MCP config
- `mcp-workspace/.xcodebuildmcp/config.yaml` — isolated MCP server config with effective suite defaults
- `claude.jsonl` — Claude stream JSON output
- `claude.stderr` — Claude stderr
- `claude-command.log` — command, cwd, simulator ID, exit status, wall clock
- `simulator-lifecycle.log` — temporary simulator create, boot, bootstatus, open, readiness, deletion commands, and simulator ID
- `parsed/` — files written by `parse_claude_conversation.py`
- `parse.log` / `parse.log.stderr` — parser output
- `result.json` — full benchmark result
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
# Contacts UI benchmark

Task:
1. Launch Contacts on the configured simulator.
2. Create exactly one new contact with these details:
   - First name: `MCP`
   - Last name: `Contact Benchmark`
   - Organization: `XcodeBuildMCP Benchmark`
   - Phone: `555-010-4242`
   - Email: `mcp.contact.benchmark@example.com`
3. Save the contact.
4. Verify the saved contact by observing the saved contact card only.

Verification rules:
- Do not enter edit mode after saving.
- Do not change, retype, normalize, delete, or clean up any saved contact data during verification.
- Verification means reading the saved card using UI snapshots and, only if needed, a screenshot.
- Phone-number display formatting may differ by locale. Treat the phone as correct if the saved card visibly contains the same digits as `555-010-4242` in any grouping or punctuation.
- Organization casing may differ. Treat it as correct if the saved card visibly contains the same words as `XcodeBuildMCP Benchmark`.

Return a concise final summary of what you created and observed.
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Reminders UI benchmark

Task:
1. Launch Reminders on the configured simulator.
2. Create a new list named `MCP Benchmark List`.
3. Add exactly these reminders to `MCP Benchmark List`:
   - `Buy milk benchmark`
   - `File report benchmark`
   - `Call team benchmark`
4. Mark exactly these reminders complete:
   - `Buy milk benchmark`
   - `Call team benchmark`
5. Leave exactly this reminder incomplete:
   - `File report benchmark`
6. Verify the final state of `MCP Benchmark List` by observing the list only: exactly two completed reminders (`Buy milk benchmark`, `Call team benchmark`) and one incomplete reminder (`File report benchmark`).

Verification rules:
- Do not edit, rename, delete, reorder, or clean up reminders or lists during verification.
- Do not create additional reminders or lists.
- Verification means reading the saved list state using UI snapshots and, only if needed, a screenshot.

Return a concise final summary of what you created and observed.
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Weather UI benchmark

Task:
1. Build and run the Weather example app on the configured simulator.
2. Open the settings sheet.
3. Change these settings:
   - Temperature: °C
   - Wind speed: m/s
   - Pressure: inHg
   - Distance: km
   - Atmospheric animations: off
   - Severe weather alerts: off
   - Reduce transparency: on
4. Search by typing exactly `London`, then select the London result.
5. Verify the main screen shows `London`, `11°`, precipitation `78%`, and visibility `9.7 km`.
6. Open the precipitation details and verify by observing the UI that it shows `78%` chance over the next 24 hours, `10.7 mm` total expected, `6 hrs` hours of rain, `14 km` storm distance, and lightning `None`.

Verification rules:
- Do not change settings, locations, or app data after reaching the precipitation details.
- Verification means reading the visible UI state using UI snapshots and, only if needed, a screenshot.

Return a concise final summary of what you observed.

benchmarks/claude-ui/run.ts

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
#!/usr/bin/env tsx
import { main } from '../../src/benchmarks/claude-ui/harness.ts';

main()
  .then((exitCode) => {
    process.exitCode = exitCode;
  })
  .catch((error) => {
    console.error(error instanceof Error ? error.message : String(error));
    process.exitCode = 1;
  });
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
name: contacts
prompt: ../prompts/contacts.md
workingDirectory: .
sessionDefaults:
  bundleId: com.apple.MobileAddressBook
  simulatorName: iPhone 17 Pro Max
firstRunPromptDismissals:
  labels:
    - Continue
    - Not Now
    - OK
  timeoutSeconds: 8
baseline:
  totalToolCalls: 14
  mcpToolCalls: 13
  uiAutomationCalls: 11
  wallClockSeconds: 97
  tools:
    session_show_defaults: 1
    launch_app_sim: 1
    snapshot_ui: 1
    tap: 5
    type_text: 5
allowedVariance:
  totalToolCalls: 3
  mcpToolCalls: 3
  uiAutomationCalls: 3
  wallClockSeconds: 45
  toolCalls: 2
expectedToolSequence:
  - session_show_defaults
  - launch_app_sim
  - snapshot_ui
  - tap
  - tap
  - type_text
  - type_text
  - type_text
  - tap
  - type_text
  - tap
  - type_text
  - tap
failurePatterns:
  - STALE_ELEMENT_REF
  - SNAPSHOT_MISSING
  - WAIT_TIMEOUT
