feat(benchmarks): Support local Claude UI benchmark suites#429
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Old
sequenceconfig key silently ignored instead of rejected- Added 'sequence' key to rejectRemovedConfigKeys with migration message 'removed; sequence checking is now observational only'
Or push these changes by commenting:
@cursor push 8dbf5edd5a
Preview (8dbf5edd5a)
diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -247,6 +247,7 @@
allowedVariance: 'removed; baselines are observed data only',
expectedFailures: 'removed; benchmark stumbles are observed data',
expectedToolSequence: 'renamed to baselineToolSequence',
+ sequence: 'removed; sequence checking is now observational only',
};
for (const [key, message] of Object.entries(removedKeys)) {
if (raw[key] !== undefined) throw new Error(`${source}.${key}: ${message}`);You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit ee9adf9. Configure here.
Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP and optional local tool surfaces. The harness now records observed baselines, writes per-run artifacts, supports private local suites, and reports completion separately from benchmark metrics. Keep vendor/private suites out of tracked source by discovering ignored local benchmark suites from a generic local directory. Refresh the committed first-party baselines from the latest canonical benchmark runs. Co-Authored-By: OpenAI Codex <noreply@openai.com>
Treat configured failure pattern matches as incomplete benchmark runs so CI exits non-zero for explicitly declared failure conditions. Reject activateSkill configs without skillDirs during suite parsing to avoid late failures after expensive setup. Co-Authored-By: OpenAI Codex <noreply@openai.com>
Reject the old sequence suite config key with an explicit migration message so migrated benchmarks do not silently drop sequence checks. Co-Authored-By: OpenAI Codex <noreply@openai.com>
Validate activated benchmark skills before setup, preserve authoritative Claude stream results only for harness-terminated runs, and make transcript failure accounting robust for missing or duplicate Bash tool results. Co-Authored-By: OpenAI Codex <noreply@openai.com>
989ab76 to
3eaed16
Compare
9c53757 to
2082db2
Compare
Detach timed Claude commands before process-group termination and fix RocketSim preflight launch detection for direct app/path commands. Co-Authored-By: OpenAI Codex <noreply@openai.com>
2082db2 to
cd53c71
Compare
Handle stdin stream errors from benchmark child processes so an early child exit does not crash the harness with an unhandled EPIPE. Co-Authored-By: Codex <noreply@openai.com>
|
For the Bugbot summary about the removed |
commit: |
Allow Claude UI suite discovery helpers to receive suite directories so tests can exercise local suite lookup without writing fake files into the real repository tree. Co-Authored-By: Codex <noreply@openai.com>
Stop the child process when benchmark command stdin fails with a non-EPIPE error, and ignore late stdout/stderr data after the command has settled. Co-Authored-By: Codex <noreply@openai.com>
Tighten transcript failure suppression, validate Claude timeout config, and make aggregate artifact roots path-aware. Co-Authored-By: Codex <noreply@openai.com>
|
Addressed the additional benchmark hardening audit in d1625f5: ignored failure patterns no longer suppress unrelated real failures in the same result, |



Add a local Claude Code benchmark harness for measuring UI task runs against XcodeBuildMCP.
This keeps the committed suite focused on first-party XcodeBuildMCP coverage while allowing private/local benchmark suites to live under a generic ignored
benchmarks/claude-ui/local/tree. The harness records observed benchmark data rather than treating metric deltas as regression failures, writes per-run artifacts, and reports task completion separately from measured tool usage and timing.The committed baselines were refreshed from three canonical full-suite runs. Local/private suite results are intentionally not committed.
This PR is stacked on #427.