add webbench, chrome-based OS world, and ground truth to web voyager #1057
base: main
Conversation
🦋 Changeset detected. Latest commit: 067d013. The changes in this PR will be included in the next version bump.
IDK if we want to include these here or just refer to them when we build the evals graphs? cc @miguelg719
like dragging random CSVs in the repo is not great
resolved - will remove results
Greptile Summary
This PR successfully integrates three major evaluation benchmarks into Stagehand: WebBench, OS World (Chrome tasks only), and WebVoyager with ground truth reference answers. The implementation includes proper licensing, well-structured adapters for data conversion, and comprehensive evaluation logic for each benchmark type.
Key improvements include:
- Added WebBench evaluation dataset with 1000+ web automation tasks across 5 categories (READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION); see the adapter sketch after this list
- Integrated 47 Chrome-compatible OS World tasks with proper evaluation criteria mapping
- Enhanced WebVoyager with ground truth checker using reference answers from Browser Use public eval repository
- Comprehensive configuration options for filtering and sampling tasks across all benchmarks
- Proper Apache 2.0 and MIT licensing with attribution for external datasets
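To make the adapter and the filtering/sampling options concrete, here is a minimal TypeScript sketch of what a benchmark adapter along these lines might look like. The WebBenchRow and EvalTask shapes and the buildWebBenchTasks helper are hypothetical illustrations, not the PR's actual API:

```typescript
// Hypothetical shape of one WebBench dataset row; the real schema may differ.
interface WebBenchRow {
  id: string;
  task: string;
  url: string;
  category: "READ" | "CREATE" | "UPDATE" | "DELETE" | "FILE_MANIPULATION";
}

// Hypothetical shape of an eval task consumed by the harness.
interface EvalTask {
  name: string;
  startUrl: string;
  instruction: string;
}

// Convert WebBench rows into eval tasks, optionally filtering by
// category and capping the number of sampled tasks.
function buildWebBenchTasks(
  rows: WebBenchRow[],
  options: { categories?: WebBenchRow["category"][]; sampleSize?: number } = {},
): EvalTask[] {
  const cats = options.categories;
  let selected = cats ? rows.filter((r) => cats.includes(r.category)) : rows;
  if (options.sampleSize !== undefined) {
    selected = selected.slice(0, options.sampleSize);
  }
  return selected.map((r) => ({
    name: `webbench/${r.id}`,
    startUrl: r.url,
    instruction: r.task,
  }));
}
```

A caller could then request a small read-only slice with buildWebBenchTasks(rows, { categories: ["READ"], sampleSize: 25 }).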
Confidence score: 4/5
- This PR is safe to merge with minimal risk
- The implementation is well-structured, with proper error handling and licensing compliance, and it follows existing patterns. One minor duplicate configuration issue was found, but it doesn't affect functionality
- evals/evals.config.json has a duplicate task entry that should be corrected (a mechanical check is sketched below)
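As a rough illustration of how such a duplicate could be caught mechanically (assuming evals.config.json holds a top-level tasks array with name fields; the real schema may differ):

```typescript
import { readFileSync } from "node:fs";

// Hypothetical entry shape; the real evals.config.json schema may differ.
interface EvalTaskEntry {
  name: string;
  categories?: string[];
}

const config = JSON.parse(
  readFileSync("evals/evals.config.json", "utf8"),
) as { tasks: EvalTaskEntry[] };

// Report any task name that appears more than once.
const seen = new Set<string>();
for (const task of config.tasks) {
  if (seen.has(task.name)) {
    console.warn(`Duplicate task entry: ${task.name}`);
  }
  seen.add(task.name);
}
```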
65 files reviewed, no comments
why
We want to build a best-in-class agent in Stagehand. To do that, we need more eval benchmarks.
what changed
Improvements to pnpm run evals -man to better describe how to run evals.
test plan
Evals should run locally and on Browserbase for these new benchmarks.