Conversation

@filip-michalsky (Collaborator) commented Sep 7, 2025

why

We want to build a best-in-class agent in Stagehand, so we need more eval benchmarks.

what changed

  • Added the WebBench evals dataset
  • Added a subset of OS World evals: those that can be run in a Chrome browser (desktop-based tasks omitted)
  • Added LICENSE notices to the copied evals tasks
  • Added ground truth / expected results to some WebVoyager tasks using reference_answer.json from the Browser Use public evals repo (a sketch of such a check follows this list)
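
For context, a minimal sketch of how such a ground-truth check might work (the file path, the `ReferenceAnswers` shape, and `checkWebVoyagerAnswer` are illustrative assumptions, not this PR's actual code):

```ts
import fs from "node:fs";

// Hypothetical layout: reference_answer.json maps a WebVoyager task id to
// its expected final answer. The real file from Browser Use may differ.
type ReferenceAnswers = Record<string, string>;

const referenceAnswers: ReferenceAnswers = JSON.parse(
  fs.readFileSync("evals/datasets/webvoyager/reference_answer.json", "utf-8"),
);

// Score one task by comparing the agent's final answer to the ground truth.
// Normalization (trim + lowercase) keeps the check robust to formatting noise.
function checkWebVoyagerAnswer(taskId: string, agentAnswer: string): boolean {
  const expected = referenceAnswers[taskId];
  if (expected === undefined) return false; // no ground truth for this task
  return agentAnswer.trim().toLowerCase() === expected.trim().toLowerCase();
}
```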

Also improved `pnpm run evals -man` so it better describes how to run evals.

test plan

Evals for these new benchmarks should run both locally and on Browserbase.

changeset-bot commented Sep 7, 2025

🦋 Changeset detected

Latest commit: 067d013

The changes in this PR will be included in the next version bump.


@filip-michalsky (Collaborator, Author) commented:

IDK if we want to include these here or just refer to them when we build the evals graphs? cc @miguelg719

@filip-michalsky (Collaborator, Author) commented:

like dragging random CSVs into the repo is not great

@filip-michalsky (Collaborator, Author) commented:

Resolved; will remove results.

@filip-michalsky changed the title from "add webbench" to "add webbench, chrome-based OS world, and ground truth to web voyager" on Sep 8, 2025
@filip-michalsky marked this pull request as ready for review on September 8, 2025 at 16:30
@greptile-apps (bot, Contributor) left a comment:

Greptile Summary

This PR successfully integrates three major evaluation benchmarks into Stagehand: WebBench, OS World (Chrome tasks only), and WebVoyager with ground truth reference answers. The implementation includes proper licensing, well-structured adapters for data conversion, and comprehensive evaluation logic for each benchmark type.

Key improvements include:

  • Added WebBench evaluation dataset with 1000+ web automation tasks across 5 categories (READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION); a sketch of a plausible task shape follows this list
  • Integrated 47 Chrome-compatible OS World tasks with proper evaluation criteria mapping
  • Enhanced WebVoyager with ground truth checker using reference answers from Browser Use public eval repository
  • Comprehensive configuration options for filtering and sampling tasks across all benchmarks
  • Proper Apache 2.0 and MIT licensing with attribution for external datasets
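
Purely as illustration, here is one plausible TypeScript shape for a WebBench task entry, along with the kind of category filtering and sampling the summary mentions (all names below are assumptions, not the PR's actual adapter code):

```ts
// Illustrative sketch only; field names are assumptions, not the PR's schema.
type WebBenchCategory = "READ" | "CREATE" | "UPDATE" | "DELETE" | "FILE_MANIPULATION";

interface WebBenchTask {
  id: string;
  category: WebBenchCategory;
  startUrl: string;     // page the browser agent starts on
  instruction: string;  // natural-language task given to the agent
}

// Take the first n tasks of a given category, e.g. sampleTasks(tasks, "READ", 25).
function sampleTasks(
  tasks: WebBenchTask[],
  category: WebBenchCategory,
  n: number,
): WebBenchTask[] {
  return tasks.filter((t) => t.category === category).slice(0, n);
}
```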

Confidence score: 4/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured, with proper error handling and licensing compliance, and it follows existing patterns. One minor duplicate configuration issue was found, but it doesn't affect functionality.
  • evals/evals.config.json has a duplicate task entry that should be corrected

65 files reviewed, no comments

