Conversation

@filip-michalsky (Collaborator) commented Sep 7, 2025

why

We want to build a best-in-class agent in Stagehand, so we need more eval benchmarks.

what changed

  • Added the WebBench evals dataset
  • Added a subset of OS World evals: those that can be run in a Chrome browser (desktop-based tasks omitted)
  • Added LICENSE notices to the copied evals tasks
  • Added ground truth / expected results to some WebVoyager tasks using reference_answer.json from the Browser Use public evals repo (a sketch of such a check follows this list)
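
For context, a minimal sketch of how such a ground-truth check might work (the file path, the `ReferenceAnswers` shape, and `checkWebVoyagerAnswer` are illustrative assumptions, not this PR's actual code):

```ts
import fs from "node:fs";

// Hypothetical layout: reference_answer.json maps a WebVoyager task id to
// its expected final answer. The real file from Browser Use may differ.
type ReferenceAnswers = Record<string, string>;

const referenceAnswers: ReferenceAnswers = JSON.parse(
  fs.readFileSync("evals/datasets/webvoyager/reference_answer.json", "utf-8"),
);

// Score one task by comparing the agent's final answer to the ground truth.
// Normalization (trim + lowercase) keeps the check robust to formatting noise.
function checkWebVoyagerAnswer(taskId: string, agentAnswer: string): boolean {
  const expected = referenceAnswers[taskId];
  if (expected === undefined) return false; // no ground truth for this task
  return agentAnswer.trim().toLowerCase() === expected.trim().toLowerCase();
}
```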

Also improved `pnpm run evals -man` so it better describes how to run evals.

test plan

Evals for these new benchmarks should run both locally and on Browserbase.

changeset-bot commented Sep 7, 2025

🦋 Changeset detected

Latest commit: 067d013

The changes in this PR will be included in the next version bump.


@filip-michalsky (Collaborator, Author) commented:

IDK if we want to include these here or just refer to them when we build the evals graphs? cc @miguelg719

@filip-michalsky (Collaborator, Author) commented:

like dragging random CSVs into the repo is not great

@filip-michalsky (Collaborator, Author) commented:

Resolved; will remove results.

@filip-michalsky changed the title from "add webbench" to "add webbench, chrome-based OS world, and ground truth to web voyager" on Sep 8, 2025
@filip-michalsky marked this pull request as ready for review on September 8, 2025 at 16:30
@greptile-apps (bot, Contributor) left a comment:

Greptile Summary

This PR successfully integrates three major evaluation benchmarks into Stagehand: WebBench, OS World (Chrome tasks only), and WebVoyager with ground truth reference answers. The implementation includes proper licensing, well-structured adapters for data conversion, and comprehensive evaluation logic for each benchmark type.

Key improvements include:

  • Added WebBench evaluation dataset with 1000+ web automation tasks across 5 categories (READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION); a sketch of a plausible task shape follows this list
  • Integrated 47 Chrome-compatible OS World tasks with proper evaluation criteria mapping
  • Enhanced WebVoyager with ground truth checker using reference answers from Browser Use public eval repository
  • Comprehensive configuration options for filtering and sampling tasks across all benchmarks
  • Proper Apache 2.0 and MIT licensing with attribution for external datasets
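
Purely as illustration, here is one plausible TypeScript shape for a WebBench task entry, along with the kind of category filtering and sampling the summary mentions (all names below are assumptions, not the PR's actual adapter code):

```ts
// Illustrative sketch only; field names are assumptions, not the PR's schema.
type WebBenchCategory = "READ" | "CREATE" | "UPDATE" | "DELETE" | "FILE_MANIPULATION";

interface WebBenchTask {
  id: string;
  category: WebBenchCategory;
  startUrl: string;     // page the browser agent starts on
  instruction: string;  // natural-language task given to the agent
}

// Take the first n tasks of a given category, e.g. sampleTasks(tasks, "READ", 25).
function sampleTasks(
  tasks: WebBenchTask[],
  category: WebBenchCategory,
  n: number,
): WebBenchTask[] {
  return tasks.filter((t) => t.category === category).slice(0, n);
}
```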

Confidence score: 4/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured, with proper error handling and licensing compliance, and it follows existing patterns. One minor duplicate configuration issue was found, but it doesn't affect functionality.
  • evals/evals.config.json has a duplicate task entry that should be corrected

65 files reviewed, no comments

