Support for VisualWebArena evaluation in OpenHands (#4773)
Co-authored-by: Xingyao Wang <[email protected]>
Co-authored-by: openhands <[email protected]>
Co-authored-by: Graham Neubig <[email protected]>
4 people authored Jan 23, 2025
1 parent 2ff9ba1 commit aebb583
Showing 16 changed files with 1,063 additions and 19 deletions.
36 changes: 34 additions & 2 deletions .github/workflows/integration-runner.yml
@@ -160,7 +160,6 @@ jobs:
echo "api_key = \"$LLM_API_KEY\"" >> config.toml
echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
echo "temperature = 0.0" >> config.toml
- name: Run integration test evaluation for DelegatorAgent (DeepSeek)
env:
SANDBOX_FORCE_REBUILD_RUNTIME: True
@@ -174,12 +173,42 @@ jobs:
cat $REPORT_FILE_DELEGATOR_DEEPSEEK >> $GITHUB_ENV
echo >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV
# -------------------------------------------------------------
# Run VisualBrowsingAgent tests for DeepSeek, limited to t05 and t06
- name: Wait a little bit (again)
run: sleep 5

- name: Configure config.toml for testing VisualBrowsingAgent (DeepSeek)
env:
LLM_MODEL: "litellm_proxy/deepseek-chat"
LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
MAX_ITERATIONS: 15
run: |
echo "[llm.eval]" > config.toml
echo "model = \"$LLM_MODEL\"" >> config.toml
echo "api_key = \"$LLM_API_KEY\"" >> config.toml
echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
echo "temperature = 0.0" >> config.toml
- name: Run integration test evaluation for VisualBrowsingAgent (DeepSeek)
env:
SANDBOX_FORCE_REBUILD_RUNTIME: True
run: |
poetry run ./evaluation/integration_tests/scripts/run_infer.sh llm.eval HEAD VisualBrowsingAgent '' 15 $N_PROCESSES "t05_simple_browsing,t06_github_pr_browsing.py" 'visualbrowsing_deepseek_run'
# Find and export the visual browsing agent test results
REPORT_FILE_VISUALBROWSING_DEEPSEEK=$(find evaluation/evaluation_outputs/outputs/integration_tests/VisualBrowsingAgent/deepseek*_maxiter_15_N* -name "report.md" -type f | head -n 1)
echo "REPORT_FILE_VISUALBROWSING_DEEPSEEK: $REPORT_FILE_VISUALBROWSING_DEEPSEEK"
echo "INTEGRATION_TEST_REPORT_VISUALBROWSING_DEEPSEEK<<EOF" >> $GITHUB_ENV
cat $REPORT_FILE_VISUALBROWSING_DEEPSEEK >> $GITHUB_ENV
echo >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV
- name: Create archive of evaluation outputs
run: |
TIMESTAMP=$(date +'%y-%m-%d-%H-%M')
cd evaluation/evaluation_outputs/outputs # Change to the outputs directory
tar -czvf ../../../integration_tests_${TIMESTAMP}.tar.gz integration_tests/CodeActAgent/* integration_tests/DelegatorAgent/* # Only include the actual result directories
tar -czvf ../../../integration_tests_${TIMESTAMP}.tar.gz integration_tests/CodeActAgent/* integration_tests/DelegatorAgent/* integration_tests/VisualBrowsingAgent/* # Only include the actual result directories
- name: Upload evaluation results as artifact
uses: actions/upload-artifact@v4
@@ -227,4 +256,7 @@ jobs:
**Integration Tests Report Delegator (DeepSeek)**
${{ env.INTEGRATION_TEST_REPORT_DELEGATOR_DEEPSEEK }}
---
**Integration Tests Report VisualBrowsing (DeepSeek)**
${{ env.INTEGRATION_TEST_REPORT_VISUALBROWSING_DEEPSEEK }}
---
Download testing outputs (includes both Haiku and DeepSeek results): [Download](${{ steps.upload_results_artifact.outputs.artifact-url }})
50 changes: 50 additions & 0 deletions evaluation/benchmarks/visualwebarena/README.md
@@ -0,0 +1,50 @@
# VisualWebArena Evaluation with OpenHands Browsing Agents

This folder contains the evaluation harness for the [VisualWebArena](https://github.com/web-arena-x/visualwebarena) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym), which makes it easy to evaluate how well a browsing-capable agent performs on realistic web browsing tasks.

## Setup Environment and LLM Configuration

Please follow the instructions [here](../../README.md#setup) to set up your local development environment and LLM.

## Setup VisualWebArena Environment

VisualWebArena requires you to set up websites containing pre-populated content that are reachable by URL from the machine running the OpenHands agents.
Follow [this document](https://github.com/web-arena-x/visualwebarena/blob/main/environment_docker/README.md) to set up your own VisualWebArena environment through local servers or AWS EC2 instances.
Take note of the base URL (`$VISUALWEBARENA_BASE_URL`) of the machine where the environment is installed.

## Test if your environment works

Open the VisualWebArena website URLs above in a browser and check that they load correctly.
If you cannot access a website, make sure your firewall allows public access to the corresponding ports on your server; if you are using an AWS machine, check its network security policy (security group rules).
Follow the VisualWebArena environment setup guide carefully, and make sure the URL fields are populated with the correct base URL of your server.
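
If you prefer a scripted check, here is a minimal sketch (the site names and port numbers below are illustrative assumptions for a typical VisualWebArena deployment; substitute the ones you actually configured):

```python
# Quick reachability check for the VisualWebArena sites.
# NOTE: the port assignments below are assumptions; adjust them to match
# your own environment setup.
import os
import urllib.request

base = os.environ['VISUALWEBARENA_BASE_URL'].rstrip('/')
sites = {
    'classifieds': f'{base}:9980',
    'shopping': f'{base}:7770',
    'reddit': f'{base}:9999',
    'wikipedia': f'{base}:8888',
    'homepage': f'{base}:4399',
}

for name, url in sites.items():
    try:
        status = urllib.request.urlopen(url, timeout=10).status
        print(f'{name}: {url} -> HTTP {status}')
    except Exception as exc:  # connection refused, timeout, bad DNS, ...
        print(f'{name}: {url} -> FAILED ({exc})')
```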

## Run Evaluation

```bash
export VISUALWEBARENA_BASE_URL=<YOUR_SERVER_URL_HERE>
export OPENAI_API_KEY="yourkey" # this OpenAI API key is required for some VisualWebArena validators that use LLMs
export OPENAI_BASE_URL="https://api.openai.com/v1/" # base URL for OpenAI model used for VisualWebArena evaluation
bash evaluation/benchmarks/visualwebarena/scripts/run_infer.sh llm.claude HEAD VisualBrowsingAgent
```

Results will be written to `evaluation/evaluation_outputs/outputs/visualwebarena/`.

To calculate the success rate, run:

```sh
poetry run python evaluation/benchmarks/visualwebarena/get_success_rate.py evaluation/evaluation_outputs/outputs/visualwebarena/SOME_AGENT/EXP_NAME/output.jsonl
```
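
Each line of `output.jsonl` is one JSON record per task, and the success-rate script relies on just two fields, which you can inspect directly. A minimal sketch (using the same placeholder path as above):

```python
# Peek at the first record of output.jsonl and print the fields the
# success-rate script uses: the task reward and the accumulated LLM cost.
import json

path = 'evaluation/evaluation_outputs/outputs/visualwebarena/SOME_AGENT/EXP_NAME/output.jsonl'
with open(path) as f:
    record = json.loads(f.readline())

# A negative reward marks a task that did not finish; the script excludes
# such tasks from the per-task cost average.
print('reward:', record['test_result']['reward'])
print('accumulated_cost:', record['metrics']['accumulated_cost'])
```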

## Submit your evaluation results

You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

## VisualBrowsingAgent V1.0 results

Tested with VisualBrowsingAgent V1.0.

VisualWebArena, 910 tasks (high cost; single run, since the task set is fixed), maximum of 15 steps per task. Resolve rates:

- GPT-4o: 26.15%
- Claude-3.5 Sonnet: 25.27%
Empty file.
40 changes: 40 additions & 0 deletions evaluation/benchmarks/visualwebarena/get_success_rate.py
@@ -0,0 +1,40 @@
import argparse
import json

import browsergym.visualwebarena  # noqa F401 register visualwebarena tasks as gym environments
import gymnasium as gym

parser = argparse.ArgumentParser(description='Calculate average reward.')
parser.add_argument('output_path', type=str, help='path to output.jsonl')

args = parser.parse_args()

if __name__ == '__main__':
    # All VisualWebArena tasks registered by BrowserGym (the full benchmark).
    env_ids = [
        id
        for id in gym.envs.registry.keys()
        if id.startswith('browsergym/visualwebarena')
    ]
    total_num = len(env_ids)
    print('Total number of tasks: ', total_num)
    total_reward = 0
    total_cost = 0
    actual_num = 0
    with open(args.output_path, 'r') as f:
        for line in f:
            data = json.loads(line)
            actual_num += 1
            total_cost += data['metrics']['accumulated_cost']
            reward = data['test_result']['reward']
            if reward >= 0:
                total_reward += data['test_result']['reward']
            else:
                # A negative reward marks a task that did not finish;
                # exclude it from the per-task cost average below.
                actual_num -= 1
    # Success rate is averaged over all registered tasks, not only finished ones.
    avg_reward = total_reward / total_num
    print('Total reward: ', total_reward)
    print('Success Rate: ', avg_reward)

    avg_cost = total_cost / actual_num
    print('Avg Cost: ', avg_cost)
    print('Total Cost: ', total_cost)
    print('Actual number of tasks finished: ', actual_num)