Workspace-Bench

Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

📰 News

[May 07, 2025]: The full Version 1.0 datasets have been released (homepage, Hugging Face)!

👋 Overview

Workspace-Bench is a benchmark for evaluating AI agents on workspace tasks with large-scale file dependencies. It is built to study a capability we call Workspace Learning: whether an agent can identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a real worker's workspace.

💫 LeaderBoard

Rubric success rate across agent settings

Rubric pass rates on Workspace-Bench-Lite across multiple combinations of agent harnesses and backbone LLMs. See Details.
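
As a rough illustration of how a rubric pass rate could be aggregated, the sketch below assumes each task yields a list of boolean rubric outcomes. The `rubric_pass_rate` helper, the task IDs, and the data layout are hypothetical and are not the benchmark's evaluation pipeline; the official scoring may weight or aggregate rubrics differently.

```python
from typing import Dict, List


def rubric_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Return the fraction of rubrics passed, pooled over all tasks (micro-average)."""
    total = sum(len(outcomes) for outcomes in results.values())
    if total == 0:
        # No rubrics evaluated: report 0.0 rather than dividing by zero.
        return 0.0
    passed = sum(sum(outcomes) for outcomes in results.values())
    return passed / total


# Hypothetical outcomes for two tasks (illustrative values only, not real results).
example_results = {
    "task_001": [True, True, False, True],
    "task_002": [False, True],
}
rate = rubric_pass_rate(example_results)
print(f"{rate:.3f}")
```

Pooling all rubrics gives tasks with more rubrics more weight; averaging per-task rates instead would weight every task equally. Which convention the leaderboard uses is not stated here.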

💽 Dataset Introduction

Workspace-Bench contains:

5 realistic worker profiles: Operations Manager, Logistics Manager, AI Product Manager, Researcher, and Backend Developer
74 file types across heterogeneous workspace environments
20,476 files, with workspaces scaling up to 20GB
388 tasks, each paired with an explicit file dependency graph
7,399 fine-grained rubrics for evaluation
Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation cost by about 70%
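
To make the idea of an explicit file dependency graph concrete, here is a minimal sketch that models one as a mapping from each file to the files it depends on, then derives an order in which the files could be processed. The file names, the `dependency_order` helper, and the graph schema are assumptions for illustration only; the released graph format may differ.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set


def dependency_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Return files ordered so that every file appears after the files it depends on."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical task graph: the report depends on a spreadsheet and a memo,
# and the spreadsheet in turn depends on a raw export.
task_graph = {
    "report.docx": {"sales.xlsx", "memo.md"},
    "sales.xlsx": {"raw_export.csv"},
    "memo.md": set(),
    "raw_export.csv": set(),
}
order = dependency_order(task_graph)
print(order)
```

The exact order among independent files is not unique; the only guarantee is that dependencies come first. A cyclic graph would raise `graphlib.CycleError`.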

🚀 Quick Start

Coming soon.

We will release the dataset, evaluation pipeline, and example usage instructions for running agents on Workspace-Bench and Workspace-Bench-Lite. The public release will include the necessary task assets, output specifications, and benchmarking scripts.

🔎 Publications

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

@misc{tang2026workspacebench10benchmarkingai,
      title={Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies}, 
      author={Zirui Tang and Xuanhe Zhou and Yumou Liu and Linchun Li and Weizheng Wang and Hongzhang Huang and Jun Zhou and Jiachen Song and Shaoli Yu and Jinqi Wang and Zihang Zhou and Hongyi Zhou and Yuting Lv and Jinyang Li and Jiashuo Liu and Ruoyu Chen and Chunwei Liu and GuoLiang Li and Jihua Kang and Fan Wu},
      year={2026},
      eprint={2605.03596},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.03596}
}
