ARC Whitebox Estimation Challenge — `whestbench`

Library + CLI for the ARC Whitebox Estimation Challenge. Generates random ReLU MLPs, runs FLOP-budgeted estimators against Monte Carlo ground truth, and produces score reports.

For participants

👉 Start at the whest-starterkit. That repo is the on-ramp: it provides a working estimator.py, four worked examples, and stage-by-stage walkthroughs that go from "just iterate locally" to "package a submission".

For an interactive visualization of small random MLPs and estimator behavior, see the WhestBench Explorer — an in-browser companion that's optional but useful for building intuition.

This repo is the underlying engine. You don't need to clone it directly.

For library / CLI users

from whestbench import BaseEstimator, MLP, sample_mlp
import flopscope as flops
import flopscope.numpy as fnp


class MyEstimator(BaseEstimator):
    def predict(self, mlp: MLP, budget: int) -> fnp.ndarray:
        # Placeholder prediction: an all-zero array of shape (depth, width).
        return fnp.zeros((mlp.depth, mlp.width))

CLI entry point (registered as both whest and whestbench):

whest validate --estimator path/to/estimator.py
whest run --estimator path/to/estimator.py --runner local
whest doctor

See docs/reference/cli-reference.md for the full command surface.
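
For intuition about the Monte Carlo ground truth that estimators are scored against, here is a minimal, dependency-free sketch. It does not use whestbench's own code: the helper names (relu_layer, monte_carlo_layer_means), the standard Gaussian inputs, the He-style weight scaling, and the choice of per-neuron mean post-ReLU activation as the target are all assumptions made for illustration. It returns one value per (layer, neuron), matching the (depth, width) shape of the zero estimator above.

```python
import random


def relu_layer(x, w):
    """Apply one layer: y_j = max(0, sum_i x_i * w[i][j])."""
    return [max(0.0, sum(xi * wij for xi, wij in zip(x, col))) for col in zip(*w)]


def monte_carlo_layer_means(weights, n_samples, rng):
    """Estimate the mean post-ReLU activation of every neuron by sampling.

    weights: list of square matrices (lists of rows), one per layer.
    Returns a nested list with depth rows and width columns.
    """
    depth, width = len(weights), len(weights[0])
    totals = [[0.0] * width for _ in range(depth)]
    for _ in range(n_samples):
        # Assumption: inputs are i.i.d. standard Gaussian vectors.
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]
        for layer, w in enumerate(weights):
            x = relu_layer(x, w)
            for j, value in enumerate(x):
                totals[layer][j] += value
    return [[t / n_samples for t in row] for row in totals]


rng = random.Random(0)
depth, width = 3, 4
# Assumption: He-style scaling so activations keep a comparable magnitude.
scale = (2.0 / width) ** 0.5
weights = [
    [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
    for _ in range(depth)
]
truth = monte_carlo_layer_means(weights, n_samples=2000, rng=rng)
print(len(truth), len(truth[0]))  # 3 4
```

Sampling like this is simple but expensive, which is presumably why estimators are given a FLOP budget rather than being allowed to sample freely.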

Datasets

WhestBench evaluations run against datasets (collections of MLPs with ground-truth statistics). You can bake them locally, publish them to the Hugging Face Hub (HF), and pull them back for reproducible scoring. See the datasets guide for the full walkthrough.

# 5-minute path:
whest dataset bake --n-mlps 10 --output ./my-eval
whest run --estimator estimator.py --dataset ./my-eval

For HF-hosted datasets:

whest run --estimator estimator.py \
          --dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup
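
The hf:// form above packs a Hub repository and a version into one string. The sketch below shows one way such a URI could be split. parse_hf_uri and HubDatasetRef are hypothetical names, not part of whestbench, and reading the part after @ as a revision or tag is an assumption based on the example URI.

```python
from typing import NamedTuple, Optional


class HubDatasetRef(NamedTuple):
    """Hypothetical container for the two parts of an hf:// dataset URI."""
    repo_id: str
    revision: Optional[str]


def parse_hf_uri(uri: str) -> HubDatasetRef:
    """Split 'hf://<org>/<name>@<revision>' into repo id and revision."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    # Everything before the first '@' is the repo id; the rest is the revision.
    repo_id, _, revision = uri[len(prefix):].partition("@")
    if not repo_id:
        raise ValueError(f"missing repository id in {uri!r}")
    return HubDatasetRef(repo_id, revision or None)


ref = parse_hf_uri("hf://aicrowd/arc-whestbench-public-2026@v1-warmup")
print(ref.repo_id)   # aicrowd/arc-whestbench-public-2026
print(ref.revision)  # v1-warmup
```

Pinning a revision in the URI means two runs against the same string score against the same data, which is the point of pulling datasets back for reproducible scoring.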

See docs/reference/dataset-format.md for the schema 3.0 specification.

Optional GPU backend

For baking large ground-truth datasets (n_samples ≥ 10⁸), install the torch backend extra:

pip install whestbench[gpu]

Then use whest dataset bake --torch --device auto .... See GPU Dataset Generation for details. For parallel baking across multiple GPUs, see Parallel bake.
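
As a rough guide to why sample counts this large matter: the standard error of a Monte Carlo mean shrinks as sigma / sqrt(n), so each extra decimal digit of accuracy costs about 100 times more samples. The sketch below is generic statistics, not whestbench code, and the unit standard deviation is an assumed value for illustration.

```python
import math


def standard_error(sigma: float, n_samples: int) -> float:
    """Standard error of a sample mean over n_samples i.i.d. draws."""
    return sigma / math.sqrt(n_samples)


# Assumed per-sample standard deviation of 1.0, for illustration only.
for n in (10**4, 10**6, 10**8):
    print(n, standard_error(1.0, n))
```

Under that assumption, 10⁸ samples bring the standard error down to roughly 10⁻⁴.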

Repository layout

src/whestbench/
├── __init__.py            ← public API surface
├── cli.py                 ← `whest`/`whestbench` entry point
├── concurrency.py         ← parallel execution helpers
├── dataset.py             ← evaluation dataset I/O (schema 3.0 bake + load)
├── dataset_io.py          ← Parquet+sidecar on-disk I/O, merge
├── dataset_torch.py       ← GPU/torch backend for dataset baking
├── doctor.py              ← `whest doctor` environment checks
├── domain.py              ← MLP, SetupContext, scoring spec
├── estimators.py          ← BaseEstimator + reference impls (mean/cov/combined)
├── generation.py          ← sample_mlp
├── hardware.py            ← hardware probing
├── hub.py                 ← publish_dataset (HF Hub upload)
├── loader.py              ← estimator module loading
├── packaging.py           ← submission packaging
├── presentation/          ← Rich rendering helpers
├── profiler.py            ← FLOP profiler integration
├── protocol.py            ← Server runner JSON protocol
├── reporting.py           ← Rich score report + smoke panels
├── runner.py              ← local/server runner orchestration
├── scoring.py             ← evaluate_estimator, ContestSpec
├── sdk.py                 ← Python SDK surface
├── simulation.py          ← Monte Carlo ground truth via flopscope
├── subprocess_worker.py   ← isolated estimator subprocess
└── templates/             ← `whest init` + dataset card Jinja2 templates
docs/
├── index.md               ← Library/CLI reference index
├── how-to/                ← Task walkthroughs (publish-to-hf-hub, parallel-bake)
└── reference/             ← cli-reference, dataset-format, estimator-contract, ...

Documentation site

Docs are published to https://aicrowd.github.io/whestbench from website/ (Next.js + Fumadocs). API and CLI reference are autogenerated from the code (scripts/generate_docs.py); participant curriculum is federated at build time from a pinned commit of whest-starterkit.

Local preview: make docs-serve
Full build (+ llms.txt): make docs-build
Coverage gate: make docs-verify
Update starter-kit pin: python scripts/bump_starterkit_pin.py (then make docs-build + commit)

llms.txt / llms-full.txt are generated for agent ingestion and served at the site root.

Releases

Tagged via release-please. See docs/RELEASING.md.

The underlying FLOP accounting library is AIcrowd/flopscope, which replaced the deprecated whest.

License

See LICENSE.
