Library + CLI for the ARC Whitebox Estimation Challenge. Generates random ReLU MLPs, runs FLOP-budgeted estimators against Monte Carlo ground truth, and produces score reports.
👉 Start at the whest-starterkit. That repo is the on-ramp: a working estimator.py, four worked examples, and stage-by-stage walkthroughs from "just iterate locally" to "package a submission".
For an interactive visualization of small random MLPs and estimator behavior, see the WhestBench Explorer — an in-browser companion that's optional but useful for building intuition.
This repo is the underlying engine. You don't need to clone it directly.
from whestbench import BaseEstimator, MLP, sample_mlp
import flopscope as flops
import flopscope.numpy as fnp

class MyEstimator(BaseEstimator):
    def predict(self, mlp: MLP, budget: int) -> fnp.ndarray:
        return fnp.zeros((mlp.depth, mlp.width))

CLI entry point (registered as both whest and whestbench):
whest validate --estimator path/to/estimator.py
whest run --estimator path/to/estimator.py --runner local
whest doctor

See docs/reference/cli-reference.md for the full command surface.
WhestBench evaluations run against datasets (collections of MLPs with ground-truth statistics). You can bake them locally, publish them to the Hugging Face (HF) Hub, and pull them back for reproducible scoring. See the datasets guide for the full walkthrough.
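To build intuition for what "ground-truth statistics" can mean here, the following is a minimal sketch in plain Python, not the library's implementation. It assumes the statistic of interest is the mean post-ReLU activation of each neuron under standard Gaussian inputs, and that weights are i.i.d. Gaussian with He scaling; both are assumptions for illustration. The helper names (`sample_weights`, `forward`, `monte_carlo_means`) are hypothetical and not part of whestbench.

```python
import random


def sample_weights(width, depth, rng):
    """Hypothetical stand-in for a random MLP: `depth` square weight matrices.

    Assumption: i.i.d. Gaussian entries with variance 2 / width (He scaling).
    """
    scale = (2.0 / width) ** 0.5
    return [
        [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]


def forward(weights, x):
    """Return the post-ReLU activations of every layer for one input vector."""
    activations = []
    for layer in weights:
        x = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in layer]
        activations.append(x)
    return activations


def monte_carlo_means(weights, n_samples, rng):
    """Average each neuron's activation over random Gaussian inputs.

    Returns a depth x width nested list.
    """
    depth, width = len(weights), len(weights[0])
    totals = [[0.0] * width for _ in range(depth)]
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]
        for layer_idx, layer_out in enumerate(forward(weights, x)):
            for neuron_idx, value in enumerate(layer_out):
                totals[layer_idx][neuron_idx] += value
    return [[t / n_samples for t in row] for row in totals]


rng = random.Random(0)  # fixed seed so the run is reproducible
weights = sample_weights(width=4, depth=3, rng=rng)
means = monte_carlo_means(weights, n_samples=2000, rng=rng)
for layer_idx, row in enumerate(means):
    print(layer_idx, [round(v, 3) for v in row])
```

The result has one row per layer and one entry per neuron, the same (depth, width) layout that the estimator skeleton above returns from predict.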
# 5-minute path:
whest dataset bake --n-mlps 10 --output ./my-eval
whest run --estimator estimator.py --dataset ./my-eval

For HF-hosted datasets:
whest run --estimator estimator.py \
    --dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup

See docs/reference/dataset-format.md for the schema 3.0 specification.
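The hf:// reference above appears to combine a Hub repository id with a version after the @. As a hedged illustration of how such a string can be taken apart, here is a small standalone parser. The layout hf://<org>/<name>@<revision> is an assumption read off the example, and `parse_hf_uri` and `HubRef` are hypothetical names, not whestbench APIs.

```python
from typing import NamedTuple, Optional


class HubRef(NamedTuple):
    repo_id: str              # e.g. "org/name"
    revision: Optional[str]   # tag or branch after "@", or None if absent


def parse_hf_uri(uri: str) -> HubRef:
    """Split an hf:// dataset reference into repo id and optional revision.

    Assumed layout (not confirmed by the docs): hf://<org>/<name>[@<revision>]
    """
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    repo_id, _, revision = uri[len(prefix):].partition("@")
    if not repo_id:
        raise ValueError(f"missing repository id in {uri!r}")
    return HubRef(repo_id, revision or None)


ref = parse_hf_uri("hf://aicrowd/arc-whestbench-public-2026@v1-warmup")
print(ref.repo_id, ref.revision)
```

Pinning a revision in the reference is what makes a scoring run reproducible: two runs against the same revision see the same MLPs and ground truth.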
For baking large ground-truth datasets (n_samples ≥ 10⁸), install the torch
backend extra:
pip install whestbench[gpu]

Then use whest dataset bake --torch --device auto .... See
GPU Dataset Generation for details.
For parallel baking across multiple GPUs, see
Parallel bake.
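One way to see why sample counts on the order of 10⁸ come up: the error of a Monte Carlo mean shrinks only as 1/√n. The sketch below uses a toy case, a single ReLU unit with a standard normal pre-activation, whose true mean is 1/√(2π) and whose second moment is 1/2. This is an illustrative assumption and not the benchmark's actual statistic; `relu_mean_estimate` and `standard_error` are hypothetical helper names.

```python
import math
import random

# For z ~ N(0, 1): E[max(0, z)] = 1 / sqrt(2*pi) and E[max(0, z)^2] = 1/2.
TRUE_MEAN = 1.0 / math.sqrt(2.0 * math.pi)
TRUE_STD = math.sqrt(0.5 - TRUE_MEAN ** 2)


def relu_mean_estimate(n_samples, rng):
    """Monte Carlo estimate of E[max(0, z)] from n_samples Gaussian draws."""
    total = 0.0
    for _ in range(n_samples):
        total += max(0.0, rng.gauss(0.0, 1.0))
    return total / n_samples


def standard_error(n_samples):
    """Theoretical standard error of the estimate: std / sqrt(n)."""
    return TRUE_STD / math.sqrt(n_samples)


rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (100, 10_000):
    estimate = relu_mean_estimate(n, rng)
    print(n, abs(estimate - TRUE_MEAN), standard_error(n))

# No sampling needed to see what 10**8 samples would buy:
print(standard_error(10 ** 8))
```

Each 100-fold increase in samples buys only one extra decimal digit of accuracy. In this toy case the standard error at n = 10⁸ is about 6e-5, which is why large bakes benefit from a GPU backend.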
src/whestbench/
├── __init__.py ← public API surface
├── cli.py ← `whest`/`whestbench` entry point
├── concurrency.py ← parallel execution helpers
├── dataset.py ← evaluation dataset I/O (schema 3.0 bake + load)
├── dataset_io.py ← Parquet+sidecar on-disk I/O, merge
├── dataset_torch.py ← GPU/torch backend for dataset baking
├── doctor.py ← `whest doctor` environment checks
├── domain.py ← MLP, SetupContext, scoring spec
├── estimators.py ← BaseEstimator + reference impls (mean/cov/combined)
├── generation.py ← sample_mlp
├── hardware.py ← hardware probing
├── hub.py ← publish_dataset (HF Hub upload)
├── loader.py ← estimator module loading
├── packaging.py ← submission packaging
├── presentation/ ← Rich rendering helpers
├── profiler.py ← FLOP profiler integration
├── protocol.py ← Server runner JSON protocol
├── reporting.py ← Rich score report + smoke panels
├── runner.py ← local/server runner orchestration
├── scoring.py ← evaluate_estimator, ContestSpec
├── sdk.py ← Python SDK surface
├── simulation.py ← Monte Carlo ground truth via flopscope
├── subprocess_worker.py ← isolated estimator subprocess
└── templates/ ← `whest init` + dataset card Jinja2 templates
docs/
├── index.md ← Library/CLI reference index
├── how-to/ ← Task walkthroughs (publish-to-hf-hub, parallel-bake)
└── reference/ ← cli-reference, dataset-format, estimator-contract, ...
Docs are published to https://aicrowd.github.io/whestbench from website/
(Next.js + Fumadocs). API and CLI reference are autogenerated from the code
(scripts/generate_docs.py); participant curriculum is federated at build time
from a pinned commit of whest-starterkit.
- Local preview: make docs-serve
- Full build (+ llms.txt): make docs-build
- Coverage gate: make docs-verify
- Update starter-kit pin: python scripts/bump_starterkit_pin.py (then make docs-build + commit)
llms.txt / llms-full.txt are generated for agent ingestion and served at the site root.
Tagged via release-please. See docs/RELEASING.md.
Underlying FLOP accounting library: AIcrowd/flopscope (which replaced the deprecated whest).
See LICENSE.
