SimpleTES

Test-time compute, spent where it actually matters.


[Figure: SimpleTES test-time scaling overview]

SimpleTES (Simple Test-time Evaluation-driven Scaling) scales the propose → evaluate → refine loop for scientific discovery. It combines parallel exploration, feedback-driven refinement, and local selection. Open-source gpt-oss models reach state-of-the-art on 21 problems across six domains, beating both frontier models and tuned optimization pipelines.

Highlight Results

| Domain | Highlights | Artifacts |
| --- | --- | --- |
| Quantum circuit compilation | Routing policies beating strong handcrafted baselines on SABRE / QASMBench | best_results/quantum_circuit_compilation/ |
| GPU kernel optimization | TriMul, batched cumsum, asymmetric matmul kernels | best_results/gpu_kernel_optimization/ |
| Algorithm engineering | LASSO path solver out-performing expert baselines (2× speedup) | best_results/algorithm_engineering/ |
| Mathematics: extremal analysis | New Erdős min-overlap & autocorrelation constructions | best_results/mathematics_extremal_analysis/ |
| Combinatorial construction | SOTA sum-difference, circle packing, Hadamard determinant | best_results/combinatorial_construction/ |
| Data science | Better scaling laws & single-cell RNA denoising | best_results/data_science/ |

How It Works

A fixed evaluator budget is spent across four levers:

| Lever | Knob | Controls |
| --- | --- | --- |
| C | --num-chains | Parallel exploration: try genuinely different directions |
| L | implied by total budget | Feedback-driven refinement depth per chain |
| K | --k-candidates | Local best-of-K: avoid committing weak candidates |
| Φ | --selector | History-to-prompt policy: what past evidence shapes the next attempt |

Each trajectory keeps a history of (candidate, score, metadata). One step: select history, build prompt, ask for K candidates, evaluate in isolated subprocesses, commit the best.
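The step described above can be sketched in a few lines of Python. This is a minimal illustration, not the SimpleTES implementation: `Node`, `step`, `select_history`, `propose`, and `evaluate` are hypothetical names, the toy proposer perturbs a number instead of calling an LLM, and evaluation runs in-process rather than in isolated subprocesses.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    candidate: float                      # stands in for a candidate program
    score: float
    metadata: dict = field(default_factory=dict)


def step(history, select_history, propose, evaluate, k):
    """One step for a single chain: select, propose K, evaluate, commit the best."""
    context = select_history(history)                   # which past evidence shapes the prompt
    candidates = [propose(context) for _ in range(k)]   # ask for K candidates
    scored = [Node(c, evaluate(c), {"context_size": len(context)}) for c in candidates]
    best = max(scored, key=lambda node: node.score)     # local best-of-K
    history.append(best)                                # commit only the winner
    return best


# Toy task (assumption): maximise -(x - 3)^2, so the optimum is x = 3.
def evaluate(x):
    return -(x - 3.0) ** 2


def select_history(history):
    # Toy selector: the two highest-scoring nodes seen so far.
    return sorted(history, key=lambda node: node.score, reverse=True)[:2]


rng = random.Random(0)


def propose(context):
    # Toy proposer: perturb the best node in the selected context.
    return context[0].candidate + rng.uniform(-1.0, 1.0)


history = [Node(0.0, evaluate(0.0))]
for _ in range(20):
    step(history, select_history, propose, evaluate, k=4)

best = max(history, key=lambda node: node.score)
print(f"best candidate {best.candidate:.3f} with score {best.score:.4f}")
```

In this sketch each step costs K evaluations, so a budget of B evaluations split across C chains leaves roughly B / (C·K) steps per chain. Whether --max-generations counts evaluations or steps is not stated here, so treat that arithmetic as an assumption.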

Available selectors (uv run python main.py --list-policies):

| Selector | Style | Best for |
| --- | --- | --- |
| balance (default) | Stratified sampling | Robust default; low-config exploration |
| puct | PUCT scoring | Tree-search flavored selection |
| rpucg | DAG-aware, γ-decay | Paper-style; strongest single selector |
| llm_elite | Bounded elite pool | LLM-managed per-chain population |
| llm_puct / llm_rpucg | Hybrid prefilter + LLM | Best for noisy chains / rich DAG histories |
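For readers unfamiliar with PUCT, the sketch below shows the textbook AlphaZero-style score, which trades a node's mean value against an exploration bonus that shrinks as the node is visited more often. It is an assumption that the puct selector follows this general shape. The function name, the uniform prior, and the toy numbers are illustrative, and `c` only plays the role that --puct-c appears to play.

```python
import math


def puct_score(mean_value, prior, parent_visits, visits, c=1.0):
    """Textbook PUCT: exploitation term plus a visit-discounted exploration bonus."""
    exploration = c * prior * math.sqrt(parent_visits) / (1 + visits)
    return mean_value + exploration


# Toy history: name -> (mean score so far, number of times already expanded).
nodes = {"a": (0.80, 9), "b": (0.70, 0), "c": (0.50, 1)}
parent_visits = sum(visits for _, visits in nodes.values())
prior = 1.0 / len(nodes)  # uniform prior, an assumption of this sketch

scores = {
    name: puct_score(mean, prior, parent_visits, visits)
    for name, (mean, visits) in nodes.items()
}
chosen = max(scores, key=scores.get)
print(chosen)  # "b": the unvisited node receives the largest exploration bonus
```

Raising `c` pushes selection toward rarely expanded nodes; lowering it favours nodes that already score well.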

Quickstart

Python ≥ 3.11. Install with uv:

uv sync
uv sync --extra vllm        # optional: vLLM token-forcing backend

Or install with pip: pip install -e .

Set credentials for any LiteLLM-supported provider:

export GEMINI_API_KEY=...      # or OPENAI_API_KEY / ANTHROPIC_API_KEY / ...

Interactive launcher (discovers tasks, prompts for model / budget / selector, prints or runs the command):

uv run python main_wizard.py

Direct CLI:

uv run python main.py \
  --init-program  datasets/circle_packing/circle_packing_26/init_program.py \
  --evaluator     datasets/circle_packing/circle_packing_26/evaluator.py \
  --instruction   datasets/circle_packing/circle_packing_26/circle_packing_26.txt \
  --model         gemini/gemini-2.0-flash \
  --selector      rpucg \
  --max-generations 50

Resume from an instance directory:

uv run python main.py --resume checkpoints/<date>/instance-<id>

All flags: uv run python main.py --help.

Configuration

📋 Most-tuned flags

| Goal | Flag |
| --- | --- |
| Total search budget | --max-generations |
| More directions | --num-chains |
| Less myopic local picks | --k-candidates |
| Change history-to-prompt strategy | --selector |
| Per-chain in-flight cap (concurrency) | --backpressure-multiplier |
| Split throughput knobs | --gen-concurrency, --eval-concurrency |
| Early stop when score reached | --early-stop-score |
| Inspirations per prompt (or a sampled range) | --num-inspirations / --min-inspirations-cnt / --max-inspirations-cnt |
| Switch LLM backend | --llm-backend litellm (default) or --llm-backend vllm_token_forcing |
| Use a task-local Python env | --eval-venv <path> (auto-detected for datasets/<family>/venv/) |
| Skip the 1-token LLM ping at startup | --skip-preflight |

🎚️ Selector-specific flags

| Selector | Flag |
| --- | --- |
| balance | --exploitation-ratio / --exploration-ratio / --elite-ratio |
| puct | --puct-c |
| rpucg | --rpucg-gamma |
| llm_elite / llm_puct / llm_rpucg | --llm-policy-model / --llm-policy-api-base / --llm-policy-api-key / --llm-policy-pool-size |

💾 Checkpoint & resume

Checkpoints land under --output-path (default checkpoints/) every --log-interval evaluations. Each run gets a <date>/instance-<id>/ directory. Resume with --resume pointed at that instance directory:

uv run python main.py --resume checkpoints/<date>/instance-<id>

--save-llm-io keeps full LLM input/output (large files). --gzip compresses checkpoint nodes.

Build Your Own Task

SimpleTES ships with 13 task families across 6 domains. A new task is three files:

my_family/
  my_task/
    init_program.{py|cpp|rs|...}      # seed; mark the evolved region with EVOLVE-BLOCK
    evaluator.py                      # def evaluate(filepath) -> {"combined_score": ..., ...}
    my_task.txt                       # instruction shown to the model

Drop the directory under datasets/ and main_wizard.py picks it up.

→ Catalogue + design guide: datasets/README.md.
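As an illustration of the evaluator contract in the listing above, here is a minimal evaluator.py sketch. Only the evaluate(filepath) signature and the combined_score key come from that listing. The candidate's solve() entry point, the target value, the extra keys, the scoring rule, and the convention that a higher combined_score is better are hypothetical choices for a toy task; datasets/README.md remains the reference for real tasks.

```python
import importlib.util
import tempfile
from pathlib import Path

TARGET = 42.0  # hypothetical toy objective: the candidate should return a value near 42


def evaluate(filepath):
    """Load the candidate program, run it, and return a score dictionary."""
    try:
        spec = importlib.util.spec_from_file_location("candidate", filepath)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        value = float(module.solve())  # solve() is an assumed entry point
    except Exception as exc:
        # A failing candidate gets the worst score instead of crashing the run.
        return {"combined_score": 0.0, "error": repr(exc)}
    abs_error = abs(value - TARGET)
    return {"combined_score": 1.0 / (1.0 + abs_error), "abs_error": abs_error}


# Quick self-check against a throwaway candidate file.
with tempfile.TemporaryDirectory() as tmp:
    candidate = Path(tmp) / "init_program.py"
    candidate.write_text("def solve():\n    return 40.0\n")
    result = evaluate(str(candidate))

print(result)
```

Returning a low score on failure, rather than raising, keeps one bad candidate from aborting a whole run; whether SimpleTES expects that behaviour is an assumption of this sketch.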

Contributing

Contributions are welcome.

[Report a Bug] | [Suggest a Feature] | [Open a PR] | [Add a Task]

Troubleshooting

The most common issues and how to resolve them are listed below.

| Symptom | Action |
| --- | --- |
| LLM preflight failed | Check provider credentials, model string, and API base. Use --skip-preflight to bypass while the backend is still warming up. |
| No checkpoints found on resume | Pass the instance directory (e.g. checkpoints/2026-05-24/instance-0), not its parent date directory. |
| Task complains about missing files | Run uv run python scripts/prepare_task.py --list to see what's available, then rerun with --task <family> to fetch / build. |
| Evaluations error on imports | Pin the task-local venv with --eval-venv <path>; SimpleTES auto-detects datasets/<family>/venv/ when present. |
| Hitting rate limits | Lower --gen-concurrency, raise --retry, or switch to a lower-latency model. |
| Evaluations time out | Raise --eval-timeout for slow compilers / simulators. The task-level default lives in each evaluator as TIMEOUT_SECONDS and can be overridden per-evaluation via EVALUATOR_TIMEOUT_SECONDS. |
| GPU-kernel tasks (gpumode, kernelbench) hang | Make sure the compiler server is running first; see the family README.md for the launch command. |
| fcntl import error (Windows) | The registry script scripts/evolve_db_registry.py is POSIX-only by design. Other tasks run fine on Windows. |
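The timeout precedence mentioned in the table can be sketched as follows. The source names TIMEOUT_SECONDS and EVALUATOR_TIMEOUT_SECONDS but does not say how the override is delivered; this sketch assumes an environment variable, and `resolve_timeout` is a hypothetical helper rather than part of SimpleTES.

```python
import os

TIMEOUT_SECONDS = 60.0  # task-level default; the value here is illustrative


def resolve_timeout(env=None):
    """Return the per-evaluation override if it is valid, else the task default."""
    env = os.environ if env is None else env
    raw = env.get("EVALUATOR_TIMEOUT_SECONDS")  # assumed to be an environment variable
    if raw is None:
        return TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return TIMEOUT_SECONDS  # ignore malformed overrides
    return value if value > 0 else TIMEOUT_SECONDS


print(resolve_timeout({}))
print(resolve_timeout({"EVALUATOR_TIMEOUT_SECONDS": "300"}))
```

Falling back to the default on malformed or non-positive values is a design choice of the sketch, made so that a bad override cannot silently disable the timeout.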

Community

Join the SimpleTES community to discuss usage, share research progress, and send feedback. Scan the QR code below to join the chat group:

[Image: SimpleTES community chat QR code]

Citation

@article{simpletes2026,
  title   = {Evaluation-driven Scaling for Scientific Discovery},
  author  = {WILL Team},
  journal = {arXiv preprint arXiv:2604.19341},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.19341}
}

License

Released under GNU AGPL-3.0-or-later, © 2026 WILL.

  • Research and local use — allowed.
  • Programs discovered by SimpleTES — not automatically AGPL just because SimpleTES found them.
  • Modifying the framework and distributing it — the derivative framework stays AGPL.
  • Exposing a modified version as a network service — you must provide source under AGPL terms.
