Skip to content

feat: vf v1 <> nano bridge#1576

Draft
mikasenghaas wants to merge 116 commits into
mainfrom
feat/nano-as-v1
Draft

feat: vf v1 <> nano bridge#1576
mikasenghaas wants to merge 116 commits into
mainfrom
feat/nano-as-v1

Conversation

@mikasenghaas

@mikasenghaas mikasenghaas commented Jun 9, 2026

Copy link
Copy Markdown
Member
  • README - high level overview
  • GUIDE - user guide to authoring taskset + harness, cli usage, etc.
  • ARCHITECTURE - explanation of framework internals

mikasenghaas and others added 6 commits June 9, 2026 03:40
…orts

First step of replacing v1 with vf-nano. Deletes verifiers/v1/ wholesale and strips its
surface from verifiers/__init__.py (lazy imports, __all__, TYPE_CHECKING) and
utils/env_utils.py (load_taskset/load_harness + the typed-config/component machinery).
load_environment is now v0-only. Example v1 envs, v1 tests, eval.py v1 path, and docs are
removed in follow-up commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…envs

Removes the 20 v1-native example envs (tau2_bench_v1, hello_*_v1, bfcl_v3, dspy_*, openenv_*,
rlm_swe_v1, sft_replay, mcp_search_env, nemo_gym_env, openai_agents_env, opencode_harbor,
langchain_*, wordle_v1, nested_harness_v1) and their *_v1 siblings; removes the v1 test suite
(test_v1_*, test_eval_cli, test_wordle_v1_env, test_wiki_search_v1, test_mcp_search_env);
strips the v1 flag/branch from the kept v0 envs (reverse_text, alphabet_sort, math_python,
wiki_search). Follow-ups: eval.py/init.py v1 paths, remaining v1 test refs, docs, State v1-contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Vendor vf-nano as a submodule under deps/vf-nano and extend the verifiers package __path__ so
verifiers.nano imports from it; alias verifiers.v1 -> verifiers.nano 1:1 (verifiers.v1.Trace,
.serve.EnvServer, .EnvConfig are the nano objects). Add a v1 extra with nano's runtime + serve
deps. One verifiers package now carries both the v0 API and v1 (=nano).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…verse-text-v1)

Strip the v1 taskset/harness CLI-override path from scripts/eval.py so vf-eval is v0-only;
expose nano's eval as vf-eval-v1 so both run side by side. Bump deps/vf-nano to the
reverse-text-v1 rename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title feat: replace v1 with vf-nano + add v0 legacy bridge feat: v1 v1 <> nano bridge Jun 9, 2026
@mikasenghaas mikasenghaas changed the title feat: v1 v1 <> nano bridge feat: vf v1 <> nano bridge Jun 9, 2026
mikasenghaas and others added 3 commits June 9, 2026 05:00
Remove the v1-only machinery the deleted v1 framework grafted onto State: the _vf_state_contract
contract (+ its guards in every dict method), the runtime/endpoint/tools/runtime-handle method
cluster (get_model/get_client/get_endpoint_config/get_tools/add_tool/_runtime*/strip_runtime_handles),
the for_task borrow/group-state params, and the module-level group-state/borrow helpers. State is
now plain v0: dict semantics + _set_* + stop + timing + finalize + _legacy_for_task. Verified:
State.for_task/stop/finalize and v0 env load work.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…doc)

Remove the vf-init --v1/--openenv/--with-harness scaffolding (templates + flags) now that v1 is
vf-nano; vf-init is v0-only. Delete the v1-specific test functions (test_imports, test_init_script,
test_trajectory_processing) and the v1 harness-authoring doc. Remaining: a docs prose pass
(overview/environments/evaluation/reference/training still mention the old v1 API).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
De-submodule vf-nano and vendor it 1:1 into the repo as the verifiers.v1
subpackage, then drop the legacy v1 packages it replaces.

- Copy vf-nano (latest main) in: package -> verifiers/v1/, plus examples/,
  configs/, packages/{tasksets/harbor, harnesses/{default,rlm}}. Remove the
  deps/vf-nano submodule and the verifiers/__init__ __path__ shim.
- verifiers.v1 is now a real subpackage (drop the verifiers/v1.py alias); the
  v0 -> vf.Trace bridge lives at verifiers.v1.legacy.
- Rename nano -> v1 throughout (code, comments, configs); model names like
  gpt-*-nano / Nemotron-Nano are untouched.
- Delete the old-v1 tasksets/harnesses packages and their tests + publish
  workflows; rework pyproject to source/group the v1 plugins (default-installed),
  drop the old extras/conflicts, and relax the plugins to >=3.10.
- Exclude vendored verifiers/v1 from verifiers' ty gate; restore textarena/nltk
  in dev so the v0 textarena env type-checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title feat: vf v1 <> nano bridge feat: vendor the v1 env library + v0 legacy bridge Jun 9, 2026
…gins

- Scripts: the v1 CLIs are now `eval` / `serve` (was `vf-eval-v1`), matching the
  CLI's own usage strings and the example config headers.
- Move the v1 runtime deps (loguru, tomli-w, renderers) into base `dependencies`
  and drop the `v1` extra, so `import verifiers.v1` always works.
- Shipped plugins are vendored by default (no extras): `tasksets` bundles harbor,
  `harnesses` bundles default + rlm. Each plugin is a top-level package resolved by
  id (`import <id>`); example plugins stay standalone under examples/.
- Flatten core: verifiers/v1/harnesses/base.py -> verifiers/v1/harness.py; drop the
  one-module harnesses/ subpackage.
- Bump prime-tunnel>=0.1.8, prime-sandboxes>=0.2.27 (latest).
- Drop the <3.14 cap from the shipped/example plugin pyprojects.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title feat: vendor the v1 env library + v0 legacy bridge feat: vf v1 <> nano bridge Jun 9, 2026
mikasenghaas and others added 3 commits June 9, 2026 07:24
- Drop the "Run Prime sandbox tests" CI step: its tests lived in the removed
  test_v1_runtime_lifecycle.py, so `pytest -m prime_sandbox` collected nothing
  and exited 5.
- Semgrep job: `uv sync --no-default-groups --group policy` (the plugin groups
  are default + declared incompatible with policy, so the old `--no-dev` still
  pulled them and the resolve conflicted).
- Drop Python 3.10: requires-python >=3.11 (+ classifier, CI matrix). With
  renderers/v1 deps in base and example plugins pulling chromadb -> onnxruntime
  (no 3.10 wheel), 3.10 is no longer supported.
- tests/test_envs.py: remove the obsolete v1 tests (alphabet_sort_v1 /
  test_v1_wrapper_*) and the stale prime-pydantic-config exclude-newer cap that
  conflicted with renderers' required version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The .semgrep/verifiers.yml policy enforced the old hand-authored v1's
conventions: env-authoring rules targeting load_environment(config) shims (the
v1 env API is gone), package rules pointing at the old packages/<x>/<x> layout,
State methods that were removed, and a canonical-shim exclude list of
deleted files — plus typing rules (no Any/Mapping/__future__ annotations) that
contradict the vendored vf-nano code (already excluded from the ty gate).

Remove the policy wholesale: .semgrep/verifiers.yml, the Semgrep CI job, the
`policy` dependency group + its uv conflicts, the pre-commit hook, the now-empty
[tool.ruff] exclude, and the dead nosemgrep waivers. A lint policy for the new
architecture can be written against vf-nano separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Task gains `system_prompt: str | None`. Harness adds the `APPENDS_SYSTEM_PROMPT`
  class var + `resolve_prompt`: harnesses that support it emit the system prompt as a
  real system message (default via program.py; rlm via RLM_APPEND_TO_SYSTEM_PROMPT,
  which rlm appends to its generated prompt); others fold it into the user instruction
  with a warning.
- default harness adds a one-line bash system prompt (before the task's) only when
  `enable_bash`.
- reverse_text_v1 sets `system_prompt` separately so its prompt is byte-identical to
  the v0 env ([system, user]) — the model answers directly instead of leaking <think>.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The renderer client built its tokenizer/renderer pool from the per-request
`model`, which becomes the LoRA adapter name (e.g. `r32-a64.0`) after a weight
update — there is no HF tokenizer published under that name, so rollouts 404'd.

Add `renderer_model_name` to `RendererClientConfig` (pin it to the base model).
The v1 `RendererClient` and the v0 legacy bridge use it for the tokenizer pool
while the per-request `model` still selects the sampling target, so LoRA
sampling keeps routing by the adapter name. Restores parity with the v0
`ClientConfig.renderer_model_name` wiring used on prime-rl main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mikasenghaas and others added 4 commits June 9, 2026 11:21
The openai_chat_completions client now best-effort parses the prompt and
completion token ids and sampling logprobs that vLLM returns (return_token_ids
+ logprobs) into Response.tokens, so MITO training (no renderer) can train on
real on-policy tokens instead of re-tokenizing the messages downstream.

Sampling args still pass straight through; tokens stay None when the provider
returns neither token ids nor logprobs (e.g. eval, or non-vLLM providers).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bridge only kept token ids: it dropped the prompt messages, the response
message (content / reasoning / tool calls), finish_reason, usage, and the task's
system prompt / answer — so a v0-bridged Trace was a near-empty skeleton next to a
native v1 Trace. The cause: v0 RolloutOutput nests these as pydantic objects
(messages, Response) and records finish_reason on response.message, but the mapping
only handled plain dicts and read finish_reason off the response.

Coerce v0 objects to dicts before mapping (_as_dict), read finish_reason/usage from
their v0 locations, mirror tokens onto the response (as the native client does), and
carry the prompt's system_prompt / instruction / answer onto the task. A v0-bridged
Trace now matches the native v1 schema (verified by diffing reverse-text rollouts).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

Rename every taskset under examples/tasksets/ to a `-v1` id (package name,
module, and directory) so they no longer collide with the v0 environments of
the same name (gsm8k, wiki-search, math-env, ...) when both are installed in
one env. reverse-text-v1 was already suffixed; harbor (a bundled taskset with
no v0 counterpart) is left as-is.

- examples/tasksets/<x> -> <x>_v1, module <x>.py -> <x>_v1.py; verify.py /
  server.py / facts.json keep their names (read via __file__, never imported)
- package tasksets: inner package wiki_search/wikispeedia -> *_v1, with their
  self-imports and `-m <pkg>.server` launch paths updated to match
- root pyproject [tool.uv.sources] + examples group, and configs/*.toml
  taskset ids
- refresh uv.lock

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add RetryConfig (attempts / include / exclude) on EnvConfig.retry and retry a whole
rollout with tenacity when it ends with a captured error — parity with v0's
rollout-level retries. Matching is by exception type name; include/exclude name
exception classes (e.g. ModelError, ProgramError). Flags: --retry.attempts /
--retry.include / --retry.exclude. EvalConfig inherits EnvConfig and the env server
runs through Environment.episode, so both eval and training get retries.

Retries are first-class on the Trace: `errors` is the list of per-attempt errors
(oldest first), and `error` is now a computed field returning the most recent — so a
retried-then-failed trace shows every error that led to a retry. Retry utilities live
in verifiers/v1/retries.py.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread verifiers/v1/trace.py
Comment thread verifiers/v1/legacy.py
* feat(v1): per-rollout token limits (EnvConfig.max_{input,output,total}_tokens)

Add framework-enforced token budgets alongside max_turns: max_input_tokens,
max_output_tokens, max_total_tokens on EnvConfig. The interception server checks
them before each turn via a new RolloutLimits bundle (which also subsumes
max_turns), capping the trace's prompt_len / completion_len / total_tokens
computed properties. Reaching any limit refuses the turn and records it as the
stop condition, and is_truncated now treats the token-limit conditions as
truncation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(v1): drop 'like max_turns' from token-limit field docstrings

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(v1): trim limit-check comment in interception

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(v1): ruff format interception

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread verifiers/v1/trace.py Outdated
* fix(v1): reclaim orphaned subprocess workspaces

A rollout's /tmp workspace is removed in `stop()`, but a process killed mid-rollout
(SIGKILL, OOM, hard crash, interrupted teardown) never reaches it, so the workspace
leaks with no way to reclaim it — repeated runs eventually fill /tmp ("No space left
on device" at mkdtemp).

Name each workspace `/tmp/v1-<pid>-*` and, once per process on the first `start()`,
sweep `/tmp/v1-<pid>-*` whose pid is no longer alive. PID-keyed, so a concurrent live
process's workspaces are never touched; graceful per-rollout cleanup (`stop()`) is
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): atexit-based runtime teardown; drop the SIGKILL reaper

Make resource cleanup a backend-agnostic property of `Runtime`:
- a sync `cleanup()` is the teardown source of truth; the public async `stop()` runs it
  off the event loop on the happy path.
- `make_runtime` registers each runtime in a WeakSet and arms one sync `atexit` hook that
  calls `cleanup()` on anything still live — so a Ctrl-C / SIGTERM that cancels the
  rollout's `finally` mid-teardown still frees the workspace / container / sandbox, reusing
  each backend's own cleanup. The hook must be sync: at interpreter shutdown the event loop
  and its thread-pool are gone, so async teardown raises "cannot schedule new futures".

Drop the PID-tagged `reap_orphans` startup sweep. A SIGKILL/OOM runs no in-process code at
all, so reclaiming it needs an external mechanism; prime sandboxes already self-terminate
via their server-side max-lifetime, and the local subprocess/docker cases are out of scope.

Prefix workspaces/containers/scripts with `vf-` (was `v1-`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): delete the prime sandbox in the sync atexit cleanup too

`cleanup()` (the atexit backstop) only stopped the tunnels and left the sandbox — the
costly resource — to its server-side max-lifetime. prime_sandboxes ships a sync
`SandboxClient`, so delete the sandbox synchronously there as well (the async client can't
run once the loop is gone). Idempotent with the async `stop` on the normal path: a second
delete just 404s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: move teardown comments off the statement line (ruff format)

The inline comments pushed two lines past the 88-col limit; moving them above the
statement keeps `ruff format` happy without ruff's awkward auto-wrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): public register/cleanup_at_exit, trim runtime-teardown comments

- rename the module-level helpers to public `register` / `cleanup_at_exit`
- trim the `_LIVE` block comment and drop the inline "no event loop" why-comments
  (the `cleanup` docstring already covers why teardown is sync)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread verifiers/v1/runtimes/prime.py
Comment thread verifiers/v1/runtimes/subprocess.py
Comment thread verifiers/v1/runtimes/prime.py
mikasenghaas and others added 16 commits June 14, 2026 18:35
…1681)

r2e-gym-v1 hardcoded the GAR prefix on every image, which only pulls on
runtimes with GCP credentials (e.g. Prime sandboxes); a local docker runtime
fails with "denied: Unauthenticated request". Add `R2EGymConfig.use_prime_registry`
(default False): images come from the dataset's public Docker Hub `docker_image`
(`namanjain12/<repo>_final:<commit>`) unless opted in to the registry.

Mirrors the scaleswe-v1 change (#1678). All 4578 R2E-Gym-Subset images are
public on Docker Hub, so the default works on any runtime.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…age (#1683)

The availability filter checked each task's resolved `image`. With
`use_prime_registry=true` that's a private Artifact Registry ref, which
`_available_images` can't enumerate anonymously and so keeps unchecked - making
the filter a no-op exactly when images are pulled from the GAR. Tasks missing
from the GAR (e.g. durandtibo_iden_pr53) then still hit IMAGE_PULL_FAILED.

Filter on the dataset's public Docker Hub `image_url` instead, independent of
the resolved registry: the GAR mirrors Docker Hub, so the public tag set is the
canonical (and only anonymously-checkable) availability signal in both modes.
Now drops the 708 missing tags whether or not `use_prime_registry` is set.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Removes the built-in Claude Code harness (added in #1669): deletes
`packages/harnesses/harnesses/claude_code/` and its re-export from the
`harnesses` package `__init__`.

Done as a custom removal rather than `git revert faf7ce1` so the
`RetryingClient.relay_aux` passthrough #1669 also added is kept - it's shared
aux-relay plumbing (the base/eval `relay_aux` and the interception call predate
#1669), and the Anthropic dialect it serves stays in place for Anthropic-native
agents. A straight revert would have dropped it.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1685)

* fix(rlm-harness): install/run without root so the subprocess runtime works

`uv run eval ... --harness.id rlm --harness.runtime.type subprocess` crashed with
`FileNotFoundError: 'rlm'`. The harness forced rlm's installer to `/usr/local/bin`
and prepended an unconditional `apt-get`, both root-only; on a non-root host the
install silently failed and the bare `rlm` exec then raised FileNotFoundError
(the subprocess runtime inherits the host PATH, where rlm wasn't installed).

rlm's install.sh already fetches curl/uv itself (via the runtime's package
manager, guarded) and defaults its install dir to a user-writable path. So:

- Install uv + the rlm CLI into a fixed user-writable dir (`/tmp/vf-rlm/bin`) and
  run the binary by absolute path - no root, no PATH dependency. Works on a
  non-root host and a root container alike.
- Only `apt-get` for git (needed for the pinned checkout) when it's missing, so
  a host that already has git needs no root.
- Check the install result and raise a clean ProgramError on failure, instead of
  letting a missing binary surface as an uncaught FileNotFoundError (matches the
  codex harness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(rlm-harness): flock-serialize install so shared-runtime rollouts don't race

Concurrent rollouts on one runtime (subprocess on the host) all clone/install
into the same /tmp dirs and clobber each other (git 'destination already exists'
/ refs-backend abort). Guard the install with flock: the first installs, the rest
wait and reuse the binary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(codex-harness): install/run without root, pinned to /tmp/vf-codex

Apply the same convention as the rlm harness: install the codex binary into a
user-writable /tmp/vf-codex/bin (not root-only /usr/local/bin) and run it by
absolute path (not a bare `codex` on $PATH), fetch curl only when missing, and
flock-serialize the install so concurrent rollouts sharing one runtime don't
race the download. Makes codex work on the subprocess (non-root host) runtime,
consistent with rlm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(codex-harness): drop redundant install comment

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(rlm-harness): drop redundant install comment

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… region-limited) (#1686)

* test(v1): skip own-runtime prime port-exposure e2e cases (region-limited)

test_task_tools_own_runtime[prime] / test_user_own_runtime[prime] run a tool /
user-sim server in its own prime sandbox, which must publish its port back to the
host via native `expose` — currently region-limited (see PrimeRuntime.public_url),
with no host-localhost fallback for a port inside a remote sandbox. The old
`skip_if_unexposable` only skipped when the trace error contained "port exposure",
so any other failure (e.g. provisioning) hard-failed instead.

Make it an explicit, upfront skip for the prime case (before provisioning), with
a TODO to re-enable once prime supports port exposure in all regions (or the
runtime publishes the port via an in-sandbox tunnel). subprocess/docker are
unaffected (they share the host network).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): refocus prime port-exposure skip on test_multi_turn

The actual failing case is test_multi_turn[*-prime]: its user-sim is colocated
in the agent's prime sandbox and host-reachable, so it must publish its port via
native expose (region-limited) - but unlike the own-runtime tests it had no
skip_if_unexposable guard, so it hard-failed. Add the existing guard to it.

Reverts the previous over-broad change to the own-runtime tests (which already
had the guard) + the conftest fixture rewrite; only adds a TODO to the fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Support images in v1 tool responses

* Add e2e taskset for image tool responses
* Add Harbor task multipliers

* Remove Harbor multiplier tests

* Remove TerminalBench config hint
* fix(v1): extend bash tool timeout

* Increase bash command timeout to 3600 seconds

Increase timeout for bash command execution from 60 minutes to 3600 seconds.
* Use Prime CLI config for v1 eval

* Gate Prime config by inference URL

* Detect Prime inference hosts
* chore: flatten examples/ into a single environments/ section

Move the v1 example tasksets (examples/tasksets/*) and the compact harness
(examples/harnesses/compact) into the flat environments/ directory, alongside
the standalone v0 environments — no more examples/ tree.

- [tool.uv.sources]: paths examples/tasksets/<x> -> environments/<x>,
  examples/harnesses/compact -> environments/compact (package names unchanged)
- eval/serve/validate CLIs: the -h example listing now scans environments/
  (a single flat list, since tasksets/harnesses are no longer split by dir)
- GUIDE/README/loaders doc references updated

Package names, the `examples` dependency-group (a curated default-install set,
referenced by name not path), and default-groups are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): drop local_examples help hint

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread verifiers/v1/GUIDE.md
Comment on lines +280 to +285
uv run eval gsm8k-v1 -n 5 -r 3 \
--max-turns 8 --max-total-tokens 8192 \ # per-rollout budgets
--retries.model.max-retries 3 --retries.runtime.max-retries 3 \ # retry one failed call
--retries.rollout.max-retries 3 --retries.rollout.include ProgramError \ # retry a whole rollout, by error type
--timeout.rollout 600 --timeout.scoring 120 # per-stage wall-clock caps (seconds)
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low v1/GUIDE.md:280

The bash examples on lines 280-285 place inline comments (# per-rollout budgets, etc.) after \ line continuations. In bash, \ must be immediately followed by a newline to continue the line — any trailing space or comment causes a parse error when the command is copy-pasted. Consider moving the comments above each line or removing them from the continuation lines.

 ```bash
 uv run eval gsm8k-v1 -n 5 -r 3 \
-  --max-turns 8 --max-total-tokens 8192 \                          # per-rollout budgets
-  --retries.model.max-retries 3 --retries.runtime.max-retries 3 \  # retry one failed call
-  --retries.rollout.max-retries 3 --retries.rollout.include ProgramError \  # retry a whole rollout, by error type
-  --timeout.rollout 600 --timeout.scoring 120                      # per-stage wall-clock caps (seconds)
+  --max-turns 8 --max-total-tokens 8192 \
+  --retries.model.max-retries 3 --retries.runtime.max-retries 3 \
+  --retries.rollout.max-retries 3 --retries.rollout.include ProgramError \
+  --timeout.rollout 600 --timeout.scoring 120


<details>
<summary>🚀 Reply "<strong>fix it for me</strong>" or copy this <strong>AI Prompt</strong> for your agent:</summary>

```text
In file @verifiers/v1/GUIDE.md around lines 280-285:

The bash examples on lines 280-285 place inline comments (`# per-rollout budgets`, etc.) after `\` line continuations. In bash, `\` must be immediately followed by a newline to continue the line — any trailing space or comment causes a parse error when the command is copy-pasted. Consider moving the comments above each line or removing them from the continuation lines.

…ment (#1698)

* feat(v1): vf-native Toolset/User class surface + per-server runtime placement

Author tool/user servers as classes (no FastMCP, no separate server.py): a
`vf.Toolset` with `@vf.tool` methods + `setup()`, or a `vf.User` with a single
`respond()` hook. `@vf.tool` reuses the existing `mark`/`discover_decorated`
machinery; a generic `verifiers.v1.toolserver` launcher serializes the class,
rebuilds it in a runtime, and serves it over MCP.

Placement (colocated / shared / own runtime) moves onto each server's `config`
(`vf.ToolsetConfig` / `vf.UserConfig`), so different servers can run in different
runtimes. The default is the server's OWN host (subprocess) runtime: it runs where
the eval's deps live and the harness reaches it over the host network (docker
--network host) or a tunnel (prime), so a fresh docker/prime sandbox needs nothing
installed. The redundant taskset-level tools/user config defaults are removed.

Ports all server-bearing examples to the class surface (glossary, wiki_search,
wikispeedia, alphabet_sort, color_codeword, textarena); deepwiki stays on `url`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): config-initialized Toolset/User classes + per-data-kind channels

Reshape the vf-native surface to mirror Taskset/TasksetConfig: a `Toolset`/`User`
is a plain class initialized from its config (`cls(config)`), not a pydantic model
holding fields. The config (`ToolsetConfig`/`UserConfig` subclass) is the
serializable data; the class is behaviour. This removes the pydantic-on-behaviour
awkwardness (per-rollout state is now a plain `self.x`, no `PrivateAttr`).

Each kind of data has its own channel, instead of all living on the object:
  - genuine config (CLI-tunable knobs: placement/runtime, wikispeedia links_only):
    a `ToolsetConfig`/`UserConfig` subclass — serialized to the server.
  - global state (facts corpus, wiki graph): module-level or built in `setup`
    from disk/dataset, server-side — never config.
  - per-task input (wikispeedia source/target, alphabet_sort follow_ups): read off
    the rollout's task in `setup(self, task)` — the framework ships the task.
  - per-rollout mutable state (turns, path, game): plain attrs set in `setup`.

The launcher rebuilds `cls(config)` and calls `setup(task)`; `server_to_tools`
serializes the config + task (refs + JSON). Examples updated accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): single internal launcher; drop raw Tools; config polish

Internals: one `serve(server, task, agent_runtime, for_host)` launcher handles any
vf-native server (Toolset OR User) — colocated or its own runtime, shared or per
rollout, with reachability resolved by consumer (host-driven user vs model-called
tool). `serve_tools`/`serve_shared`/`serve_user` are now thin wrappers over it (an
`AsyncExitStack` for teardown), replacing three near-duplicate implementations.

Surface:
  - Remove the raw `vf.Tools` authoring escape hatch — tools are `vf.Toolset`, users
    are `vf.User`, only. `Tools` becomes a private `_Launch` descriptor. A remote MCP
    endpoint is a `vf.Toolset` with `url` on its config (deepwiki). The dead `headers`
    field is dropped.
  - `name` is a class `ClassVar` (an identity, like `deps`), not a config field — so a
    `--taskset.tools.runtime.type docker` override can't drop the tool prefix.
  - Per-server config registered on the taskset config (`tools` / `user` fields), so
    placement is CLI-tunable (`--taskset.tools.shared false`, `--taskset.user.runtime.type ...`).
  - `setup(self, task)` sets plain public instance attrs (no leading underscores). `@vf.tool`
    no longer takes `priority` (tools are an unordered set).

Fixtures/tests: `echo_multi_v1` → `vf.User`; drop `echo_tool_v1` and the two
own-runtime matrix tests (a bare sandbox can't run an unpublished vf-native server;
that path is covered by the host-side default in the harness x runtime matrix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): render one uv-script per server; drop the -m launcher

Unify the launch path on a single rendered PEP 723 uv-script per vf-native server
(`server_to_tools` → `_render_script`), `uv run` in any runtime — no separate host
`command` path. On a host (subprocess) runtime the script pins `verifiers` + the
taskset package to their local editable checkouts via `[tool.uv.sources]`, so it
runs from the dev tree with no publishing; in a sandbox those resolve from PyPI.
The script is written to a content-addressed path so uv keys one resolved env per
distinct script, shared across rollouts. Removes `verifiers/v1/toolserver.py`, the
`_Launch.command` field, and `sys.executable` plumbing; `_editable_dist` resolves a
top-level module to its (distribution name, editable path).

Also move `UserConfig` to `user.py` next to `User` (it was in `tools.py` only for
import ordering; `tools.py` never used it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): plain PEP 723 header, no [tool.uv.sources]

The rendered server script is now a vanilla uv-script — `# /// script` with a
`dependencies = [...]` header and nothing else. The host/sandbox split moves to how
it's launched (`serve_in_runtime`): on a subprocess (host) runtime it runs with the
eval's own interpreter (deps already installed editable, header ignored, no fetch,
no publishing); in any other runtime it's `uv run`, resolving the header from PyPI.
Drops the `[tool.uv.sources]` editable-path block and `_editable_dist`; restores the
name-only `_server_distribution`. `server_to_tools` no longer takes a runtime type.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* poc(v1): render servers as standalone uv-scripts (vendored runtime, no verifiers)

The rendered server script no longer imports `verifiers` or the taskset package.
Instead `server_to_tools` vendors a dependency-light runtime (`verifiers/v1/_serverkit.py`,
read as source — never imported at serve time) into the script and inlines the
server's own config + class source; it reconstructs `cls(config)` against that
runtime and serves. So a tool/user server ships as a self-contained PEP 723 uv-script
whose only deps are `mcp` + `pydantic` + `uvicorn` + the class's own declared `deps`
— all public PyPI — and `uv run`s in any runtime (incl. a fresh sandbox) with nothing
pre-installed and no publishing. Drops `_server_distribution`/`_ref`.

This requires the server to be self-contained (the boundary contract): it may only
touch the runtime, its config, the task, and its declared deps — no taskset module
globals or sibling imports. Examples updated accordingly:
  - glossary: facts move onto the config (server data, shipped as JSON);
  - wiki_search: the corpus + chroma index build moves into `setup` (deletes corpus.py);
  - wikispeedia: the SNAP article/link load moves into `setup` (stdlib only);
  - color_codeword: the square-rendering helpers move into the user class (deps=["pillow"]);
  - textarena: `latest_feedback` + `OUTCOME_FILE` move onto the user class.

Verified: all six render to valid verifiers-free scripts that serve; glossary (1.0)
and alphabet_sort (user_completed) pass e2e on the default docker harness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): launch tool/user servers via a full-verifiers runtime

Replace the rendered, verifiers-free PEP 723 uv-script (a vendored `_serverkit`
plus the class source inlined via `inspect.getsource`) with a generic launcher:
`python -m verifiers.v1.toolserver` imports the real `Toolset`/`User` class from
its installed env module and serves it over MCP.

- Host (`subprocess`) runtime: run with the eval's own interpreter — `verifiers`
  and the env module are already installed, nothing is fetched.
- Sandbox runtime: upload the env package and `uv pip install` it (pulling
  git-pinned `verifiers`, now declared as an env-package dependency) before
  running the launcher.

This lifts the self-containment contract — servers may freely `import verifiers`,
import siblings, and use module-level globals — and deletes `_serverkit.py` and
the render/inline machinery. The task is reconstructed from its real subclass
(`VF_TASK_CLS`), so taskset-specific fields validate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): pin sandbox verifiers to the launcher commit + ensure git

Point `_VERIFIERS_PIN` at the pushed commit that has the generic launcher, and
install a git client in the sandbox before the git-pinned `verifiers` install
(slim base images lack one). Verified end-to-end: glossary-v1 tool server in a
docker runtime (in-container install of git-pinned verifiers + the env package)
and in modal; alphabet-sort-v1 user simulator on subprocess.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): reach a modal-hosted server via modal's own port forwarding

A host-side harness/framework couldn't reach a tool/user server hosted in a modal
sandbox: modal publishes sandbox ports (not host ones), but the runtime only
implemented `expose` (host -> sandbox via prime_tunnel), so `public_url` fell back
to localhost and the connection failed.

Implement `public_url` on the modal runtime using modal's native forwarding: reserve
a fixed internal service port via `encrypted_ports` at `Sandbox.create` and read its
public URL back from `sandbox.tunnels()`. A new `Runtime.published_port` hook lets a
self-publishing runtime pre-declare that port; `serve` binds it instead of a free port
and the server listens on `0.0.0.0` (MCP_HOST) so the tunnel can forward to it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): relax MCP DNS-rebinding guard for tunnel-hosted servers

FastMCP auto-enables DNS-rebinding protection (allowed_hosts=localhost only) when
created with the default host, so a server reached via a sandbox tunnel host (e.g.
modal's *.modal.host) is rejected with 421 Misdirected Request. When bound to
0.0.0.0 (a self-publishing runtime behind a tunnel), disable the guard — the tunnel
is the trust boundary and the client is ours, not a browser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): upload working-tree verifiers source to sandboxes (drop the git pin)

Instead of installing a git-pinned verifiers in a sandbox, upload the developer's
working-tree verifiers source (its wheel-build inputs) alongside the env package and
`uv pip install` both. The sandbox runs the exact local code, so there's no push, no
pin to bump, and no git client needed in the base image; deps resolve from PyPI off
the uploaded pyproject.

Verified end-to-end from an uncommitted tree: glossary-v1 tool server in docker and
in modal, reward 1.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): tidy vf-native example servers

- Rename the `name` ClassVar to `TOOL_PREFIX` (the model-facing tool prefix), default "".
- Promote fixed server data from config fields / class attrs to module constants
  (glossary FACTS, color COLOR_RGB, wiki-search DATASET, textarena OUTCOME_FILE, the
  vision fixture's PNG_DATA).
- Drop the now-dead `deps` ClassVar (deps come from each env package's pyproject) and
  the redundant placement docstrings on tools/user config fields.
- Fix stale docstrings referencing the removed render path / server.py / colocated default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): reach a prime-hosted server via native port exposure

Unify modal + prime as self-publishing runtimes: share a fixed `_SERVICE_PORT` returned
from `published_port`, so `serve` binds it on 0.0.0.0 and relaxes FastMCP's DNS-rebinding
guard (the public sandbox host would otherwise 421). Prime's `public_url` already exposes
the port via the SDK (`client.expose` -> `ExposedPort.url`); make modal's service port an
internal constant rather than a config knob.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): a shared server's setup gets no task (was silently tasks[0])

A `shared` tool server is built once for the whole eval, but `shared_tools`/`serve_shared`
passed `tasks[0]` into its `setup` — so a shared server that read the task silently set up
from one representative task and served it to every rollout, contradicting the documented
contract (`setup`'s task is "None for a shared server").

Pass `None` instead: `server_to_launch` omits VF_TASK/VF_TASK_CLS when there's no task, the
launcher hands `setup` `None`, and a shared server that touches the task now fails loudly
rather than silently serving one task's data to the whole eval. The shared example
(wiki-search) is unaffected — its setup builds the corpus and never reads the task.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): assert a shared server is launched without a task

Belt-and-suspenders for the shared-server contract: `serve` raises an informative
ValueError if a `shared` server is launched with a task (it must be task-agnostic),
instead of relying on its `setup` happening to fail on None.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): trim the _SERVICE_PORT comment

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): make SERVICE_PORT and TUNNEL_LIMITER public

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): runtime reports is_local; merge expose/public_url; host tunnels caller-side

The two runtime network methods were asymmetric: `expose` (reach a HOST port from inside a
runtime) was host-side and provider-agnostic — the interception pool even faked a throwaway
runtime just to call it — while `public_url` (publish an IN-runtime port) was provider-native.

- `Runtime.is_local` (class attr): subprocess/docker True, modal/prime False.
- Merge the two into one `Runtime.expose(port)` = publish a port running inside this runtime
  (modal `tunnels()`, prime `client.expose`); None when local.
- `host_endpoint(port, is_local)`: a host-side async context manager that reaches a host port
  from inside a runtime — localhost when local, else one `prime_tunnel`. The interception pool,
  rollout, and tool serving call it; the runtime no longer reimplements the tunnel.

The pool reads `runtime_is_local(config)` off the runtime class (no throwaway runtime) and owns
its server + host tunnel on one AsyncExitStack, instead of one redundant tunnel per remote
runtime. Verified e2e: glossary-v1 reward 1.0 on subprocess, docker (harness + tool runtime),
and modal tool runtime; modal/prime-as-harness interception (prime_tunnel) untested — prime down.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): servers self-launch via their module; split setup/setup_task

Drop the generic `toolserver.py` shim — each server module is self-runnable. The
framework launches `python -m <cls.__module__>`; the module's `__main__` (or a package
`__main__.py`) calls `ServerBase.run()`, which rebuilds the server from the environment
(`VF_CONFIG` JSON + `VF_TASK`/`VF_TASK_CLS`, or `cli(config)` for a manual debug run — the
config class is read off the `Toolset[Config]` generic) and serves it. This works in any
runtime: host (ambient), or a sandbox after `_install_in_sandbox` makes the module
importable, reached via `run_background([python, "-m", module])`.

Consolidate the launch internals: move the serve loop onto `ServerBase._serve`, inline the
former `run_mcp_server` (and drop its stale export), and fold `server_to_launch`/`_Launch`
into `serve_in_runtime(server, task, runtime, port)`. Net: `serve_server`, `run_mcp_server`,
`server_to_launch`, `_Launch`, and `toolserver.py` are gone.

Split the setup hook: `setup(self)` (task-agnostic, runs for every server) +
`setup_task(self, task)` (per-rollout, SKIPPED for a shared server). `serve()` warns loudly
if a shared server overrides `setup_task` (its per-task logic would never run). Examples
migrated; wiki-search's corpus build is now `setup` (shared), wikispeedia/textarena split
global vs per-task, the user sims use `setup_task`.

Verified e2e on subprocess: glossary 1.0, alphabet-sort user-sim drives multi-turn
(stop=user_completed), plus flat-module (`-m glossary_v1`) and package (`-m alphabet_sort_v1`
via `__main__.py`) launch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): name the ToolsetConfig placement validator descriptively

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): rename _VERIFIERS_BUILD_INPUTS -> VF_BUILD_INPUTS

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): move tool/user/server code into a verifiers.v1.mcp subpackage

Split the cramped `tools.py` + `user.py` into `verifiers/v1/mcp/`:
- `server.py` — `ServerBase` (the base authoring class + `run`/`_serve`/`setup`/`setup_task`)
- `toolset.py` — `Toolset` + `ToolsetConfig`
- `user.py` — `User` + `UserConfig`
- `launch.py` — host-side launching: `serve`/`serve_tools`/`serve_shared`/`serve_user`/
  `connect_user` + the runtime mechanics (`serve_in_runtime`, `_install_in_sandbox`, …)
- `__init__.py` — re-exports the public surface

No behavior change. Importers updated (`verifiers.v1`, taskset, rollout, env, interception).
The dependency graph is a clean DAG (server ← toolset/user ← launch). Verified e2e on
subprocess: glossary-v1 1.0, alphabet-sort-v1 user-sim 1.0 (stop=user_completed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): drop prime cleanup/stop tunnel loops (self._tunnels was removed)

cleanup()/stop() still iterated self._tunnels after __init__ stopped initializing it (the
prime_tunnel-based expose is gone), which would AttributeError on teardown. Removed the loops.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): the env owns serving (shared tools + interception), injected into rollouts

Shared tool servers and the interception pool are eval-level resources, but each
eval runner stood them up itself: run_eval (in-process) entered both, while the
env-server worker pool only entered the interception pool and never set up shared
tools. So a shared server ran per rollout *with* a task through the env-server path
(the non-rich CLI default and prime-rl's path) - rebuilding an expensive corpus
each rollout, and tripping the shared-vs-task assertion ("shared server was
launched with a task").

Make the Environment own its serving resources in one place:

- Environment.serving(tasks) enters shared_tools + interception_pool and stashes
  them; Environment.episode() injects them into every Rollout at construction.
- Episode.run / Rollout.run / run_with_retry drop their shared_urls/interception
  params - no runner threads them through anymore.
- Both run_eval and EnvServer build episodes inside `async with env.serving(...)`.
  LegacyEnvServer overrides serving() to a nullcontext (v0 runs its own rollouts).

The bug went unnoticed because the e2e suite only exercised run_eval, never the
env-server pool. Add a run_v1_server fixture (run_eval_server, static 1-worker
pool) and test_shared_tools_via_env_server (glossary-v1 tools.shared=True through
the pool) to cover that path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): put the fixture dir on PYTHONPATH so self-launching servers resolve in subprocesses

A self-launching tool/user server runs `python -m <module>` in a fresh subprocess.
That inherits PYTHONPATH but not pytest's in-process `pythonpath` ini, so a fixture
server module (echo_multi_v1, tool_response_image_v1) failed to import there ("No
module named ...") while an installed example package (glossary_v1) resolved fine.
Add a `pytest_configure` that puts tests/v1/fixtures on PYTHONPATH for spawned
subprocesses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): exclude .venv/.git from sandbox source uploads

_tar_source (uploads the env package to a docker/prime sandbox) only skipped
__pycache__, so an env package whose dir contains a .venv would tarball
gigabytes (a .venv is many GB) into an in-memory gzip and stream it over
`docker exec -i cat` - effectively an infinite hang on the first docker/prime
rollout. Skip a denylist of build/VCS/cache dirs (.venv, .git, .pytest_cache,
.mypy_cache, .ruff_cache, node_modules, __pycache__) so only real source ships.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): e2e matrix over server runtime x agent runtime + multimodal VLM

Restructure the v1 e2e tests around the three runtimes a rollout places things
in - the user simulator's, the tool server's, and the agent (harness) runtime:

- test_user (merge of the old test_multi_turn + test_user_sim_placement): a
  vf.User across user_runtime (colocated / own runtime: subprocess/docker/prime)
  x agent_runtime.
- test_tool (merge of test_tool_placement + test_multi_turn_with_tools +
  test_shared_tools_via_env_server): a vf.Toolset across tool_runtime (colocated
  / shared / own runtime) x agent_runtime; the shared case runs through the
  env-server pool (regression guard for serving shared tools once per eval).
- echo_tool_v1 fixture: an echo tool that stamps its output with a token the
  prompt never reveals, so reward 1.0 proves the tool was reachable and ran.
- echo_multi_v1 -> echo_user_sim_v1 (clearer name); drop the now-unused
  harness_supports fixture.
- test_tool_response_image uses a vision model (qwen/qwen3-vl-8b-instruct); the
  default text model has no image route.
- tests/v1/fixtures/pyproject.toml: package the fixtures so a sandbox installs
  just this dir (its own pyproject) instead of climbing to the repo root and
  tarring the whole tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): self-describing parametrize ids for the e2e matrix

Give every fixture param an explicit id so a case reads as a sentence instead of
`[rlm-subprocess]`:
- agent runtime -> `in-<rt>-runtime`; harness -> `<name>-harness`
- user runtime  -> `with-user-colocated` / `with-user-in-<rt>-runtime`
- tool runtime  -> `with-tool-colocated` / `with-tool-shared` / `with-tool-in-<rt>-runtime`

e.g. `test_single_turn[rlm-harness-in-subprocess-runtime]`,
`test_tool[in-docker-runtime-with-tool-shared]`. agent_runtime leads the user/tool
signatures so the agent's runtime reads first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): drop the redundant -runtime suffix from parametrize ids

`in-subprocess-runtime` -> `in-subprocess`, `with-tool-in-docker-runtime` ->
`with-tool-in-docker`, etc. Reads the same, less noise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): clear error when VF_TASK is set without VF_TASK_CLS + ruff format

ServerBase.run() read os.environ["VF_TASK_CLS"] directly, so a VF_TASK without
its paired VF_TASK_CLS raised a bare KeyError. The framework always sets both
together (launch.py), so this only bites a manual/misconfigured launch - raise a
descriptive ValueError instead. Also apply `ruff format` (earlier commits were
format-clean under `ruff check` but not `ruff format`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(v1): ruff format interception/pool.py + runtimes/base.py

Format-only (line wrapping); these were format-clean under `ruff check` but not
`ruff format`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): example envs put each server in its own self-launching servers/<name>.py

Separate server code from taskset code: each env's tool/user server moves out of the taskset
module into <env>/servers/<name>.py, a self-launching module ending with
`if __name__ == "__main__": <Server>.run()` (framework launches `python -m <env>.servers.<name>`).
The taskset module imports the server from .servers and uses it in tools()/user(); shared
constants/data the server needs live in the server module. Flat envs (glossary, deepwiki) become
packages; package envs drop their __main__.py.

- glossary -> servers/facts.py (+ facts.json beside it)
- deepwiki -> servers/deepwiki.py
- alphabet_sort, color_codeword -> servers/user.py
- wiki_search, wikispeedia -> servers/wiki.py (wikispeedia keeps graph.py in the package root)

GUIDE.md "Tools and user simulators" rewritten to the current vf-native surface (vf.Toolset /
vf.User classes, @vf.tool / respond, setup / setup_task, the servers/<name>.py layout, per-server
placement with own-host-runtime default; tools + users can coexist).

Verified: all 6 envs' server classes resolve to <env>.servers.<name>; glossary (tool) reward 1.0
on subprocess + docker; alphabet-sort (user sim) reward 1.0 on subprocess + docker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): bridge a shared host tool to the host when the harness runs remotely

A `shared` tool on a host (subprocess/docker) runtime yielded a plain `http://127.0.0.1:<port>`
URL, because serve_shared called serve() with no agent context so serve() took the `else` branch
(`expose() or local`). That's reachable from a host-network harness but DEAD to a harness in a
prime/modal sandbox — the per-rollout path bridges via host_endpoint, the shared path had no
equivalent and nothing validated it (untested: prime was down).

Thread the harness runtime's locality into the shared path: Environment.shared_tools passes
`runtime_is_local(harness.runtime)` -> serve_shared -> serve(agent_is_local=...), and serve()'s
own-runtime/shared branch is unified to `expose(port) or host_endpoint(port, harness_local)`. So a
shared host tool now gets one host tunnel (reused by every rollout) when the harness is remote,
localhost when it's local, and a remote tool runtime still publishes its own URL.

Verified: shared tool + subprocess harness (env-server path) still reward 1.0. The shared + remote
harness case mirrors the per-rollout bridge but is still untested (prime infra down).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): address review findings — colocated port clash, connect_user mislabel, config MRO

- serve(): a colocated server is reached in-sandbox at localhost, so it now takes a free
  in-sandbox port instead of the runtime's published_port (a fixed SERVICE_PORT). Two colocated
  servers sharing one remote sandbox (a colocated tool + user, or two tools) would otherwise both
  bind SERVICE_PORT and the second's probe would fail. published_port is reserved for actually-
  exposed ports (a for_host server, or a tool in its own remote runtime) — and since only the one
  for_host server per rollout ever exposes, modal's single encrypted SERVICE_PORT suffices.
- connect_user(): an exception from the harness body (thrown back at the yield) was caught with
  connected=True and re-wrapped as "connection lost", misdirecting debugging. Track an in_body
  flag and propagate body exceptions untouched; only genuine transport failures are wrapped.
- ServerBase._config_cls(): walk the MRO so a further subclass that doesn't re-parameterize
  (class B(MyToolset)) inherits its config instead of raising "must parameterize its config".
- Docs: note _free_port()'s accepted TOCTOU window (covered by the retryable probe) and that the
  subprocess API_KEY strip also applies to a task's tool/user server (use its own runtime if it
  needs a key).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: ruff format textarena_v1 under ruff 0.15.17

CI's ruff-action pins no version so it runs the latest (0.15.17), which formats this file
differently than a local 0.15.12 (a blank line + a couple of wraps). Format-only; brings the repo
clean under the CI ruff so the Ruff check passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): one reachability resolver (reachable_url) for serve + interception

The "which URL is reachable from where" logic was open-coded in three places (serve()'s
for_host/colocated/own-or-shared branches, the interception pool, and the per-rollout interception
fallback), all over the same two primitives. Lift it into a single resolver that owns the table:

  reachable_url(service, port, *, consumer)   # service/consumer each a Runtime or HOST
    - same place (colocated, or host->host)        -> localhost
    - service in a sandbox (remote runtime)         -> its own expose() (reachable anywhere)
    - service on the host network, consumer remote  -> a host_endpoint tunnel

`serve` (tools/users), InterceptionPool, and Rollout._serve_interception now all route through it,
so port exposure / tunneling lives in one auditable function. The two primitives (Runtime.expose
out, host_endpoint in) are unchanged. No behavior change — verified across the full non-prime e2e
matrix (server-runtime x agent-runtime, plus interception via every harness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(v1): drop redundant top-level docstrings from env server modules

The servers/<name>.py modules just restated their class + a "self-launching python -m ..." line;
the class docstring + the GUIDE cover it. Removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): move codex to the agentic matrix (it no-ops on the echo single-turn task)

codex is an autonomous coding agent: on `test_single_turn`'s no-op chat echo it often completes its
loop without ever calling the model (0 nodes -> reward 0), flakily (some runs/docker it does reply).
A stricter prompt didn't help and a lighter reward can't match zero output. On a task with a concrete
action it's reliable, so move it from the `harness` fixture (single-turn) to `agentic_harness`
(echo-agentic file write) — verified codex reward 1.0 there (subprocess + docker). rlm/kimi-code still
cover an agent CLI on the simple single-turn task.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): servers bind an OS-assigned free port and report it back (drop _free_port)

_free_port() probed the HOST's 127.0.0.1 for a free port, then handed it to the server — a TOCTOU
race, and outright wrong for a colocated tool in a remote sandbox (host-free != sandbox-free; it
could even draw SERVICE_PORT). Instead the server now binds its own socket: MCP_PORT when the
framework fixed one (a self-publishing runtime's forwarded port), else port 0 — an OS-assigned free
port, guaranteed free in whatever environment the server actually runs in. It writes the bound port
to MCP_PORT_FILE before setup; serve_in_runtime reads it back (and returns it). Same pattern the
interception server already uses (bind 0 + getsockname). _free_port is gone.

Verified: full non-prime e2e matrix (test_user + test_tool, subprocess + docker, every placement,
incl the cross-boundary port readback) — 15 passed under -n auto.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): textarena user must be colocated (OUTCOME_FILE handoff needs a shared workspace)

TextArenaUser writes the game outcome to OUTCOME_FILE via a local open() and game_reward reads it
via runtime.read() on the harness rollout's runtime. That only works if the user shares the
harness's runtime/workdir — but this branch flipped the UserConfig default to colocated=False, so
the user ran in its own workspace and the outcome file was never where scoring looked → reward
always 0.0. Pin colocated=True on the taskset's user config (the docstring already assumed it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): textarena raises in setup if its user isn't colocated

Belt-and-suspenders on top of the colocated=True default: if someone overrides
`--taskset.user.colocated false`, the OUTCOME_FILE handoff silently breaks (reward always 0), so
TextArenaUser.setup now fails loudly with the reason instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
os.rename(f"{tar}.part", tar)
if not (cache / subdir).exists():
with tarfile.open(tar, "r:gz") as t:
t.extractall(cache, filter="data")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low servers/wiki.py:42

tarfile.extractall(filter="data") raises TypeError on Python versions before 3.11.4 (and before 3.10.12), because the filter keyword argument did not exist yet. This crashes setup at runtime on those interpreters. Consider wrapping with a try/except TypeError fallback.

-                    t.extractall(cache, filter="data")
+                    try:
+                        t.extractall(cache, filter="data")
+                    except TypeError:
+                        t.extractall(cache)
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/wikispeedia_v1/wikispeedia_v1/servers/wiki.py around line 42:

`tarfile.extractall(filter="data")` raises `TypeError` on Python versions before 3.11.4 (and before 3.10.12), because the `filter` keyword argument did not exist yet. This crashes `setup` at runtime on those interpreters. Consider wrapping with a `try`/`except TypeError` fallback.

Evidence trail:
- environments/wikispeedia_v1/wikispeedia_v1/servers/wiki.py line 42: `t.extractall(cache, filter="data")` at REVIEWED_COMMIT
- pyproject.toml line 14: `requires-python = ">=3.11,<3.14"` at REVIEWED_COMMIT
- Python 3.11 docs (https://docs.python.org/uk/3.11/library/tarfile.html): 'Нове в версії 3.11.4' (New in version 3.11.4) for extraction filters
- CPython issue #102950 tracking backport to 3.11: https://github.com/python/cpython/issues/102950
- Python docs recommend `hasattr(tarfile, 'data_filter')` for compatibility checking

Comment thread verifiers/v1/mcp/launch.py
mikasenghaas and others added 2 commits June 16, 2026 15:43
…arness.py (#1708)

* chore(v1): resolve plugins via __all__ export, split into taskset.py/harness.py

Replace the per-plugin load_taskset/load_harness hook with an __all__ export.
The loader imports a plugin module, walks its __all__, and finds the single
Taskset/Harness subclass; config and task types are read off that class's
Taskset[TaskT, ConfigT] / Harness[ConfigT] generic (most-derived first, so a
thin wrapper that re-binds the config wins). Zero or >1 exported subclasses
raise an informative error.

Restructure every v1 taskset/harness so __init__.py only re-exports + declares
__all__, with the implementation in taskset.py / harness.py. Single-file envs
become packages (aux scripts move alongside; hatch build switched to a wheel
packages target).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): drop trivial re-export docstrings from plugin __init__.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): rename _plugin_class -> _exported_subclass

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Revert "chore(v1): rename _plugin_class -> _exported_subclass"

This reverts commit c45cdc9. Keep the original `_plugin_class` name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1710)

* feat(v1): let the user simulator open the conversation when a task has no prompt

A task may now omit its prompt: Task.instruction is optional (default None). When
a task carries no prompt and the taskset defines a user simulator, the interception
server seeds the simulator's opening turn — respond("") — into the request before
the first model call, so the model answers a user message rather than an empty
prompt. The existing post-turn loop then drives the remaining turns unchanged.

- Task.instruction: str | Messages | None (default None).
- dialect.extend accepts a None completion (append only the user turn(s)); used to
  seed the opening turn before the model has spoken.
- Interception server seeds the opening turn, guarded to num_turns == 0 so a later
  program request (e.g. after a tool call) never re-seeds.
- DefaultHarness + its program emit no opening user message for a None instruction;
  resolve_prompt allows a None instruction.
- Validation: a None-instruction task needs a user sim (per-rollout ProgramError);
  a user-sim taskset needs a SUPPORTS_USER_SIM harness (Environment check, mirroring
  the existing task-tools check).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(alphabet-sort-v1): demonstrate the user-opens-the-conversation path

Add a `user_initiates` flag (default False). When set, the task carries no prompt
(instruction=None) and the user simulator delivers the initial sort prompt as its
first turn, then the follow-ups — exercising the framework's new opening-turn path.
The simulator becomes a simple queue replay (opening + follow-ups), behaviorally
identical to before when user_initiates is False.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(alphabet-sort-v1): hardcode the user simulator driving the conversation

Drop the `user_initiates` flag: alphabet-sort always has no prompt on the task and
the simulator drives the whole conversation — it opens with the sort prompt, then
injects the follow-ups. The episode's user turns are stored as a single `user_turns`
queue the simulator replays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): cache the opening respond("") so a retried first request can't skip it

The opening seed was gated only on `trace.num_turns == 0`. If the first model call
failed (502) before `add_turn` recorded a turn, the harness's OpenAI SDK retried with
a fresh request — re-entering the still-open gate and calling the user simulator's
`respond("")` again. The simulator's queue had already advanced, so the retry injected
the wrong user message and skipped the opening prompt.

Cache the opening `respond("")` result (messages + done) on the session and reuse it
while no turn has been recorded, so `respond` is invoked exactly once and the opening
turn is seeded identically on every retry. The `num_turns == 0` gate still closes the
seed once the first turn lands (the tool-call-interleaving case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread packages/harnesses/harnesses/default/program.py
mikasenghaas and others added 3 commits June 16, 2026 16:35
…coring (#1711)

* feat(v1): add typed transient Trace.state shared across tool/user servers + scoring

Add `Trace.state`: a typed, mutable per-rollout `State` (StateT) that tool servers
(`@vf.tool`) and the user simulator (`respond`) read+write as `self.state` — synced to
the host's authoritative `trace.state` over the interception server per call — and that
`@reward`/`@metric`/`finalize` read+write directly off the trace. Distinct from
`Trace.info`: `state` is transient runtime scratch (counters, game state, the `done`
end-of-trajectory flag), never persisted to disk or sent over the wire; `info` stays the
free-form persisted artifact dict.

- state.py: `State` (strict, mutable, reserved `done`), `StateT` (defaults to `State`),
  and a `state_cls` generic-arg resolver.
- Trace generic over (TaskT, StateT); the `state` field is `exclude=True`. `info`
  unchanged.
- Taskset[TaskT, ConfigT, StateT]; Toolset[ConfigT, StateT]; User[ConfigT, StateT] — all
  default StateT to `State`, so an env that doesn't customize state adds no generic
  boilerplate.
- ServerBase: `self.state` + per-call pull/push sync (`_with_state`) over a new
  interception `/state` GET/PUT channel, wired into servers via
  `VF_STATE_URL`/`VF_STATE_SECRET`.
- `vf.User.respond` now returns `Messages` (not `(Messages, bool)`); end a trajectory by
  setting `self.state.done = True`. The interception loop checks `trace.state.done` (and
  `RolloutSession.refused` checks it before each model call, so a tool can end it too).
- Migrated user sims (echo / alphabet_sort / color_codeword / textarena); added a
  counter-tool-v1 fixture and e2e state tests (in-process + env-server pool).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): make Task.instruction required again (still explicitly nullable)

A task must now set `instruction` — omitting it errors. `None` stays valid and is
the explicit opt-in for the user-simulator-opens-the-conversation path (a taskset
sets `instruction=None` deliberately rather than inheriting it as a default), so
#1710's interception/harness/validation logic keyed on `instruction is None` is
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): make state.done a built-in stop + sync only stateful servers

- `state.done` end-of-trajectory check moves out of the interception server's
  `RolloutSession.refused` (which assumed the state schema) into a built-in
  `Taskset.done` @vf.stop — refused() runs it generically alongside the taskset's
  own stops, so the transport layer no longer special-cases the signal.
- The per-call state channel is now wired only for servers that use shared state:
  a Toolset that declares a custom `State` subclass, or any User (it drives turns
  and ends via state.done). A stateless toolset (base `State`) skips the wrapper,
  the per-call GET/PUT, and — on a remote runtime — the channel tunnel. Gated by
  `ServerBase._uses_state` (overridden True in User) in both `_with_state` and the
  host-side `serve`.
- Make `STATE_TIMEOUT` public with a docstring; tighten the `Trace.state` docstring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): always sync the state channel (drop _uses_state gating)

Every tool/user server in a rollout syncs `self.state` per call again — the
per-call GET/PUT is localhost-cheap in the common case (subprocess/colocated/docker
on the host), so gating it on whether the server declares custom state wasn't worth
the asymmetry. `done` from a base-`State` tool now works without declaring a State
subclass, and the channel wiring is uniform for tools and users.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): move end-of-trajectory fully into user space (no framework done)

The base `vf.State` is now empty — the framework holds no opinion about its
contents. A taskset that ends a trajectory from state declares its own flag and a
`@vf.stop` over it; the interception server no longer references `state.done`
anywhere (the opening-turn and post-turn loops just rely on `refused()` running the
taskset's stops). The stop reason is the `@stop` method's name, so it's informative
and taskset-controlled.

- state.py: drop the reserved `done` field; State is a blank typed canvas.
- taskset.py: drop the built-in `Taskset.done` stop.
- interception/server.py: drop the opening + post-turn `state.done` checks.
- User sims declare their own state + stop: echo/alphabet_sort/color_codeword use
  `user_finished`, textarena uses `game_over`.
- Docs (GUIDE + docstrings) show the field+@Stop pattern.

Verified: alphabet_sort (opening-turn + instruction=None) ends with stop_condition
'user_finished', reward 1.0; test_user/test_tool_state/test_tool green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(v1): warn on last-write-wins state; 400 on mismatched state PUT

- GUIDE + _with_state docstring: the per-call state sync is a whole-object
  read-modify-write, so concurrent tool calls (several tool_calls in one turn)
  are last-write-wins and can lose each other's writes — keep shared-state
  mutations on the sequential path; taskset + servers must share one State.
- handle_state_put: catch pydantic ValidationError and return 400 with the
  reason (a server pushing a shape that doesn't fit the trace's State type —
  usually a StateT mismatch) instead of an unhandled 500.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): also 400 on malformed-JSON state PUT (not just mismatched shape)

Broaden handle_state_put's except to (ValidationError, ValueError) so a
JSONDecodeError from request.json() surfaces as a clean 400 too, not a 500.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(v1): ruff format interception/server.py

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#1715)

ServerBase._serve only disabled FastMCP's DNS-rebinding protection when the server
bound 0.0.0.0 (a self-publishing modal/prime runtime). A host-bound server
(127.0.0.1, a subprocess/docker tool) reached by a REMOTE harness over a
host_endpoint tunnel then got 421 Misdirected Request — the guard 421s the tunnel's
Host. This failed the test_tool[in-prime-with-tool-{in-subprocess,in-docker,shared}]
CI cells. Relax the guard unconditionally: these servers are reached only by our own
harness over localhost or our tunnels, never a browser.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): add an `init` scaffolding command for new environments

`uv run init <name>` scaffolds a v1 environment package following the
`environments/*_v1` layout: a `pyproject.toml`, a package whose `__init__.py`
re-exports the plugin via `__all__`, and a runnable `taskset.py` (replace
`load_tasks` + the `@reward`). Parsed with pydantic-config like the other v1
commands (`InitConfig`); the v1 sibling of v0's `vf-init`.

- `--add-tool` / `--add-user` / `--add-harness` scaffold a `vf.Toolset` /
  `vf.User` / `vf.Harness` and wire them in. The harness is exported alongside the
  taskset, selectable via `--harness.id <name>` (the loader filters `__all__` by
  base type, so one package can export both).
- `--v0` scaffolds a legacy v0 `load_environment` package (delegates to
  `verifiers.scripts.init`) for backwards compatibility; rejected with `--add-*`.

Registered as the `init` console script; documented in the README quickstart
(alongside `validate`) and the GUIDE authoring section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): scaffold requires-python >=3.11 to match verifiers core

verifiers requires >=3.11,<3.14; the scaffolded env depends on it, so >=3.10 was
inconsistent. Declare >=3.11.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): make the init scaffold a minimal skeleton

Drop the baked-in demo (the WORDS list + the <answer>-regex exact-match reward).
load_tasks and the @reward are now stubs that raise NotImplementedError — no
task-specific data or scoring opinion in the scaffold — matching v0 vf-init's
spirit. Tool/user/harness wiring is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread verifiers/v1/cli/init.py
Comment on lines +46 to +56
def _names(name: str) -> tuple[str, str, str, str]:
"""`(dash, pkg, stem, prefix)` derived from a raw name: the hyphenated id, the importable
package (underscores), the `_v1`-less stem (for tool prefixes), and the CamelCase class
prefix (e.g. `my-task-v1` -> `my-task-v1`, `my_task_v1`, `my_task`, `MyTask`)."""
dash = name.strip().strip("/").replace("_", "-").lower()
pkg = dash.replace("-", "_")
stem = pkg[:-3] if pkg.endswith("_v1") else pkg
prefix = "".join(part[:1].upper() + part[1:] for part in stem.split("_") if part)
if not prefix or not prefix[0].isalpha():
prefix = f"Env{prefix}"
return dash, pkg, stem, prefix

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low cli/init.py:46

When name is whitespace-only (e.g., " ") or only slashes (e.g., "/"), the strip() calls on line 50 return an empty string, so pkg becomes empty. This causes env_dir = Path(config.path) / pkg to resolve to the parent directory itself (e.g., ./environments), and scaffolded files like __init__.py and taskset.py are written there instead of a package subdirectory. Consider validating that pkg is non-empty after processing and raising an error for invalid names.

def _names(name: str) -> tuple[str, str, str, str]:
     """`(dash, pkg, stem, prefix)` derived from a raw name: the hyphenated id, the importable
     package (underscores), the `_v1`-less stem (for tool prefixes), and the CamelCase class
     prefix (e.g. `my-task-v1` -> `my-task-v1`, `my_task_v1`, `my_task`, `MyTask`)."""
     dash = name.strip().strip("/").replace("_", "-").lower()
+    if not dash:
+        raise ValueError(f"invalid environment name: {name!r}")
     pkg = dash.replace("-", "_")
     stem = pkg[:-3] if pkg.endswith("_v1") else pkg
     prefix = "".join(part[:1].upper() + part[1:] for part in stem.split("_") if part)
     if not prefix or not prefix[0].isalpha():
         prefix = f"Env{prefix}"
     return dash, pkg, stem, prefix
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/v1/cli/init.py around lines 46-56:

When `name` is whitespace-only (e.g., `"   "`) or only slashes (e.g., `"/"`), the `strip()` calls on line 50 return an empty string, so `pkg` becomes empty. This causes `env_dir = Path(config.path) / pkg` to resolve to the parent directory itself (e.g., `./environments`), and scaffolded files like `__init__.py` and `taskset.py` are written there instead of a package subdirectory. Consider validating that `pkg` is non-empty after processing and raising an error for invalid names.

Evidence trail:
verifiers/v1/cli/init.py lines 46-56 (_names function), line 289 (env_dir = Path(config.path) / pkg), line 290 (pkg_dir = env_dir / pkg), line 339 (if not config.name: check on raw name, not processed pkg). Python Path behavior: Path('a') / '' resolves to Path('a').

mikasenghaas and others added 4 commits June 16, 2026 18:45
* feat(v1): default the harness runtime to subprocess

The harness `runtime` defaulted to `DockerConfig()`, so every eval/train run
without an explicit `--harness.runtime.type` tried to start a container even
though most tasksets just run a local process. Flip the default to
`SubprocessConfig()` — the common, dependency-free case — and let tasksets that
genuinely need a container opt in via `--harness.runtime.type docker` (or
prime/modal). Tasksets carrying a per-task image or `NEEDS_CONTAINER` already
raise a clear error against the subprocess runtime, so the container-requiring
paths stay guarded.

Tool servers and the user simulator already default to subprocess; this aligns
the harness with them.

- Drop the now-redundant `runtime = { type = "subprocess" }` from the example
  configs (alphabet_sort, textarena, wordle, gsm8k_rlm); docker stays explicit
  where it's required (terminal-bench-2, harbor).
- `validate` keeps its docker default (a model-free gold check often needs the
  task's declared container); reword its docs now that they can't say "like eval".
- Update README/GUIDE quickstart + runtime tables to mark subprocess as default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: wrap runtimes import in harness.py (ruff format)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing (#1717)

* fix v1 train client prefix bridging

* feat(v1): token-based prefix reuse in the message graph

Refine prepare_turn's message-hash prefix by token identity at commit: reuse a stored
prefix node only when its tokens match what this turn rendered (longest common token
prefix of the concatenated prefix vs prompt_ids), forking at the first divergence so
retokenization drift surfaces as a branch instead of silent mis-attribution. Bridge path
keeps the prior verbatim (matches fully, stays linear); eval path has no token ids (falls
back to message hash).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): message-level vs renderer-level branching + leaf->root token invariant

Two branching test cases asserting the graph invariant (leaf->root concat == the engine's
prompt_ids + completion_ids): message-level fork via compaction (hash divergence, tokenless),
and a renderer-level break (prior <think> dropped on re-render) that forks only under the train
client (token ids) and is invisible to the eval relay. Document both + the invariant in
ARCHITECTURE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): drop the branching unit tests (keep the token-reuse impl + ARCHITECTURE design notes)

Remove the message-level / renderer-level branching unit tests and their helpers from
test_graph.py, and the test/validation paragraph from the ARCHITECTURE branching section;
branching is exercised end-to-end instead. The graph token-based reuse + the conceptual
ARCHITECTURE notes (branch types + the leaf->root token invariant) stay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(v1): explain why ToolMessage.name exists (GPT-OSS Harmony + bridge)

Most renderers key a tool result off tool_call_id, but GPT-OSS Harmony renders the function
name (functions.<name>, else functions.unknown → broken token parity). The bridge sharpens it:
it renders only the tail, so the issuing assistant's tool call is in the reused prefix and can't
be recovered from the tail — the dialect recovers the name once from the full prompt and carries
it on ToolMessage so later bridge tails have it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): this PR adds no tests (drop the bridge tests from the diff)

Restore test_graph.py to the merge-base and delete test_train_client.py so the PR carries no
test additions; branching/bridge are exercised end-to-end instead. The bridge + token-reuse
implementation is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(v1): avoid O(context) per-token scans in turn commit

The per-turn token-attribution path ran two O(context) per-token Python
builds every turn — synchronous on the env-server worker event loop, so at
multiplex=128 they serialize across rollouts (head-of-line blocking). At
500+ turns near the context cap this is seconds of blocking per rollout.

- _commit_turn: replace the full `stored` concatenation + per-token `while`
  LCP with a node-wise C-level slice compare (short-circuits at the first
  divergent node) — ~8.6x faster (7.0 -> 0.8ms @128k), no full copy.
- previous_token_ids: nested per-token comprehension -> per-node extend
  (~3.4x faster).
- train client: build the (O(context)) previous-turn ids only after the
  cheap bridge guards pass, so non-bridgeable turns don't pay for it.

Behavior-identical: ruff clean, test_graph.py passes, and a full fix-git
train re-run yields the same graph (1 branch, invariant lossless).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): drop add_turn wrapper for prepare_turn/commit

Remove the `add_turn` convenience wrapper; every caller now uses the
explicit two-step `prepare_turn(trace, prompt).commit(response)` — one
obvious way to build the graph. The v0 legacy bridge and the graph tests
are migrated; docstring references updated.

Also add a graph test for the renderer-level break: two turns with the
same message sequence (identical hashes) but a retokenized prior assistant
turn fork by token identity (2 branches), while matching tokens stay linear
(1 branch) — and each branch's leaf->root concat equals its own
prompt_ids + completion_ids.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: eligotts <78387377+eligotts@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(v1): fix test_legacy run_v1 kwarg (runtime -> agent_runtime)

_eval_config (and every test_e2e caller) takes agent_runtime; test_legacy
passed runtime=, raising TypeError when the e2e tests run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: run v1 tests in a separate -n auto -vv step

Split tests/v1 out of the main test step into its own step run with pytest-xdist
(-n auto) and -vv, and exclude it from the main step (--ignore=tests/v1).
Coverage is appended across the two steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): package counter_tool_v1 fixture for sandbox installs

test_tool_state launches the counter toolset inside a docker/prime runtime via
`python -m counter_tool_v1`, which uploads + installs the fixtures package. The
module was missing from the fixtures pyproject `include`, so the sandbox wheel
omitted it and the tool server died with "No module named counter_tool_v1"
(surfacing as "server did not report its port"). Subprocess was unaffected (host
PYTHONPATH). Add it to `include`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): surface the server log when a sandbox tool server never reports its port

The port-file timeout raised a bare "did not report its port", hiding why the
server died (e.g. an import error in the sandbox venv). Mirror the probe-failure
path and append the server log tail, turning an opaque 180s timeout into an
actionable error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(v1): make log_tail a public module-level helper

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…v1 envs (#1721)

* test(envs): run env smoke tests through the unified eval CLI (v0/v1 dispatch)

The env tests assumed the v0 contract (vf.load_environment + vf-eval + hub
tags/README), so every v1 plugin failed. Dispatch on the env style instead:

- Eval via the `eval` CLI: a `_v1` taskset through `--taskset.id`/`--harness.id
  default`, a v0 env through the legacy `--id` bridge. Capped (-n 1 -r 2
  --max-turns 4 --sampling.max-tokens 512 --rich false) so CI stays quick;
  `-r 2` because a taskset with @group_reward(s) needs >=2 rollouts.
- Load check dispatches too: v0 -> load_environment, v1 taskset -> taskset_class,
  the compact harness -> harness_class.
- Metadata: `tags` + README are a v0 hub convention, so they're only required of
  v0 envs; v1 plugins (the `_v1` examples + `compact`) are exempt.
- Skip what can't run in plain CI: the SWE/container v1 tasksets (r2e_gym_v1,
  scaleswe_v1, swelego_v1, terminal_bench_2_v1); `compact` (a harness, not an
  evaluatable taskset); self_reward (group-only rubric the v0 bridge can't score
  per-rollout).
- Run the test-envs job with `-n auto`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: ruff format test_envs.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(envs): keep test_envs.py v0-only, add tests/v1/test_envs.py for v1

The v0 env smoke tests (vf.load_environment + vf-eval, tags/README metadata) don't
fit v1 plugins, so filter `get_environments()` to v0 envs only — the `_v1` tasksets
and the `compact` harness are excluded.

Add tests/v1/test_envs.py: smoke-eval every `_v1` taskset in environments/ through
the `eval` CLI (--taskset.id <id> --harness.id default) for one short capped rollout
(-n 1 -r 2 --max-turns 4 --sampling.max-tokens 512), and require success. `compact`
is excluded (a harness, not a taskset); the SWE/container tasksets (r2e_gym_v1,
scaleswe_v1, swelego_v1, terminal_bench_2_v1) skip — they need a docker/prime runtime
and are covered by the dedicated v1 e2e tests.

This supersedes the earlier in-place v0/v1 dispatch in test_envs.py (and the
test-envs -n auto change), keeping the v0 test as it was.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants