feat: vf v1 <> nano bridge by mikasenghaas · Pull Request #1576 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-06-09T03:49:05Z

README - high level overview
GUIDE - user guide to authoring taskset + harness, cli usage, etc.
ARCHITECTURE - explanation of framework internals

…orts First step of replacing v1 with vf-nano. Deletes verifiers/v1/ wholesale and strips its surface from verifiers/__init__.py (lazy imports, __all__, TYPE_CHECKING) and utils/env_utils.py (load_taskset/load_harness + the typed-config/component machinery). load_environment is now v0-only. Example v1 envs, v1 tests, eval.py v1 path, and docs are removed in follow-up commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…envs Removes the 20 v1-native example envs (tau2_bench_v1, hello_*_v1, bfcl_v3, dspy_*, openenv_*, rlm_swe_v1, sft_replay, mcp_search_env, nemo_gym_env, openai_agents_env, opencode_harbor, langchain_*, wordle_v1, nested_harness_v1) and their *_v1 siblings; removes the v1 test suite (test_v1_*, test_eval_cli, test_wordle_v1_env, test_wiki_search_v1, test_mcp_search_env); strips the v1 flag/branch from the kept v0 envs (reverse_text, alphabet_sort, math_python, wiki_search). Follow-ups: eval.py/init.py v1 paths, remaining v1 test refs, docs, State v1-contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Vendor vf-nano as a submodule under deps/vf-nano and extend the verifiers package __path__ so verifiers.nano imports from it; alias verifiers.v1 -> verifiers.nano 1:1 (verifiers.v1.Trace, .serve.EnvServer, .EnvConfig are the nano objects). Add a v1 extra with nano's runtime + serve deps. One verifiers package now carries both the v0 API and v1 (=nano). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…verse-text-v1) Strip the v1 taskset/harness CLI-override path from scripts/eval.py so vf-eval is v0-only; expose nano's eval as vf-eval-v1 so both run side by side. Bump deps/vf-nano to the reverse-text-v1 rename. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove the v1-only machinery the deleted v1 framework grafted onto State: the _vf_state_contract contract (+ its guards in every dict method), the runtime/endpoint/tools/runtime-handle method cluster (get_model/get_client/get_endpoint_config/get_tools/add_tool/_runtime*/strip_runtime_handles), the for_task borrow/group-state params, and the module-level group-state/borrow helpers. State is now plain v0: dict semantics + _set_* + stop + timing + finalize + _legacy_for_task. Verified: State.for_task/stop/finalize and v0 env load work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…doc) Remove the vf-init --v1/--openenv/--with-harness scaffolding (templates + flags) now that v1 is vf-nano; vf-init is v0-only. Delete the v1-specific test functions (test_imports, test_init_script, test_trajectory_processing) and the v1 harness-authoring doc. Remaining: a docs prose pass (overview/environments/evaluation/reference/training still mention the old v1 API). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

De-submodule vf-nano and vendor it 1:1 into the repo as the verifiers.v1 subpackage, then drop the legacy v1 packages it replaces. - Copy vf-nano (latest main) in: package -> verifiers/v1/, plus examples/, configs/, packages/{tasksets/harbor, harnesses/{default,rlm}}. Remove the deps/vf-nano submodule and the verifiers/__init__ __path__ shim. - verifiers.v1 is now a real subpackage (drop the verifiers/v1.py alias); the v0 -> vf.Trace bridge lives at verifiers.v1.legacy. - Rename nano -> v1 throughout (code, comments, configs); model names like gpt-*-nano / Nemotron-Nano are untouched. - Delete the old-v1 tasksets/harnesses packages and their tests + publish workflows; rework pyproject to source/group the v1 plugins (default-installed), drop the old extras/conflicts, and relax the plugins to >=3.10. - Exclude vendored verifiers/v1 from verifiers' ty gate; restore textarena/nltk in dev so the v0 textarena env type-checks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…gins - Scripts: the v1 CLIs are now `eval` / `serve` (was `vf-eval-v1`), matching the CLI's own usage strings and the example config headers. - Move the v1 runtime deps (loguru, tomli-w, renderers) into base `dependencies` and drop the `v1` extra, so `import verifiers.v1` always works. - Shipped plugins are vendored by default (no extras): `tasksets` bundles harbor, `harnesses` bundles default + rlm. Each plugin is a top-level package resolved by id (`import <id>`); example plugins stay standalone under examples/. - Flatten core: verifiers/v1/harnesses/base.py -> verifiers/v1/harness.py; drop the one-module harnesses/ subpackage. - Bump prime-tunnel>=0.1.8, prime-sandboxes>=0.2.27 (latest). - Drop the <3.14 cap from the shipped/example plugin pyprojects. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Drop the "Run Prime sandbox tests" CI step: its tests lived in the removed test_v1_runtime_lifecycle.py, so `pytest -m prime_sandbox` collected nothing and exited 5. - Semgrep job: `uv sync --no-default-groups --group policy` (the plugin groups are default + declared incompatible with policy, so the old `--no-dev` still pulled them and the resolve conflicted). - Drop Python 3.10: requires-python >=3.11 (+ classifier, CI matrix). With renderers/v1 deps in base and example plugins pulling chromadb -> onnxruntime (no 3.10 wheel), 3.10 is no longer supported. - tests/test_envs.py: remove the obsolete v1 tests (alphabet_sort_v1 / test_v1_wrapper_*) and the stale prime-pydantic-config exclude-newer cap that conflicted with renderers' required version. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The .semgrep/verifiers.yml policy enforced the old hand-authored v1's conventions: env-authoring rules targeting load_environment(config) shims (the v1 env API is gone), package rules pointing at the old packages/<x>/<x> layout, State methods that were removed, and a canonical-shim exclude list of deleted files — plus typing rules (no Any/Mapping/__future__ annotations) that contradict the vendored vf-nano code (already excluded from the ty gate). Remove the policy wholesale: .semgrep/verifiers.yml, the Semgrep CI job, the `policy` dependency group + its uv conflicts, the pre-commit hook, the now-empty [tool.ruff] exclude, and the dead nosemgrep waivers. A lint policy for the new architecture can be written against vf-nano separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Task gains `system_prompt: str | None`. Harness adds the `APPENDS_SYSTEM_PROMPT` class var + `resolve_prompt`: harnesses that support it emit the system prompt as a real system message (default via program.py; rlm via RLM_APPEND_TO_SYSTEM_PROMPT, which rlm appends to its generated prompt); others fold it into the user instruction with a warning. - default harness adds a one-line bash system prompt (before the task's) only when `enable_bash`. - reverse_text_v1 sets `system_prompt` separately so its prompt is byte-identical to the v0 env ([system, user]) — the model answers directly instead of leaking <think>. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The renderer client built its tokenizer/renderer pool from the per-request `model`, which becomes the LoRA adapter name (e.g. `r32-a64.0`) after a weight update — there is no HF tokenizer published under that name, so rollouts 404'd. Add `renderer_model_name` to `RendererClientConfig` (pin it to the base model). The v1 `RendererClient` and the v0 legacy bridge use it for the tokenizer pool while the per-request `model` still selects the sampling target, so LoRA sampling keeps routing by the adapter name. Restores parity with the v0 `ClientConfig.renderer_model_name` wiring used on prime-rl main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The openai_chat_completions client now best-effort parses the prompt and completion token ids and sampling logprobs that vLLM returns (return_token_ids + logprobs) into Response.tokens, so MITO training (no renderer) can train on real on-policy tokens instead of re-tokenizing the messages downstream. Sampling args still pass straight through; tokens stay None when the provider returns neither token ids nor logprobs (e.g. eval, or non-vLLM providers). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The bridge only kept token ids: it dropped the prompt messages, the response message (content / reasoning / tool calls), finish_reason, usage, and the task's system prompt / answer — so a v0-bridged Trace was a near-empty skeleton next to a native v1 Trace. The cause: v0 RolloutOutput nests these as pydantic objects (messages, Response) and records finish_reason on response.message, but the mapping only handled plain dicts and read finish_reason off the response. Coerce v0 objects to dicts before mapping (_as_dict), read finish_reason/usage from their v0 locations, mirror tokens onto the response (as the native client does), and carry the prompt's system_prompt / instruction / answer onto the task. A v0-bridged Trace now matches the native v1 schema (verified by diffing reverse-text rollouts). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

) Rename every taskset under examples/tasksets/ to a `-v1` id (package name, module, and directory) so they no longer collide with the v0 environments of the same name (gsm8k, wiki-search, math-env, ...) when both are installed in one env. reverse-text-v1 was already suffixed; harbor (a bundled taskset with no v0 counterpart) is left as-is. - examples/tasksets/<x> -> <x>_v1, module <x>.py -> <x>_v1.py; verify.py / server.py / facts.json keep their names (read via __file__, never imported) - package tasksets: inner package wiki_search/wikispeedia -> *_v1, with their self-imports and `-m <pkg>.server` launch paths updated to match - root pyproject [tool.uv.sources] + examples group, and configs/*.toml taskset ids - refresh uv.lock Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add RetryConfig (attempts / include / exclude) on EnvConfig.retry and retry a whole rollout with tenacity when it ends with a captured error — parity with v0's rollout-level retries. Matching is by exception type name; include/exclude name exception classes (e.g. ModelError, ProgramError). Flags: --retry.attempts / --retry.include / --retry.exclude. EvalConfig inherits EnvConfig and the env server runs through Environment.episode, so both eval and training get retries. Retries are first-class on the Trace: `errors` is the list of per-attempt errors (oldest first), and `error` is now a computed field returning the most recent — so a retried-then-failed trace shows every error that led to a retry. Retry utilities live in verifiers/v1/retries.py. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): per-rollout token limits (EnvConfig.max_{input,output,total}_tokens) Add framework-enforced token budgets alongside max_turns: max_input_tokens, max_output_tokens, max_total_tokens on EnvConfig. The interception server checks them before each turn via a new RolloutLimits bundle (which also subsumes max_turns), capping the trace's prompt_len / completion_len / total_tokens computed properties. Reaching any limit refuses the turn and records it as the stop condition, and is_truncated now treats the token-limit conditions as truncation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(v1): drop 'like max_turns' from token-limit field docstrings Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(v1): trim limit-check comment in interception Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(v1): ruff format interception Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): reclaim orphaned subprocess workspaces A rollout's /tmp workspace is removed in `stop()`, but a process killed mid-rollout (SIGKILL, OOM, hard crash, interrupted teardown) never reaches it, so the workspace leaks with no way to reclaim it — repeated runs eventually fill /tmp ("No space left on device" at mkdtemp). Name each workspace `/tmp/v1-<pid>-*` and, once per process on the first `start()`, sweep `/tmp/v1-<pid>-*` whose pid is no longer alive. PID-keyed, so a concurrent live process's workspaces are never touched; graceful per-rollout cleanup (`stop()`) is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): atexit-based runtime teardown; drop the SIGKILL reaper Make resource cleanup a backend-agnostic property of `Runtime`: - a sync `cleanup()` is the teardown source of truth; the public async `stop()` runs it off the event loop on the happy path. - `make_runtime` registers each runtime in a WeakSet and arms one sync `atexit` hook that calls `cleanup()` on anything still live — so a Ctrl-C / SIGTERM that cancels the rollout's `finally` mid-teardown still frees the workspace / container / sandbox, reusing each backend's own cleanup. The hook must be sync: at interpreter shutdown the event loop and its thread-pool are gone, so async teardown raises "cannot schedule new futures". Drop the PID-tagged `reap_orphans` startup sweep. A SIGKILL/OOM runs no in-process code at all, so reclaiming it needs an external mechanism; prime sandboxes already self-terminate via their server-side max-lifetime, and the local subprocess/docker cases are out of scope. Prefix workspaces/containers/scripts with `vf-` (was `v1-`). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): delete the prime sandbox in the sync atexit cleanup too `cleanup()` (the atexit backstop) only stopped the tunnels and left the sandbox — the costly resource — to its server-side max-lifetime. prime_sandboxes ships a sync `SandboxClient`, so delete the sandbox synchronously there as well (the async client can't run once the loop is gone). Idempotent with the async `stop` on the normal path: a second delete just 404s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style: move teardown comments off the statement line (ruff format) The inline comments pushed two lines past the 88-col limit; moving them above the statement keeps `ruff format` happy without ruff's awkward auto-wrap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): public register/cleanup_at_exit, trim runtime-teardown comments - rename the module-level helpers to public `register` / `cleanup_at_exit` - trim the `_LIVE` block comment and drop the inline "no event loop" why-comments (the `cleanup` docstring already covers why teardown is sync) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…1681) r2e-gym-v1 hardcoded the GAR prefix on every image, which only pulls on runtimes with GCP credentials (e.g. Prime sandboxes); a local docker runtime fails with "denied: Unauthenticated request". Add `R2EGymConfig.use_prime_registry` (default False): images come from the dataset's public Docker Hub `docker_image` (`namanjain12/<repo>_final:<commit>`) unless opted in to the registry. Mirrors the scaleswe-v1 change (#1678). All 4578 R2E-Gym-Subset images are public on Docker Hub, so the default works on any runtime. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…age (#1683) The availability filter checked each task's resolved `image`. With `use_prime_registry=true` that's a private Artifact Registry ref, which `_available_images` can't enumerate anonymously and so keeps unchecked - making the filter a no-op exactly when images are pulled from the GAR. Tasks missing from the GAR (e.g. durandtibo_iden_pr53) then still hit IMAGE_PULL_FAILED. Filter on the dataset's public Docker Hub `image_url` instead, independent of the resolved registry: the GAR mirrors Docker Hub, so the public tag set is the canonical (and only anonymously-checkable) availability signal in both modes. Now drops the 708 missing tags whether or not `use_prime_registry` is set. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Removes the built-in Claude Code harness (added in #1669): deletes `packages/harnesses/harnesses/claude_code/` and its re-export from the `harnesses` package `__init__`. Done as a custom removal rather than `git revert faf7ce1` so the `RetryingClient.relay_aux` passthrough #1669 also added is kept - it's shared aux-relay plumbing (the base/eval `relay_aux` and the interception call predate #1669), and the Anthropic dialect it serves stays in place for Anthropic-native agents. A straight revert would have dropped it. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…1685) * fix(rlm-harness): install/run without root so the subprocess runtime works `uv run eval ... --harness.id rlm --harness.runtime.type subprocess` crashed with `FileNotFoundError: 'rlm'`. The harness forced rlm's installer to `/usr/local/bin` and prepended an unconditional `apt-get`, both root-only; on a non-root host the install silently failed and the bare `rlm` exec then raised FileNotFoundError (the subprocess runtime inherits the host PATH, where rlm wasn't installed). rlm's install.sh already fetches curl/uv itself (via the runtime's package manager, guarded) and defaults its install dir to a user-writable path. So: - Install uv + the rlm CLI into a fixed user-writable dir (`/tmp/vf-rlm/bin`) and run the binary by absolute path - no root, no PATH dependency. Works on a non-root host and a root container alike. - Only `apt-get` for git (needed for the pinned checkout) when it's missing, so a host that already has git needs no root. - Check the install result and raise a clean ProgramError on failure, instead of letting a missing binary surface as an uncaught FileNotFoundError (matches the codex harness). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(rlm-harness): flock-serialize install so shared-runtime rollouts don't race Concurrent rollouts on one runtime (subprocess on the host) all clone/install into the same /tmp dirs and clobber each other (git 'destination already exists' / refs-backend abort). Guard the install with flock: the first installs, the rest wait and reuse the binary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(codex-harness): install/run without root, pinned to /tmp/vf-codex Apply the same convention as the rlm harness: install the codex binary into a user-writable /tmp/vf-codex/bin (not root-only /usr/local/bin) and run it by absolute path (not a bare `codex` on $PATH), fetch curl only when missing, and flock-serialize the install so concurrent rollouts sharing one runtime don't race the download. Makes codex work on the subprocess (non-root host) runtime, consistent with rlm. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(codex-harness): drop redundant install comment Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(rlm-harness): drop redundant install comment Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… region-limited) (#1686) * test(v1): skip own-runtime prime port-exposure e2e cases (region-limited) test_task_tools_own_runtime[prime] / test_user_own_runtime[prime] run a tool / user-sim server in its own prime sandbox, which must publish its port back to the host via native `expose` — currently region-limited (see PrimeRuntime.public_url), with no host-localhost fallback for a port inside a remote sandbox. The old `skip_if_unexposable` only skipped when the trace error contained "port exposure", so any other failure (e.g. provisioning) hard-failed instead. Make it an explicit, upfront skip for the prime case (before provisioning), with a TODO to re-enable once prime supports port exposure in all regions (or the runtime publishes the port via an in-sandbox tunnel). subprocess/docker are unaffected (they share the host network). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): refocus prime port-exposure skip on test_multi_turn The actual failing case is test_multi_turn[*-prime]: its user-sim is colocated in the agent's prime sandbox and host-reachable, so it must publish its port via native expose (region-limited) - but unlike the own-runtime tests it had no skip_if_unexposable guard, so it hard-failed. Add the existing guard to it. Reverts the previous over-broad change to the own-runtime tests (which already had the guard) + the conftest fixture rewrite; only adds a TODO to the fixture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Support images in v1 tool responses * Add e2e taskset for image tool responses

* Add Harbor task multipliers * Remove Harbor multiplier tests * Remove TerminalBench config hint

* fix(v1): extend bash tool timeout * Increase bash command timeout to 3600 seconds Increase timeout for bash command execution from 60 minutes to 3600 seconds.

* Use Prime CLI config for v1 eval * Gate Prime config by inference URL * Detect Prime inference hosts

* chore: flatten examples/ into a single environments/ section Move the v1 example tasksets (examples/tasksets/*) and the compact harness (examples/harnesses/compact) into the flat environments/ directory, alongside the standalone v0 environments — no more examples/ tree. - [tool.uv.sources]: paths examples/tasksets/<x> -> environments/<x>, examples/harnesses/compact -> environments/compact (package names unchanged) - eval/serve/validate CLIs: the -h example listing now scans environments/ (a single flat list, since tasksets/harnesses are no longer split by dir) - GUIDE/README/loaders doc references updated Package names, the `examples` dependency-group (a curated default-install set, referenced by name not path), and default-groups are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): drop local_examples help hint Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-16T16:51:07Z

+uv run eval gsm8k-v1 -n 5 -r 3 \
+  --max-turns 8 --max-total-tokens 8192 \                          # per-rollout budgets
+  --retries.model.max-retries 3 --retries.runtime.max-retries 3 \  # retry one failed call
+  --retries.rollout.max-retries 3 --retries.rollout.include ProgramError \  # retry a whole rollout, by error type
+  --timeout.rollout 600 --timeout.scoring 120                      # per-stage wall-clock caps (seconds)
+```


🟢 Low v1/GUIDE.md:280

The bash examples on lines 280-285 place inline comments (# per-rollout budgets, etc.) after \ line continuations. In bash, \ must be immediately followed by a newline to continue the line — any trailing space or comment causes a parse error when the command is copy-pasted. Consider moving the comments above each line or removing them from the continuation lines.

```bash uv run eval gsm8k-v1 -n 5 -r 3 \ - --max-turns 8 --max-total-tokens 8192 \ # per-rollout budgets - --retries.model.max-retries 3 --retries.runtime.max-retries 3 \ # retry one failed call - --retries.rollout.max-retries 3 --retries.rollout.include ProgramError \ # retry a whole rollout, by error type - --timeout.rollout 600 --timeout.scoring 120 # per-stage wall-clock caps (seconds) + --max-turns 8 --max-total-tokens 8192 \ + --retries.model.max-retries 3 --retries.runtime.max-retries 3 \ + --retries.rollout.max-retries 3 --retries.rollout.include ProgramError \ + --timeout.rollout 600 --timeout.scoring 120

<details> <summary>🚀 Reply "<strong>fix it for me</strong>" or copy this <strong>AI Prompt</strong> for your agent:</summary> ```text In file @verifiers/v1/GUIDE.md around lines 280-285: The bash examples on lines 280-285 place inline comments (`# per-rollout budgets`, etc.) after `\` line continuations. In bash, `\` must be immediately followed by a newline to continue the line — any trailing space or comment causes a parse error when the command is copy-pasted. Consider moving the comments above each line or removing them from the continuation lines.

…ment (#1698) * feat(v1): vf-native Toolset/User class surface + per-server runtime placement Author tool/user servers as classes (no FastMCP, no separate server.py): a `vf.Toolset` with `@vf.tool` methods + `setup()`, or a `vf.User` with a single `respond()` hook. `@vf.tool` reuses the existing `mark`/`discover_decorated` machinery; a generic `verifiers.v1.toolserver` launcher serializes the class, rebuilds it in a runtime, and serves it over MCP. Placement (colocated / shared / own runtime) moves onto each server's `config` (`vf.ToolsetConfig` / `vf.UserConfig`), so different servers can run in different runtimes. The default is the server's OWN host (subprocess) runtime: it runs where the eval's deps live and the harness reaches it over the host network (docker --network host) or a tunnel (prime), so a fresh docker/prime sandbox needs nothing installed. The redundant taskset-level tools/user config defaults are removed. Ports all server-bearing examples to the class surface (glossary, wiki_search, wikispeedia, alphabet_sort, color_codeword, textarena); deepwiki stays on `url`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): config-initialized Toolset/User classes + per-data-kind channels Reshape the vf-native surface to mirror Taskset/TasksetConfig: a `Toolset`/`User` is a plain class initialized from its config (`cls(config)`), not a pydantic model holding fields. The config (`ToolsetConfig`/`UserConfig` subclass) is the serializable data; the class is behaviour. This removes the pydantic-on-behaviour awkwardness (per-rollout state is now a plain `self.x`, no `PrivateAttr`). Each kind of data has its own channel, instead of all living on the object: - genuine config (CLI-tunable knobs: placement/runtime, wikispeedia links_only): a `ToolsetConfig`/`UserConfig` subclass — serialized to the server. - global state (facts corpus, wiki graph): module-level or built in `setup` from disk/dataset, server-side — never config. - per-task input (wikispeedia source/target, alphabet_sort follow_ups): read off the rollout's task in `setup(self, task)` — the framework ships the task. - per-rollout mutable state (turns, path, game): plain attrs set in `setup`. The launcher rebuilds `cls(config)` and calls `setup(task)`; `server_to_tools` serializes the config + task (refs + JSON). Examples updated accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): single internal launcher; drop raw Tools; config polish Internals: one `serve(server, task, agent_runtime, for_host)` launcher handles any vf-native server (Toolset OR User) — colocated or its own runtime, shared or per rollout, with reachability resolved by consumer (host-driven user vs model-called tool). `serve_tools`/`serve_shared`/`serve_user` are now thin wrappers over it (an `AsyncExitStack` for teardown), replacing three near-duplicate implementations. Surface: - Remove the raw `vf.Tools` authoring escape hatch — tools are `vf.Toolset`, users are `vf.User`, only. `Tools` becomes a private `_Launch` descriptor. A remote MCP endpoint is a `vf.Toolset` with `url` on its config (deepwiki). The dead `headers` field is dropped. - `name` is a class `ClassVar` (an identity, like `deps`), not a config field — so a `--taskset.tools.runtime.type docker` override can't drop the tool prefix. - Per-server config registered on the taskset config (`tools` / `user` fields), so placement is CLI-tunable (`--taskset.tools.shared false`, `--taskset.user.runtime.type ...`). - `setup(self, task)` sets plain public instance attrs (no leading underscores). `@vf.tool` no longer takes `priority` (tools are an unordered set). Fixtures/tests: `echo_multi_v1` → `vf.User`; drop `echo_tool_v1` and the two own-runtime matrix tests (a bare sandbox can't run an unpublished vf-native server; that path is covered by the host-side default in the harness x runtime matrix). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): render one uv-script per server; drop the -m launcher Unify the launch path on a single rendered PEP 723 uv-script per vf-native server (`server_to_tools` → `_render_script`), `uv run` in any runtime — no separate host `command` path. On a host (subprocess) runtime the script pins `verifiers` + the taskset package to their local editable checkouts via `[tool.uv.sources]`, so it runs from the dev tree with no publishing; in a sandbox those resolve from PyPI. The script is written to a content-addressed path so uv keys one resolved env per distinct script, shared across rollouts. Removes `verifiers/v1/toolserver.py`, the `_Launch.command` field, and `sys.executable` plumbing; `_editable_dist` resolves a top-level module to its (distribution name, editable path). Also move `UserConfig` to `user.py` next to `User` (it was in `tools.py` only for import ordering; `tools.py` never used it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): plain PEP 723 header, no [tool.uv.sources] The rendered server script is now a vanilla uv-script — `# /// script` with a `dependencies = [...]` header and nothing else. The host/sandbox split moves to how it's launched (`serve_in_runtime`): on a subprocess (host) runtime it runs with the eval's own interpreter (deps already installed editable, header ignored, no fetch, no publishing); in any other runtime it's `uv run`, resolving the header from PyPI. Drops the `[tool.uv.sources]` editable-path block and `_editable_dist`; restores the name-only `_server_distribution`. `server_to_tools` no longer takes a runtime type. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * poc(v1): render servers as standalone uv-scripts (vendored runtime, no verifiers) The rendered server script no longer imports `verifiers` or the taskset package. Instead `server_to_tools` vendors a dependency-light runtime (`verifiers/v1/_serverkit.py`, read as source — never imported at serve time) into the script and inlines the server's own config + class source; it reconstructs `cls(config)` against that runtime and serves. So a tool/user server ships as a self-contained PEP 723 uv-script whose only deps are `mcp` + `pydantic` + `uvicorn` + the class's own declared `deps` — all public PyPI — and `uv run`s in any runtime (incl. a fresh sandbox) with nothing pre-installed and no publishing. Drops `_server_distribution`/`_ref`. This requires the server to be self-contained (the boundary contract): it may only touch the runtime, its config, the task, and its declared deps — no taskset module globals or sibling imports. Examples updated accordingly: - glossary: facts move onto the config (server data, shipped as JSON); - wiki_search: the corpus + chroma index build moves into `setup` (deletes corpus.py); - wikispeedia: the SNAP article/link load moves into `setup` (stdlib only); - color_codeword: the square-rendering helpers move into the user class (deps=["pillow"]); - textarena: `latest_feedback` + `OUTCOME_FILE` move onto the user class. Verified: all six render to valid verifiers-free scripts that serve; glossary (1.0) and alphabet_sort (user_completed) pass e2e on the default docker harness. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(v1): launch tool/user servers via a full-verifiers runtime Replace the rendered, verifiers-free PEP 723 uv-script (a vendored `_serverkit` plus the class source inlined via `inspect.getsource`) with a generic launcher: `python -m verifiers.v1.toolserver` imports the real `Toolset`/`User` class from its installed env module and serves it over MCP. - Host (`subprocess`) runtime: run with the eval's own interpreter — `verifiers` and the env module are already installed, nothing is fetched. - Sandbox runtime: upload the env package and `uv pip install` it (pulling git-pinned `verifiers`, now declared as an env-package dependency) before running the launcher. This lifts the self-containment contract — servers may freely `import verifiers`, import siblings, and use module-level globals — and deletes `_serverkit.py` and the render/inline machinery. The task is reconstructed from its real subclass (`VF_TASK_CLS`), so taskset-specific fields validate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): pin sandbox verifiers to the launcher commit + ensure git Point `_VERIFIERS_PIN` at the pushed commit that has the generic launcher, and install a git client in the sandbox before the git-pinned `verifiers` install (slim base images lack one). Verified end-to-end: glossary-v1 tool server in a docker runtime (in-container install of git-pinned verifiers + the env package) and in modal; alphabet-sort-v1 user simulator on subprocess. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(v1): reach a modal-hosted server via modal's own port forwarding A host-side harness/framework couldn't reach a tool/user server hosted in a modal sandbox: modal publishes sandbox ports (not host ones), but the runtime only implemented `expose` (host -> sandbox via prime_tunnel), so `public_url` fell back to localhost and the connection failed. Implement `public_url` on the modal runtime using modal's native forwarding: reserve a fixed internal service port via `encrypted_ports` at `Sandbox.create` and read its public URL back from `sandbox.tunnels()`. A new `Runtime.published_port` hook lets a self-publishing runtime pre-declare that port; `serve` binds it instead of a free port and the server listens on `0.0.0.0` (MCP_HOST) so the tunnel can forward to it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): relax MCP DNS-rebinding guard for tunnel-hosted servers FastMCP auto-enables DNS-rebinding protection (allowed_hosts=localhost only) when created with the default host, so a server reached via a sandbox tunnel host (e.g. modal's *.modal.host) is rejected with 421 Misdirected Request. When bound to 0.0.0.0 (a self-publishing runtime behind a tunnel), disable the guard — the tunnel is the trust boundary and the client is ours, not a browser. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(v1): upload working-tree verifiers source to sandboxes (drop the git pin) Instead of installing a git-pinned verifiers in a sandbox, upload the developer's working-tree verifiers source (its wheel-build inputs) alongside the env package and `uv pip install` both. The sandbox runs the exact local code, so there's no push, no pin to bump, and no git client needed in the base image; deps resolve from PyPI off the uploaded pyproject. Verified end-to-end from an uncommitted tree: glossary-v1 tool server in docker and in modal, reward 1.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): tidy vf-native example servers - Rename the `name` ClassVar to `TOOL_PREFIX` (the model-facing tool prefix), default "". - Promote fixed server data from config fields / class attrs to module constants (glossary FACTS, color COLOR_RGB, wiki-search DATASET, textarena OUTCOME_FILE, the vision fixture's PNG_DATA). - Drop the now-dead `deps` ClassVar (deps come from each env package's pyproject) and the redundant placement docstrings on tools/user config fields. - Fix stale docstrings referencing the removed render path / server.py / colocated default. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(v1): reach a prime-hosted server via native port exposure Unify modal + prime as self-publishing runtimes: share a fixed `_SERVICE_PORT` returned from `published_port`, so `serve` binds it on 0.0.0.0 and relaxes FastMCP's DNS-rebinding guard (the public sandbox host would otherwise 421). Prime's `public_url` already exposes the port via the SDK (`client.expose` -> `ExposedPort.url`); make modal's service port an internal constant rather than a config knob. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): a shared server's setup gets no task (was silently tasks[0]) A `shared` tool server is built once for the whole eval, but `shared_tools`/`serve_shared` passed `tasks[0]` into its `setup` — so a shared server that read the task silently set up from one representative task and served it to every rollout, contradicting the documented contract (`setup`'s task is "None for a shared server"). Pass `None` instead: `server_to_launch` omits VF_TASK/VF_TASK_CLS when there's no task, the launcher hands `setup` `None`, and a shared server that touches the task now fails loudly rather than silently serving one task's data to the whole eval. The shared example (wiki-search) is unaffected — its setup builds the corpus and never reads the task. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): assert a shared server is launched without a task Belt-and-suspenders for the shared-server contract: `serve` raises an informative ValueError if a `shared` server is launched with a task (it must be task-agnostic), instead of relying on its `setup` happening to fail on None. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): trim the _SERVICE_PORT comment Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): make SERVICE_PORT and TUNNEL_LIMITER public Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): runtime reports is_local; merge expose/public_url; host tunnels caller-side The two runtime network methods were asymmetric: `expose` (reach a HOST port from inside a runtime) was host-side and provider-agnostic — the interception pool even faked a throwaway runtime just to call it — while `public_url` (publish an IN-runtime port) was provider-native. - `Runtime.is_local` (class attr): subprocess/docker True, modal/prime False. - Merge the two into one `Runtime.expose(port)` = publish a port running inside this runtime (modal `tunnels()`, prime `client.expose`); None when local. - `host_endpoint(port, is_local)`: a host-side async context manager that reaches a host port from inside a runtime — localhost when local, else one `prime_tunnel`. The interception pool, rollout, and tool serving call it; the runtime no longer reimplements the tunnel. The pool reads `runtime_is_local(config)` off the runtime class (no throwaway runtime) and owns its server + host tunnel on one AsyncExitStack, instead of one redundant tunnel per remote runtime. Verified e2e: glossary-v1 reward 1.0 on subprocess, docker (harness + tool runtime), and modal tool runtime; modal/prime-as-harness interception (prime_tunnel) untested — prime down. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): servers self-launch via their module; split setup/setup_task Drop the generic `toolserver.py` shim — each server module is self-runnable. The framework launches `python -m <cls.__module__>`; the module's `__main__` (or a package `__main__.py`) calls `ServerBase.run()`, which rebuilds the server from the environment (`VF_CONFIG` JSON + `VF_TASK`/`VF_TASK_CLS`, or `cli(config)` for a manual debug run — the config class is read off the `Toolset[Config]` generic) and serves it. This works in any runtime: host (ambient), or a sandbox after `_install_in_sandbox` makes the module importable, reached via `run_background([python, "-m", module])`. Consolidate the launch internals: move the serve loop onto `ServerBase._serve`, inline the former `run_mcp_server` (and drop its stale export), and fold `server_to_launch`/`_Launch` into `serve_in_runtime(server, task, runtime, port)`. Net: `serve_server`, `run_mcp_server`, `server_to_launch`, `_Launch`, and `toolserver.py` are gone. Split the setup hook: `setup(self)` (task-agnostic, runs for every server) + `setup_task(self, task)` (per-rollout, SKIPPED for a shared server). `serve()` warns loudly if a shared server overrides `setup_task` (its per-task logic would never run). Examples migrated; wiki-search's corpus build is now `setup` (shared), wikispeedia/textarena split global vs per-task, the user sims use `setup_task`. Verified e2e on subprocess: glossary 1.0, alphabet-sort user-sim drives multi-turn (stop=user_completed), plus flat-module (`-m glossary_v1`) and package (`-m alphabet_sort_v1` via `__main__.py`) launch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): name the ToolsetConfig placement validator descriptively Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): rename _VERIFIERS_BUILD_INPUTS -> VF_BUILD_INPUTS Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): move tool/user/server code into a verifiers.v1.mcp subpackage Split the cramped `tools.py` + `user.py` into `verifiers/v1/mcp/`: - `server.py` — `ServerBase` (the base authoring class + `run`/`_serve`/`setup`/`setup_task`) - `toolset.py` — `Toolset` + `ToolsetConfig` - `user.py` — `User` + `UserConfig` - `launch.py` — host-side launching: `serve`/`serve_tools`/`serve_shared`/`serve_user`/ `connect_user` + the runtime mechanics (`serve_in_runtime`, `_install_in_sandbox`, …) - `__init__.py` — re-exports the public surface No behavior change. Importers updated (`verifiers.v1`, taskset, rollout, env, interception). The dependency graph is a clean DAG (server ← toolset/user ← launch). Verified e2e on subprocess: glossary-v1 1.0, alphabet-sort-v1 user-sim 1.0 (stop=user_completed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): drop prime cleanup/stop tunnel loops (self._tunnels was removed) cleanup()/stop() still iterated self._tunnels after __init__ stopped initializing it (the prime_tunnel-based expose is gone), which would AttributeError on teardown. Removed the loops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): the env owns serving (shared tools + interception), injected into rollouts Shared tool servers and the interception pool are eval-level resources, but each eval runner stood them up itself: run_eval (in-process) entered both, while the env-server worker pool only entered the interception pool and never set up shared tools. So a shared server ran per rollout *with* a task through the env-server path (the non-rich CLI default and prime-rl's path) - rebuilding an expensive corpus each rollout, and tripping the shared-vs-task assertion ("shared server was launched with a task"). Make the Environment own its serving resources in one place: - Environment.serving(tasks) enters shared_tools + interception_pool and stashes them; Environment.episode() injects them into every Rollout at construction. - Episode.run / Rollout.run / run_with_retry drop their shared_urls/interception params - no runner threads them through anymore. - Both run_eval and EnvServer build episodes inside `async with env.serving(...)`. LegacyEnvServer overrides serving() to a nullcontext (v0 runs its own rollouts). The bug went unnoticed because the e2e suite only exercised run_eval, never the env-server pool. Add a run_v1_server fixture (run_eval_server, static 1-worker pool) and test_shared_tools_via_env_server (glossary-v1 tools.shared=True through the pool) to cover that path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): put the fixture dir on PYTHONPATH so self-launching servers resolve in subprocesses A self-launching tool/user server runs `python -m <module>` in a fresh subprocess. That inherits PYTHONPATH but not pytest's in-process `pythonpath` ini, so a fixture server module (echo_multi_v1, tool_response_image_v1) failed to import there ("No module named ...") while an installed example package (glossary_v1) resolved fine. Add a `pytest_configure` that puts tests/v1/fixtures on PYTHONPATH for spawned subprocesses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): exclude .venv/.git from sandbox source uploads _tar_source (uploads the env package to a docker/prime sandbox) only skipped __pycache__, so an env package whose dir contains a .venv would tarball gigabytes (a .venv is many GB) into an in-memory gzip and stream it over `docker exec -i cat` - effectively an infinite hang on the first docker/prime rollout. Skip a denylist of build/VCS/cache dirs (.venv, .git, .pytest_cache, .mypy_cache, .ruff_cache, node_modules, __pycache__) so only real source ships. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): e2e matrix over server runtime x agent runtime + multimodal VLM Restructure the v1 e2e tests around the three runtimes a rollout places things in - the user simulator's, the tool server's, and the agent (harness) runtime: - test_user (merge of the old test_multi_turn + test_user_sim_placement): a vf.User across user_runtime (colocated / own runtime: subprocess/docker/prime) x agent_runtime. - test_tool (merge of test_tool_placement + test_multi_turn_with_tools + test_shared_tools_via_env_server): a vf.Toolset across tool_runtime (colocated / shared / own runtime) x agent_runtime; the shared case runs through the env-server pool (regression guard for serving shared tools once per eval). - echo_tool_v1 fixture: an echo tool that stamps its output with a token the prompt never reveals, so reward 1.0 proves the tool was reachable and ran. - echo_multi_v1 -> echo_user_sim_v1 (clearer name); drop the now-unused harness_supports fixture. - test_tool_response_image uses a vision model (qwen/qwen3-vl-8b-instruct); the default text model has no image route. - tests/v1/fixtures/pyproject.toml: package the fixtures so a sandbox installs just this dir (its own pyproject) instead of climbing to the repo root and tarring the whole tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): self-describing parametrize ids for the e2e matrix Give every fixture param an explicit id so a case reads as a sentence instead of `[rlm-subprocess]`: - agent runtime -> `in-<rt>-runtime`; harness -> `<name>-harness` - user runtime -> `with-user-colocated` / `with-user-in-<rt>-runtime` - tool runtime -> `with-tool-colocated` / `with-tool-shared` / `with-tool-in-<rt>-runtime` e.g. `test_single_turn[rlm-harness-in-subprocess-runtime]`, `test_tool[in-docker-runtime-with-tool-shared]`. agent_runtime leads the user/tool signatures so the agent's runtime reads first. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): drop the redundant -runtime suffix from parametrize ids `in-subprocess-runtime` -> `in-subprocess`, `with-tool-in-docker-runtime` -> `with-tool-in-docker`, etc. Reads the same, less noise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): clear error when VF_TASK is set without VF_TASK_CLS + ruff format ServerBase.run() read os.environ["VF_TASK_CLS"] directly, so a VF_TASK without its paired VF_TASK_CLS raised a bare KeyError. The framework always sets both together (launch.py), so this only bites a manual/misconfigured launch - raise a descriptive ValueError instead. Also apply `ruff format` (earlier commits were format-clean under `ruff check` but not `ruff format`). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(v1): ruff format interception/pool.py + runtimes/base.py Format-only (line wrapping); these were format-clean under `ruff check` but not `ruff format`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): example envs put each server in its own self-launching servers/<name>.py Separate server code from taskset code: each env's tool/user server moves out of the taskset module into <env>/servers/<name>.py, a self-launching module ending with `if __name__ == "__main__": <Server>.run()` (framework launches `python -m <env>.servers.<name>`). The taskset module imports the server from .servers and uses it in tools()/user(); shared constants/data the server needs live in the server module. Flat envs (glossary, deepwiki) become packages; package envs drop their __main__.py. - glossary -> servers/facts.py (+ facts.json beside it) - deepwiki -> servers/deepwiki.py - alphabet_sort, color_codeword -> servers/user.py - wiki_search, wikispeedia -> servers/wiki.py (wikispeedia keeps graph.py in the package root) GUIDE.md "Tools and user simulators" rewritten to the current vf-native surface (vf.Toolset / vf.User classes, @vf.tool / respond, setup / setup_task, the servers/<name>.py layout, per-server placement with own-host-runtime default; tools + users can coexist). Verified: all 6 envs' server classes resolve to <env>.servers.<name>; glossary (tool) reward 1.0 on subprocess + docker; alphabet-sort (user sim) reward 1.0 on subprocess + docker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): bridge a shared host tool to the host when the harness runs remotely A `shared` tool on a host (subprocess/docker) runtime yielded a plain `http://127.0.0.1:<port>` URL, because serve_shared called serve() with no agent context so serve() took the `else` branch (`expose() or local`). That's reachable from a host-network harness but DEAD to a harness in a prime/modal sandbox — the per-rollout path bridges via host_endpoint, the shared path had no equivalent and nothing validated it (untested: prime was down). Thread the harness runtime's locality into the shared path: Environment.shared_tools passes `runtime_is_local(harness.runtime)` -> serve_shared -> serve(agent_is_local=...), and serve()'s own-runtime/shared branch is unified to `expose(port) or host_endpoint(port, harness_local)`. So a shared host tool now gets one host tunnel (reused by every rollout) when the harness is remote, localhost when it's local, and a remote tool runtime still publishes its own URL. Verified: shared tool + subprocess harness (env-server path) still reward 1.0. The shared + remote harness case mirrors the per-rollout bridge but is still untested (prime infra down). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): address review findings — colocated port clash, connect_user mislabel, config MRO - serve(): a colocated server is reached in-sandbox at localhost, so it now takes a free in-sandbox port instead of the runtime's published_port (a fixed SERVICE_PORT). Two colocated servers sharing one remote sandbox (a colocated tool + user, or two tools) would otherwise both bind SERVICE_PORT and the second's probe would fail. published_port is reserved for actually- exposed ports (a for_host server, or a tool in its own remote runtime) — and since only the one for_host server per rollout ever exposes, modal's single encrypted SERVICE_PORT suffices. - connect_user(): an exception from the harness body (thrown back at the yield) was caught with connected=True and re-wrapped as "connection lost", misdirecting debugging. Track an in_body flag and propagate body exceptions untouched; only genuine transport failures are wrapped. - ServerBase._config_cls(): walk the MRO so a further subclass that doesn't re-parameterize (class B(MyToolset)) inherits its config instead of raising "must parameterize its config". - Docs: note _free_port()'s accepted TOCTOU window (covered by the retryable probe) and that the subprocess API_KEY strip also applies to a task's tool/user server (use its own runtime if it needs a key). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style: ruff format textarena_v1 under ruff 0.15.17 CI's ruff-action pins no version so it runs the latest (0.15.17), which formats this file differently than a local 0.15.12 (a blank line + a couple of wraps). Format-only; brings the repo clean under the CI ruff so the Ruff check passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): one reachability resolver (reachable_url) for serve + interception The "which URL is reachable from where" logic was open-coded in three places (serve()'s for_host/colocated/own-or-shared branches, the interception pool, and the per-rollout interception fallback), all over the same two primitives. Lift it into a single resolver that owns the table: reachable_url(service, port, *, consumer) # service/consumer each a Runtime or HOST - same place (colocated, or host->host) -> localhost - service in a sandbox (remote runtime) -> its own expose() (reachable anywhere) - service on the host network, consumer remote -> a host_endpoint tunnel `serve` (tools/users), InterceptionPool, and Rollout._serve_interception now all route through it, so port exposure / tunneling lives in one auditable function. The two primitives (Runtime.expose out, host_endpoint in) are unchanged. No behavior change — verified across the full non-prime e2e matrix (server-runtime x agent-runtime, plus interception via every harness). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(v1): drop redundant top-level docstrings from env server modules The servers/<name>.py modules just restated their class + a "self-launching python -m ..." line; the class docstring + the GUIDE cover it. Removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): move codex to the agentic matrix (it no-ops on the echo single-turn task) codex is an autonomous coding agent: on `test_single_turn`'s no-op chat echo it often completes its loop without ever calling the model (0 nodes -> reward 0), flakily (some runs/docker it does reply). A stricter prompt didn't help and a lighter reward can't match zero output. On a task with a concrete action it's reliable, so move it from the `harness` fixture (single-turn) to `agentic_harness` (echo-agentic file write) — verified codex reward 1.0 there (subprocess + docker). rlm/kimi-code still cover an agent CLI on the simple single-turn task. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): servers bind an OS-assigned free port and report it back (drop _free_port) _free_port() probed the HOST's 127.0.0.1 for a free port, then handed it to the server — a TOCTOU race, and outright wrong for a colocated tool in a remote sandbox (host-free != sandbox-free; it could even draw SERVICE_PORT). Instead the server now binds its own socket: MCP_PORT when the framework fixed one (a self-publishing runtime's forwarded port), else port 0 — an OS-assigned free port, guaranteed free in whatever environment the server actually runs in. It writes the bound port to MCP_PORT_FILE before setup; serve_in_runtime reads it back (and returns it). Same pattern the interception server already uses (bind 0 + getsockname). _free_port is gone. Verified: full non-prime e2e matrix (test_user + test_tool, subprocess + docker, every placement, incl the cross-boundary port readback) — 15 passed under -n auto. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): textarena user must be colocated (OUTCOME_FILE handoff needs a shared workspace) TextArenaUser writes the game outcome to OUTCOME_FILE via a local open() and game_reward reads it via runtime.read() on the harness rollout's runtime. That only works if the user shares the harness's runtime/workdir — but this branch flipped the UserConfig default to colocated=False, so the user ran in its own workspace and the outcome file was never where scoring looked → reward always 0.0. Pin colocated=True on the taskset's user config (the docstring already assumed it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): textarena raises in setup if its user isn't colocated Belt-and-suspenders on top of the colocated=True default: if someone overrides `--taskset.user.colocated false`, the OUTCOME_FILE handoff silently breaks (reward always 0), so TextArenaUser.setup now fails loudly with the reason instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-16T22:28:04Z

+                os.rename(f"{tar}.part", tar)
+            if not (cache / subdir).exists():
+                with tarfile.open(tar, "r:gz") as t:
+                    t.extractall(cache, filter="data")


🟢 Low servers/wiki.py:42

tarfile.extractall(filter="data") raises TypeError on Python versions before 3.11.4 (and before 3.10.12), because the filter keyword argument did not exist yet. This crashes setup at runtime on those interpreters. Consider wrapping with a try/except TypeError fallback.

- t.extractall(cache, filter="data") + try: + t.extractall(cache, filter="data") + except TypeError: + t.extractall(cache)

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file @environments/wikispeedia_v1/wikispeedia_v1/servers/wiki.py around line 42: `tarfile.extractall(filter="data")` raises `TypeError` on Python versions before 3.11.4 (and before 3.10.12), because the `filter` keyword argument did not exist yet. This crashes `setup` at runtime on those interpreters. Consider wrapping with a `try`/`except TypeError` fallback. Evidence trail: - environments/wikispeedia_v1/wikispeedia_v1/servers/wiki.py line 42: `t.extractall(cache, filter="data")` at REVIEWED_COMMIT - pyproject.toml line 14: `requires-python = ">=3.11,<3.14"` at REVIEWED_COMMIT - Python 3.11 docs (https://docs.python.org/uk/3.11/library/tarfile.html): 'Нове в версії 3.11.4' (New in version 3.11.4) for extraction filters - CPython issue #102950 tracking backport to 3.11: https://github.com/python/cpython/issues/102950 - Python docs recommend `hasattr(tarfile, 'data_filter')` for compatibility checking

…arness.py (#1708) * chore(v1): resolve plugins via __all__ export, split into taskset.py/harness.py Replace the per-plugin load_taskset/load_harness hook with an __all__ export. The loader imports a plugin module, walks its __all__, and finds the single Taskset/Harness subclass; config and task types are read off that class's Taskset[TaskT, ConfigT] / Harness[ConfigT] generic (most-derived first, so a thin wrapper that re-binds the config wins). Zero or >1 exported subclasses raise an informative error. Restructure every v1 taskset/harness so __init__.py only re-exports + declares __all__, with the implementation in taskset.py / harness.py. Single-file envs become packages (aux scripts move alongside; hatch build switched to a wheel packages target). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): drop trivial re-export docstrings from plugin __init__.py Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): rename _plugin_class -> _exported_subclass Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Revert "chore(v1): rename _plugin_class -> _exported_subclass" This reverts commit c45cdc9. Keep the original `_plugin_class` name. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…1710) * feat(v1): let the user simulator open the conversation when a task has no prompt A task may now omit its prompt: Task.instruction is optional (default None). When a task carries no prompt and the taskset defines a user simulator, the interception server seeds the simulator's opening turn — respond("") — into the request before the first model call, so the model answers a user message rather than an empty prompt. The existing post-turn loop then drives the remaining turns unchanged. - Task.instruction: str | Messages | None (default None). - dialect.extend accepts a None completion (append only the user turn(s)); used to seed the opening turn before the model has spoken. - Interception server seeds the opening turn, guarded to num_turns == 0 so a later program request (e.g. after a tool call) never re-seeds. - DefaultHarness + its program emit no opening user message for a None instruction; resolve_prompt allows a None instruction. - Validation: a None-instruction task needs a user sim (per-rollout ProgramError); a user-sim taskset needs a SUPPORTS_USER_SIM harness (Environment check, mirroring the existing task-tools check). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(alphabet-sort-v1): demonstrate the user-opens-the-conversation path Add a `user_initiates` flag (default False). When set, the task carries no prompt (instruction=None) and the user simulator delivers the initial sort prompt as its first turn, then the follow-ups — exercising the framework's new opening-turn path. The simulator becomes a simple queue replay (opening + follow-ups), behaviorally identical to before when user_initiates is False. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(alphabet-sort-v1): hardcode the user simulator driving the conversation Drop the `user_initiates` flag: alphabet-sort always has no prompt on the task and the simulator drives the whole conversation — it opens with the sort prompt, then injects the follow-ups. The episode's user turns are stored as a single `user_turns` queue the simulator replays. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): cache the opening respond("") so a retried first request can't skip it The opening seed was gated only on `trace.num_turns == 0`. If the first model call failed (502) before `add_turn` recorded a turn, the harness's OpenAI SDK retried with a fresh request — re-entering the still-open gate and calling the user simulator's `respond("")` again. The simulator's queue had already advanced, so the retry injected the wrong user message and skipped the opening prompt. Cache the opening `respond("")` result (messages + done) on the session and reuse it while no turn has been recorded, so `respond` is invoked exactly once and the opening turn is seeded identically on every retry. The `num_turns == 0` gate still closes the seed once the first turn lands (the tool-call-interleaving case). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@vf

…coring (#1711) * feat(v1): add typed transient Trace.state shared across tool/user servers + scoring Add `Trace.state`: a typed, mutable per-rollout `State` (StateT) that tool servers (`@vf.tool`) and the user simulator (`respond`) read+write as `self.state` — synced to the host's authoritative `trace.state` over the interception server per call — and that `@reward`/`@metric`/`finalize` read+write directly off the trace. Distinct from `Trace.info`: `state` is transient runtime scratch (counters, game state, the `done` end-of-trajectory flag), never persisted to disk or sent over the wire; `info` stays the free-form persisted artifact dict. - state.py: `State` (strict, mutable, reserved `done`), `StateT` (defaults to `State`), and a `state_cls` generic-arg resolver. - Trace generic over (TaskT, StateT); the `state` field is `exclude=True`. `info` unchanged. - Taskset[TaskT, ConfigT, StateT]; Toolset[ConfigT, StateT]; User[ConfigT, StateT] — all default StateT to `State`, so an env that doesn't customize state adds no generic boilerplate. - ServerBase: `self.state` + per-call pull/push sync (`_with_state`) over a new interception `/state` GET/PUT channel, wired into servers via `VF_STATE_URL`/`VF_STATE_SECRET`. - `vf.User.respond` now returns `Messages` (not `(Messages, bool)`); end a trajectory by setting `self.state.done = True`. The interception loop checks `trace.state.done` (and `RolloutSession.refused` checks it before each model call, so a tool can end it too). - Migrated user sims (echo / alphabet_sort / color_codeword / textarena); added a counter-tool-v1 fixture and e2e state tests (in-process + env-server pool). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): make Task.instruction required again (still explicitly nullable) A task must now set `instruction` — omitting it errors. `None` stays valid and is the explicit opt-in for the user-simulator-opens-the-conversation path (a taskset sets `instruction=None` deliberately rather than inheriting it as a default), so #1710's interception/harness/validation logic keyed on `instruction is None` is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): make state.done a built-in stop + sync only stateful servers - `state.done` end-of-trajectory check moves out of the interception server's `RolloutSession.refused` (which assumed the state schema) into a built-in `Taskset.done` @vf.stop — refused() runs it generically alongside the taskset's own stops, so the transport layer no longer special-cases the signal. - The per-call state channel is now wired only for servers that use shared state: a Toolset that declares a custom `State` subclass, or any User (it drives turns and ends via state.done). A stateless toolset (base `State`) skips the wrapper, the per-call GET/PUT, and — on a remote runtime — the channel tunnel. Gated by `ServerBase._uses_state` (overridden True in User) in both `_with_state` and the host-side `serve`. - Make `STATE_TIMEOUT` public with a docstring; tighten the `Trace.state` docstring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): always sync the state channel (drop _uses_state gating) Every tool/user server in a rollout syncs `self.state` per call again — the per-call GET/PUT is localhost-cheap in the common case (subprocess/colocated/docker on the host), so gating it on whether the server declares custom state wasn't worth the asymmetry. `done` from a base-`State` tool now works without declaring a State subclass, and the channel wiring is uniform for tools and users. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): move end-of-trajectory fully into user space (no framework done) The base `vf.State` is now empty — the framework holds no opinion about its contents. A taskset that ends a trajectory from state declares its own flag and a `@vf.stop` over it; the interception server no longer references `state.done` anywhere (the opening-turn and post-turn loops just rely on `refused()` running the taskset's stops). The stop reason is the `@stop` method's name, so it's informative and taskset-controlled. - state.py: drop the reserved `done` field; State is a blank typed canvas. - taskset.py: drop the built-in `Taskset.done` stop. - interception/server.py: drop the opening + post-turn `state.done` checks. - User sims declare their own state + stop: echo/alphabet_sort/color_codeword use `user_finished`, textarena uses `game_over`. - Docs (GUIDE + docstrings) show the field+@Stop pattern. Verified: alphabet_sort (opening-turn + instruction=None) ends with stop_condition 'user_finished', reward 1.0; test_user/test_tool_state/test_tool green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(v1): warn on last-write-wins state; 400 on mismatched state PUT - GUIDE + _with_state docstring: the per-call state sync is a whole-object read-modify-write, so concurrent tool calls (several tool_calls in one turn) are last-write-wins and can lose each other's writes — keep shared-state mutations on the sequential path; taskset + servers must share one State. - handle_state_put: catch pydantic ValidationError and return 400 with the reason (a server pushing a shape that doesn't fit the trace's State type — usually a StateT mismatch) instead of an unhandled 500. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): also 400 on malformed-JSON state PUT (not just mismatched shape) Broaden handle_state_put's except to (ValidationError, ValueError) so a JSONDecodeError from request.json() surfaces as a clean 400 too, not a 500. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(v1): ruff format interception/server.py --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

#1715) ServerBase._serve only disabled FastMCP's DNS-rebinding protection when the server bound 0.0.0.0 (a self-publishing modal/prime runtime). A host-bound server (127.0.0.1, a subprocess/docker tool) reached by a REMOTE harness over a host_endpoint tunnel then got 421 Misdirected Request — the guard 421s the tunnel's Host. This failed the test_tool[in-prime-with-tool-{in-subprocess,in-docker,shared}] CI cells. Relax the guard unconditionally: these servers are reached only by our own harness over localhost or our tunnels, never a browser. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@reward

* feat(v1): add an `init` scaffolding command for new environments `uv run init <name>` scaffolds a v1 environment package following the `environments/*_v1` layout: a `pyproject.toml`, a package whose `__init__.py` re-exports the plugin via `__all__`, and a runnable `taskset.py` (replace `load_tasks` + the `@reward`). Parsed with pydantic-config like the other v1 commands (`InitConfig`); the v1 sibling of v0's `vf-init`. - `--add-tool` / `--add-user` / `--add-harness` scaffold a `vf.Toolset` / `vf.User` / `vf.Harness` and wire them in. The harness is exported alongside the taskset, selectable via `--harness.id <name>` (the loader filters `__all__` by base type, so one package can export both). - `--v0` scaffolds a legacy v0 `load_environment` package (delegates to `verifiers.scripts.init`) for backwards compatibility; rejected with `--add-*`. Registered as the `init` console script; documented in the README quickstart (alongside `validate`) and the GUIDE authoring section. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): scaffold requires-python >=3.11 to match verifiers core verifiers requires >=3.11,<3.14; the scaffolded env depends on it, so >=3.10 was inconsistent. Declare >=3.11. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): make the init scaffold a minimal skeleton Drop the baked-in demo (the WORDS list + the <answer>-regex exact-match reward). load_tasks and the @reward are now stubs that raise NotImplementedError — no task-specific data or scoring opinion in the scaffold — matching v0 vf-init's spirit. Tool/user/harness wiring is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-16T23:46:06Z

+def _names(name: str) -> tuple[str, str, str, str]:
+    """`(dash, pkg, stem, prefix)` derived from a raw name: the hyphenated id, the importable
+    package (underscores), the `_v1`-less stem (for tool prefixes), and the CamelCase class
+    prefix (e.g. `my-task-v1` -> `my-task-v1`, `my_task_v1`, `my_task`, `MyTask`)."""
+    dash = name.strip().strip("/").replace("_", "-").lower()
+    pkg = dash.replace("-", "_")
+    stem = pkg[:-3] if pkg.endswith("_v1") else pkg
+    prefix = "".join(part[:1].upper() + part[1:] for part in stem.split("_") if part)
+    if not prefix or not prefix[0].isalpha():
+        prefix = f"Env{prefix}"
+    return dash, pkg, stem, prefix


🟢 Low cli/init.py:46

When name is whitespace-only (e.g., " ") or only slashes (e.g., "/"), the strip() calls on line 50 return an empty string, so pkg becomes empty. This causes env_dir = Path(config.path) / pkg to resolve to the parent directory itself (e.g., ./environments), and scaffolded files like __init__.py and taskset.py are written there instead of a package subdirectory. Consider validating that pkg is non-empty after processing and raising an error for invalid names.

def _names(name: str) -> tuple[str, str, str, str]: """`(dash, pkg, stem, prefix)` derived from a raw name: the hyphenated id, the importable package (underscores), the `_v1`-less stem (for tool prefixes), and the CamelCase class prefix (e.g. `my-task-v1` -> `my-task-v1`, `my_task_v1`, `my_task`, `MyTask`).""" dash = name.strip().strip("/").replace("_", "-").lower() + if not dash: + raise ValueError(f"invalid environment name: {name!r}") pkg = dash.replace("-", "_") stem = pkg[:-3] if pkg.endswith("_v1") else pkg prefix = "".join(part[:1].upper() + part[1:] for part in stem.split("_") if part) if not prefix or not prefix[0].isalpha(): prefix = f"Env{prefix}" return dash, pkg, stem, prefix

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file @verifiers/v1/cli/init.py around lines 46-56: When `name` is whitespace-only (e.g., `" "`) or only slashes (e.g., `"/"`), the `strip()` calls on line 50 return an empty string, so `pkg` becomes empty. This causes `env_dir = Path(config.path) / pkg` to resolve to the parent directory itself (e.g., `./environments`), and scaffolded files like `__init__.py` and `taskset.py` are written there instead of a package subdirectory. Consider validating that `pkg` is non-empty after processing and raising an error for invalid names. Evidence trail: verifiers/v1/cli/init.py lines 46-56 (_names function), line 289 (env_dir = Path(config.path) / pkg), line 290 (pkg_dir = env_dir / pkg), line 339 (if not config.name: check on raw name, not processed pkg). Python Path behavior: Path('a') / '' resolves to Path('a').

* feat(v1): default the harness runtime to subprocess The harness `runtime` defaulted to `DockerConfig()`, so every eval/train run without an explicit `--harness.runtime.type` tried to start a container even though most tasksets just run a local process. Flip the default to `SubprocessConfig()` — the common, dependency-free case — and let tasksets that genuinely need a container opt in via `--harness.runtime.type docker` (or prime/modal). Tasksets carrying a per-task image or `NEEDS_CONTAINER` already raise a clear error against the subprocess runtime, so the container-requiring paths stay guarded. Tool servers and the user simulator already default to subprocess; this aligns the harness with them. - Drop the now-redundant `runtime = { type = "subprocess" }` from the example configs (alphabet_sort, textarena, wordle, gsm8k_rlm); docker stays explicit where it's required (terminal-bench-2, harbor). - `validate` keeps its docker default (a model-free gold check often needs the task's declared container); reword its docs now that they can't say "like eval". - Update README/GUIDE quickstart + runtime tables to mark subprocess as default. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style: wrap runtimes import in harness.py (ruff format) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@128k

…ing (#1717) * fix v1 train client prefix bridging * feat(v1): token-based prefix reuse in the message graph Refine prepare_turn's message-hash prefix by token identity at commit: reuse a stored prefix node only when its tokens match what this turn rendered (longest common token prefix of the concatenated prefix vs prompt_ids), forking at the first divergence so retokenization drift surfaces as a branch instead of silent mis-attribution. Bridge path keeps the prior verbatim (matches fully, stays linear); eval path has no token ids (falls back to message hash). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): message-level vs renderer-level branching + leaf->root token invariant Two branching test cases asserting the graph invariant (leaf->root concat == the engine's prompt_ids + completion_ids): message-level fork via compaction (hash divergence, tokenless), and a renderer-level break (prior <think> dropped on re-render) that forks only under the train client (token ids) and is invisible to the eval relay. Document both + the invariant in ARCHITECTURE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): drop the branching unit tests (keep the token-reuse impl + ARCHITECTURE design notes) Remove the message-level / renderer-level branching unit tests and their helpers from test_graph.py, and the test/validation paragraph from the ARCHITECTURE branching section; branching is exercised end-to-end instead. The graph token-based reuse + the conceptual ARCHITECTURE notes (branch types + the leaf->root token invariant) stay. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(v1): explain why ToolMessage.name exists (GPT-OSS Harmony + bridge) Most renderers key a tool result off tool_call_id, but GPT-OSS Harmony renders the function name (functions.<name>, else functions.unknown → broken token parity). The bridge sharpens it: it renders only the tail, so the issuing assistant's tool call is in the reused prefix and can't be recovered from the tail — the dialect recovers the name once from the full prompt and carries it on ToolMessage so later bridge tails have it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): this PR adds no tests (drop the bridge tests from the diff) Restore test_graph.py to the merge-base and delete test_train_client.py so the PR carries no test additions; branching/bridge are exercised end-to-end instead. The bridge + token-reuse implementation is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * perf(v1): avoid O(context) per-token scans in turn commit The per-turn token-attribution path ran two O(context) per-token Python builds every turn — synchronous on the env-server worker event loop, so at multiplex=128 they serialize across rollouts (head-of-line blocking). At 500+ turns near the context cap this is seconds of blocking per rollout. - _commit_turn: replace the full `stored` concatenation + per-token `while` LCP with a node-wise C-level slice compare (short-circuits at the first divergent node) — ~8.6x faster (7.0 -> 0.8ms @128k), no full copy. - previous_token_ids: nested per-token comprehension -> per-node extend (~3.4x faster). - train client: build the (O(context)) previous-turn ids only after the cheap bridge guards pass, so non-bridgeable turns don't pay for it. Behavior-identical: ruff clean, test_graph.py passes, and a full fix-git train re-run yields the same graph (1 branch, invariant lossless). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): drop add_turn wrapper for prepare_turn/commit Remove the `add_turn` convenience wrapper; every caller now uses the explicit two-step `prepare_turn(trace, prompt).commit(response)` — one obvious way to build the graph. The v0 legacy bridge and the graph tests are migrated; docstring references updated. Also add a graph test for the renderer-level break: two turns with the same message sequence (identical hashes) but a retokenized prior assistant turn fork by token identity (2 branches), while matching tokens stay linear (1 branch) — and each branch's leaf->root concat equals its own prompt_ids + completion_ids. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: eligotts <78387377+eligotts@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(v1): fix test_legacy run_v1 kwarg (runtime -> agent_runtime) _eval_config (and every test_e2e caller) takes agent_runtime; test_legacy passed runtime=, raising TypeError when the e2e tests run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci: run v1 tests in a separate -n auto -vv step Split tests/v1 out of the main test step into its own step run with pytest-xdist (-n auto) and -vv, and exclude it from the main step (--ignore=tests/v1). Coverage is appended across the two steps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): package counter_tool_v1 fixture for sandbox installs test_tool_state launches the counter toolset inside a docker/prime runtime via `python -m counter_tool_v1`, which uploads + installs the fixtures package. The module was missing from the fixtures pyproject `include`, so the sandbox wheel omitted it and the tool server died with "No module named counter_tool_v1" (surfacing as "server did not report its port"). Subprocess was unaffected (host PYTHONPATH). Add it to `include`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): surface the server log when a sandbox tool server never reports its port The port-file timeout raised a bare "did not report its port", hiding why the server died (e.g. an import error in the sandbox venv). Mirror the probe-failure path and append the server log tail, turning an opaque 180s timeout into an actionable error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): make log_tail a public module-level helper Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…v1 envs (#1721) * test(envs): run env smoke tests through the unified eval CLI (v0/v1 dispatch) The env tests assumed the v0 contract (vf.load_environment + vf-eval + hub tags/README), so every v1 plugin failed. Dispatch on the env style instead: - Eval via the `eval` CLI: a `_v1` taskset through `--taskset.id`/`--harness.id default`, a v0 env through the legacy `--id` bridge. Capped (-n 1 -r 2 --max-turns 4 --sampling.max-tokens 512 --rich false) so CI stays quick; `-r 2` because a taskset with @group_reward(s) needs >=2 rollouts. - Load check dispatches too: v0 -> load_environment, v1 taskset -> taskset_class, the compact harness -> harness_class. - Metadata: `tags` + README are a v0 hub convention, so they're only required of v0 envs; v1 plugins (the `_v1` examples + `compact`) are exempt. - Skip what can't run in plain CI: the SWE/container v1 tasksets (r2e_gym_v1, scaleswe_v1, swelego_v1, terminal_bench_2_v1); `compact` (a harness, not an evaluatable taskset); self_reward (group-only rubric the v0 bridge can't score per-rollout). - Run the test-envs job with `-n auto`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style: ruff format test_envs.py Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(envs): keep test_envs.py v0-only, add tests/v1/test_envs.py for v1 The v0 env smoke tests (vf.load_environment + vf-eval, tags/README metadata) don't fit v1 plugins, so filter `get_environments()` to v0 envs only — the `_v1` tasksets and the `compact` harness are excluded. Add tests/v1/test_envs.py: smoke-eval every `_v1` taskset in environments/ through the `eval` CLI (--taskset.id <id> --harness.id default) for one short capped rollout (-n 1 -r 2 --max-turns 4 --sampling.max-tokens 512), and require success. `compact` is excluded (a harness, not a taskset); the SWE/container tasksets (r2e_gym_v1, scaleswe_v1, swelego_v1, terminal_bench_2_v1) skip — they need a docker/prime runtime and are covered by the dedicated v1 e2e tests. This supersedes the earlier in-place v0/v1 dispatch in test_envs.py (and the test-envs -n auto change), keeping the v0 test as it was. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mikasenghaas and others added 6 commits June 9, 2026 03:40

chore: bump deps/vf-nano (legacy v0->Trace bridge)

d2189b8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump deps/vf-nano (plugins drop vf-nano dep)

963db28

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mikasenghaas changed the title ~~feat: replace v1 with vf-nano + add v0 legacy bridge~~ feat: v1 v1 <> nano bridge Jun 9, 2026

mikasenghaas changed the title ~~feat: v1 v1 <> nano bridge~~ feat: vf v1 <> nano bridge Jun 9, 2026

mikasenghaas mentioned this pull request Jun 9, 2026

feat: vf v1 <> nano bridge PrimeIntellect-ai/prime-rl#2739

Closed

mikasenghaas and others added 3 commits June 9, 2026 05:00

mikasenghaas changed the title ~~feat: vf v1 <> nano bridge~~ feat: vendor the v1 env library + v0 legacy bridge Jun 9, 2026

mikasenghaas changed the title ~~feat: vendor the v1 env library + v0 legacy bridge~~ feat: vf v1 <> nano bridge Jun 9, 2026

mikasenghaas and others added 3 commits June 9, 2026 07:24

mikasenghaas force-pushed the feat/nano-as-v1 branch from 6ef7ace to 00c7b77 Compare June 9, 2026 08:13

mikasenghaas mentioned this pull request Jun 9, 2026

feat: vf v1 <> nano bridge PrimeIntellect-ai/prime-rl#2742

Draft

mikasenghaas and others added 4 commits June 9, 2026 11:21

macroscopeapp Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread verifiers/v1/trace.py

Comment thread verifiers/v1/legacy.py

macroscopeapp Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread verifiers/v1/trace.py Outdated

macroscopeapp Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread verifiers/v1/runtimes/prime.py

Comment thread verifiers/v1/runtimes/subprocess.py

Comment thread verifiers/v1/runtimes/prime.py

mikasenghaas and others added 16 commits June 14, 2026 18:35

Add mini-swe-agent V1 harness (#1673)

e88bc4e

Fix mini-swe-agent E2E matrix (#1693)

1872b99

Support images in v1 tool responses (#1694)

3c2512c

* Support images in v1 tool responses * Add e2e taskset for image tool responses

map v1 reasoning effort by dialect (#1690)

2822e23

Add Kimi Code V1 harness (#1675)

e906fc1

Ignore unsupported Anthropic service tiers (#1689)

36faabd

Keep relayed SSE streams alive (#1688)

2cab1fe

Add Harbor task multipliers (#1700)

621bb52

* Add Harbor task multipliers * Remove Harbor multiplier tests * Remove TerminalBench config hint

Extend the V1 bash tool timeout (#1701)

b1f827e

* fix(v1): extend bash tool timeout * Increase bash command timeout to 3600 seconds Increase timeout for bash command execution from 60 minutes to 3600 seconds.

Use Prime CLI config for v1 eval (#1703)

68aac71

* Use Prime CLI config for v1 eval * Gate Prime config by inference URL * Detect Prime inference hosts

macroscopeapp Bot reviewed Jun 16, 2026

View reviewed changes

mikasenghaas and others added 2 commits June 16, 2026 15:43

macroscopeapp Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread packages/harnesses/harnesses/default/program.py

mikasenghaas and others added 3 commits June 16, 2026 16:35

macroscopeapp Bot reviewed Jun 16, 2026

View reviewed changes

mikasenghaas and others added 4 commits June 16, 2026 18:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: vf v1 <> nano bridge#1576

feat: vf v1 <> nano bridge#1576
mikasenghaas wants to merge 116 commits into
mainfrom
feat/nano-as-v1

mikasenghaas commented Jun 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot Jun 16, 2026

Uh oh!

macroscopeapp Bot Jun 16, 2026

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

mikasenghaas commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mikasenghaas commented Jun 9, 2026 •

edited

Loading