feat: vf v1 <> nano bridge #1576
…orts First step of replacing v1 with vf-nano. Deletes verifiers/v1/ wholesale and strips its surface from verifiers/__init__.py (lazy imports, __all__, TYPE_CHECKING) and utils/env_utils.py (load_taskset/load_harness + the typed-config/component machinery). load_environment is now v0-only. Example v1 envs, v1 tests, eval.py v1 path, and docs are removed in follow-up commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…envs Removes the 20 v1-native example envs (tau2_bench_v1, hello_*_v1, bfcl_v3, dspy_*, openenv_*, rlm_swe_v1, sft_replay, mcp_search_env, nemo_gym_env, openai_agents_env, opencode_harbor, langchain_*, wordle_v1, nested_harness_v1) and their *_v1 siblings; removes the v1 test suite (test_v1_*, test_eval_cli, test_wordle_v1_env, test_wiki_search_v1, test_mcp_search_env); strips the v1 flag/branch from the kept v0 envs (reverse_text, alphabet_sort, math_python, wiki_search). Follow-ups: eval.py/init.py v1 paths, remaining v1 test refs, docs, State v1-contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Vendor vf-nano as a submodule under deps/vf-nano and extend the verifiers package __path__ so verifiers.nano imports from it; alias verifiers.v1 -> verifiers.nano 1:1 (verifiers.v1.Trace, .serve.EnvServer, .EnvConfig are the nano objects). Add a v1 extra with nano's runtime + serve deps. One verifiers package now carries both the v0 API and v1 (=nano). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…verse-text-v1) Strip the v1 taskset/harness CLI-override path from scripts/eval.py so vf-eval is v0-only; expose nano's eval as vf-eval-v1 so both run side by side. Bump deps/vf-nano to the reverse-text-v1 rename. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the v1-only machinery the deleted v1 framework grafted onto State: the _vf_state_contract contract (+ its guards in every dict method), the runtime/endpoint/tools/runtime-handle method cluster (get_model/get_client/get_endpoint_config/get_tools/add_tool/_runtime*/strip_runtime_handles), the for_task borrow/group-state params, and the module-level group-state/borrow helpers. State is now plain v0: dict semantics + _set_* + stop + timing + finalize + _legacy_for_task. Verified: State.for_task/stop/finalize and v0 env load work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…doc) Remove the vf-init --v1/--openenv/--with-harness scaffolding (templates + flags) now that v1 is vf-nano; vf-init is v0-only. Delete the v1-specific test functions (test_imports, test_init_script, test_trajectory_processing) and the v1 harness-authoring doc. Remaining: a docs prose pass (overview/environments/evaluation/reference/training still mention the old v1 API). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
De-submodule vf-nano and vendor it 1:1 into the repo as the verifiers.v1
subpackage, then drop the legacy v1 packages it replaces.
- Copy vf-nano (latest main) in: package -> verifiers/v1/, plus examples/,
configs/, packages/{tasksets/harbor, harnesses/{default,rlm}}. Remove the
deps/vf-nano submodule and the verifiers/__init__ __path__ shim.
- verifiers.v1 is now a real subpackage (drop the verifiers/v1.py alias); the
v0 -> vf.Trace bridge lives at verifiers.v1.legacy.
- Rename nano -> v1 throughout (code, comments, configs); model names like
gpt-*-nano / Nemotron-Nano are untouched.
- Delete the old-v1 tasksets/harnesses packages and their tests + publish
workflows; rework pyproject to source/group the v1 plugins (default-installed),
drop the old extras/conflicts, and relax the plugins to >=3.10.
- Exclude vendored verifiers/v1 from verifiers' ty gate; restore textarena/nltk
in dev so the v0 textarena env type-checks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gins
- Scripts: the v1 CLIs are now `eval` / `serve` (was `vf-eval-v1`), matching the CLI's own usage strings and the example config headers.
- Move the v1 runtime deps (loguru, tomli-w, renderers) into base `dependencies` and drop the `v1` extra, so `import verifiers.v1` always works.
- Shipped plugins are vendored by default (no extras): `tasksets` bundles harbor, `harnesses` bundles default + rlm. Each plugin is a top-level package resolved by id (`import <id>`); example plugins stay standalone under examples/.
- Flatten core: verifiers/v1/harnesses/base.py -> verifiers/v1/harness.py; drop the one-module harnesses/ subpackage.
- Bump prime-tunnel>=0.1.8, prime-sandboxes>=0.2.27 (latest).
- Drop the <3.14 cap from the shipped/example plugin pyprojects.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Drop the "Run Prime sandbox tests" CI step: its tests lived in the removed test_v1_runtime_lifecycle.py, so `pytest -m prime_sandbox` collected nothing and exited 5.
- Semgrep job: `uv sync --no-default-groups --group policy` (the plugin groups are default + declared incompatible with policy, so the old `--no-dev` still pulled them and the resolve conflicted).
- Drop Python 3.10: requires-python >=3.11 (+ classifier, CI matrix). With renderers/v1 deps in base and example plugins pulling chromadb -> onnxruntime (no 3.10 wheel), 3.10 is no longer supported.
- tests/test_envs.py: remove the obsolete v1 tests (alphabet_sort_v1 / test_v1_wrapper_*) and the stale prime-pydantic-config exclude-newer cap that conflicted with renderers' required version.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The .semgrep/verifiers.yml policy enforced the old hand-authored v1's conventions: env-authoring rules targeting load_environment(config) shims (the v1 env API is gone), package rules pointing at the old packages/<x>/<x> layout, State methods that were removed, and a canonical-shim exclude list of deleted files — plus typing rules (no Any/Mapping/__future__ annotations) that contradict the vendored vf-nano code (already excluded from the ty gate). Remove the policy wholesale: .semgrep/verifiers.yml, the Semgrep CI job, the `policy` dependency group + its uv conflicts, the pre-commit hook, the now-empty [tool.ruff] exclude, and the dead nosemgrep waivers. A lint policy for the new architecture can be written against vf-nano separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Task gains `system_prompt: str | None`. Harness adds the `APPENDS_SYSTEM_PROMPT` class var + `resolve_prompt`: harnesses that support it emit the system prompt as a real system message (default via program.py; rlm via RLM_APPEND_TO_SYSTEM_PROMPT, which rlm appends to its generated prompt); others fold it into the user instruction with a warning.
- default harness adds a one-line bash system prompt (before the task's) only when `enable_bash`.
- reverse_text_v1 sets `system_prompt` separately so its prompt is byte-identical to the v0 env ([system, user]), so the model answers directly instead of leaking <think>.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
6ef7ace to 00c7b77
The renderer client built its tokenizer/renderer pool from the per-request `model`, which becomes the LoRA adapter name (e.g. `r32-a64.0`) after a weight update — there is no HF tokenizer published under that name, so rollouts 404'd. Add `renderer_model_name` to `RendererClientConfig` (pin it to the base model). The v1 `RendererClient` and the v0 legacy bridge use it for the tokenizer pool while the per-request `model` still selects the sampling target, so LoRA sampling keeps routing by the adapter name. Restores parity with the v0 `ClientConfig.renderer_model_name` wiring used on prime-rl main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The openai_chat_completions client now best-effort parses the prompt and completion token ids and sampling logprobs that vLLM returns (return_token_ids + logprobs) into Response.tokens, so MITO training (no renderer) can train on real on-policy tokens instead of re-tokenizing the messages downstream. Sampling args still pass straight through; tokens stay None when the provider returns neither token ids nor logprobs (e.g. eval, or non-vLLM providers). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bridge only kept token ids: it dropped the prompt messages, the response message (content / reasoning / tool calls), finish_reason, usage, and the task's system prompt / answer — so a v0-bridged Trace was a near-empty skeleton next to a native v1 Trace. The cause: v0 RolloutOutput nests these as pydantic objects (messages, Response) and records finish_reason on response.message, but the mapping only handled plain dicts and read finish_reason off the response. Coerce v0 objects to dicts before mapping (_as_dict), read finish_reason/usage from their v0 locations, mirror tokens onto the response (as the native client does), and carry the prompt's system_prompt / instruction / answer onto the task. A v0-bridged Trace now matches the native v1 schema (verified by diffing reverse-text rollouts). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
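The coercion step can be sketched roughly as follows. This is an illustration under assumptions, not the bridge's code: `as_dict` mirrors the `_as_dict` idea named above, `map_response` and its output keys are hypothetical, and "pydantic object" is approximated as anything exposing `model_dump()`.

```python
from typing import Any


def as_dict(obj: Any) -> dict[str, Any]:
    """Coerce a dict or a pydantic-style object (anything with model_dump) to a dict."""
    if isinstance(obj, dict):
        return obj
    dump = getattr(obj, "model_dump", None)
    if callable(dump):
        return dump()
    raise TypeError(f"cannot coerce {type(obj).__name__} to dict")


def map_response(response: Any) -> dict[str, Any]:
    """Hypothetical mapping step: v0 keeps finish_reason on response.message."""
    data = as_dict(response)
    # The nested message may itself still be an object, so coerce it too.
    message = as_dict(data.get("message") or {})
    return {
        "content": message.get("content"),
        "finish_reason": message.get("finish_reason"),
    }
```

Coercing before mapping means the same code path handles plain dicts and nested objects, which is the failure the commit describes: a mapper that only understood dicts silently dropped everything else.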
) Rename every taskset under examples/tasksets/ to a `-v1` id (package name, module, and directory) so they no longer collide with the v0 environments of the same name (gsm8k, wiki-search, math-env, ...) when both are installed in one env. reverse-text-v1 was already suffixed; harbor (a bundled taskset with no v0 counterpart) is left as-is.
- examples/tasksets/<x> -> <x>_v1, module <x>.py -> <x>_v1.py; verify.py / server.py / facts.json keep their names (read via __file__, never imported)
- package tasksets: inner package wiki_search/wikispeedia -> *_v1, with their self-imports and `-m <pkg>.server` launch paths updated to match
- root pyproject [tool.uv.sources] + examples group, and configs/*.toml taskset ids
- refresh uv.lock
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add RetryConfig (attempts / include / exclude) on EnvConfig.retry and retry a whole rollout with tenacity when it ends with a captured error — parity with v0's rollout-level retries. Matching is by exception type name; include/exclude name exception classes (e.g. ModelError, ProgramError). Flags: --retry.attempts / --retry.include / --retry.exclude. EvalConfig inherits EnvConfig and the env server runs through Environment.episode, so both eval and training get retries. Retries are first-class on the Trace: `errors` is the list of per-attempt errors (oldest first), and `error` is now a computed field returning the most recent — so a retried-then-failed trace shows every error that led to a retry. Retry utilities live in verifiers/v1/retries.py. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): per-rollout token limits (EnvConfig.max_{input,output,total}_tokens)
Add framework-enforced token budgets alongside max_turns: max_input_tokens,
max_output_tokens, max_total_tokens on EnvConfig. The interception server checks
them before each turn via a new RolloutLimits bundle (which also subsumes
max_turns), capping the trace's prompt_len / completion_len / total_tokens
computed properties. Reaching any limit refuses the turn and records it as the
stop condition, and is_truncated now treats the token-limit conditions as
truncation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(v1): drop 'like max_turns' from token-limit field docstrings
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* style(v1): trim limit-check comment in interception
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* style(v1): ruff format interception
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): reclaim orphaned subprocess workspaces
A rollout's /tmp workspace is removed in `stop()`, but a process killed mid-rollout
(SIGKILL, OOM, hard crash, interrupted teardown) never reaches it, so the workspace
leaks with no way to reclaim it — repeated runs eventually fill /tmp ("No space left
on device" at mkdtemp).
Name each workspace `/tmp/v1-<pid>-*` and, once per process on the first `start()`,
sweep `/tmp/v1-<pid>-*` whose pid is no longer alive. PID-keyed, so a concurrent live
process's workspaces are never touched; graceful per-rollout cleanup (`stop()`) is
unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
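The PID-keyed sweep in this commit (dropped again by the refactor that follows) can be sketched like this. The helper names and the `root` parameter are hypothetical; the `v1-<pid>-*` naming comes from the message. The liveness probe uses `os.kill(pid, 0)`, which is POSIX-only.

```python
import os
import shutil
import tempfile
from pathlib import Path

PREFIX = "v1-"  # workspace prefix from the commit message


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def make_workspace(root: str = "/tmp") -> Path:
    """Create a workspace tagged with the current process id."""
    return Path(tempfile.mkdtemp(prefix=f"{PREFIX}{os.getpid()}-", dir=root))


def sweep_orphans(root: str = "/tmp") -> list[Path]:
    """Remove workspaces whose owning pid is no longer alive."""
    removed = []
    for path in Path(root).glob(f"{PREFIX}*-*"):
        pid_text = path.name[len(PREFIX):].split("-", 1)[0]
        if not pid_text.isdigit() or pid_alive(int(pid_text)):
            continue  # unparseable name or a live owner: leave it alone
        shutil.rmtree(path, ignore_errors=True)
        removed.append(path)
    return removed
```

One caveat of any PID-keyed scheme: the OS can reuse a pid, in which case an orphan looks owned and is skipped until a later sweep.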
* refactor(v1): atexit-based runtime teardown; drop the SIGKILL reaper
Make resource cleanup a backend-agnostic property of `Runtime`:
- a sync `cleanup()` is the teardown source of truth; the public async `stop()` runs it
off the event loop on the happy path.
- `make_runtime` registers each runtime in a WeakSet and arms one sync `atexit` hook that
calls `cleanup()` on anything still live — so a Ctrl-C / SIGTERM that cancels the
rollout's `finally` mid-teardown still frees the workspace / container / sandbox, reusing
each backend's own cleanup. The hook must be sync: at interpreter shutdown the event loop
and its thread-pool are gone, so async teardown raises "cannot schedule new futures".
Drop the PID-tagged `reap_orphans` startup sweep. A SIGKILL/OOM runs no in-process code at
all, so reclaiming it needs an external mechanism; prime sandboxes already self-terminate
via their server-side max-lifetime, and the local subprocess/docker cases are out of scope.
Prefix workspaces/containers/scripts with `vf-` (was `v1-`).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): delete the prime sandbox in the sync atexit cleanup too
`cleanup()` (the atexit backstop) only stopped the tunnels and left the sandbox — the
costly resource — to its server-side max-lifetime. prime_sandboxes ships a sync
`SandboxClient`, so delete the sandbox synchronously there as well (the async client can't
run once the loop is gone). Idempotent with the async `stop` on the normal path: a second
delete just 404s.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* style: move teardown comments off the statement line (ruff format)
The inline comments pushed two lines past the 88-col limit; moving them above the
statement keeps `ruff format` happy without ruff's awkward auto-wrap.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): public register/cleanup_at_exit, trim runtime-teardown comments
- rename the module-level helpers to public `register` / `cleanup_at_exit`
- trim the `_LIVE` block comment and drop the inline "no event loop" why-comments
(the `cleanup` docstring already covers why teardown is sync)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1681) r2e-gym-v1 hardcoded the GAR prefix on every image, which only pulls on runtimes with GCP credentials (e.g. Prime sandboxes); a local docker runtime fails with "denied: Unauthenticated request". Add `R2EGymConfig.use_prime_registry` (default False): images come from the dataset's public Docker Hub `docker_image` (`namanjain12/<repo>_final:<commit>`) unless opted in to the registry. Mirrors the scaleswe-v1 change (#1678). All 4578 R2E-Gym-Subset images are public on Docker Hub, so the default works on any runtime. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…age (#1683) The availability filter checked each task's resolved `image`. With `use_prime_registry=true` that's a private Artifact Registry ref, which `_available_images` can't enumerate anonymously and so keeps unchecked - making the filter a no-op exactly when images are pulled from the GAR. Tasks missing from the GAR (e.g. durandtibo_iden_pr53) then still hit IMAGE_PULL_FAILED. Filter on the dataset's public Docker Hub `image_url` instead, independent of the resolved registry: the GAR mirrors Docker Hub, so the public tag set is the canonical (and only anonymously-checkable) availability signal in both modes. Now drops the 708 missing tags whether or not `use_prime_registry` is set. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Removes the built-in Claude Code harness (added in #1669): deletes `packages/harnesses/harnesses/claude_code/` and its re-export from the `harnesses` package `__init__`. Done as a custom removal rather than `git revert faf7ce1` so the `RetryingClient.relay_aux` passthrough #1669 also added is kept - it's shared aux-relay plumbing (the base/eval `relay_aux` and the interception call predate #1669), and the Anthropic dialect it serves stays in place for Anthropic-native agents. A straight revert would have dropped it. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1685)
* fix(rlm-harness): install/run without root so the subprocess runtime works
`uv run eval ... --harness.id rlm --harness.runtime.type subprocess` crashed with `FileNotFoundError: 'rlm'`. The harness forced rlm's installer to `/usr/local/bin` and prepended an unconditional `apt-get`, both root-only; on a non-root host the install silently failed and the bare `rlm` exec then raised FileNotFoundError (the subprocess runtime inherits the host PATH, where rlm wasn't installed).
rlm's install.sh already fetches curl/uv itself (via the runtime's package manager, guarded) and defaults its install dir to a user-writable path. So:
- Install uv + the rlm CLI into a fixed user-writable dir (`/tmp/vf-rlm/bin`) and run the binary by absolute path - no root, no PATH dependency. Works on a non-root host and a root container alike.
- Only `apt-get` for git (needed for the pinned checkout) when it's missing, so a host that already has git needs no root.
- Check the install result and raise a clean ProgramError on failure, instead of letting a missing binary surface as an uncaught FileNotFoundError (matches the codex harness).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(rlm-harness): flock-serialize install so shared-runtime rollouts don't race
Concurrent rollouts on one runtime (subprocess on the host) all clone/install into the same /tmp dirs and clobber each other (git 'destination already exists' / refs-backend abort). Guard the install with flock: the first installs, the rest wait and reuse the binary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(codex-harness): install/run without root, pinned to /tmp/vf-codex
Apply the same convention as the rlm harness: install the codex binary into a user-writable /tmp/vf-codex/bin (not root-only /usr/local/bin) and run it by absolute path (not a bare `codex` on $PATH), fetch curl only when missing, and flock-serialize the install so concurrent rollouts sharing one runtime don't race the download. Makes codex work on the subprocess (non-root host) runtime, consistent with rlm.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(codex-harness): drop redundant install comment
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(rlm-harness): drop redundant install comment
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
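The "first installs, the rest wait and reuse" pattern can be sketched in Python with `fcntl.flock`. The harnesses themselves presumably use the shell `flock` utility, so this is an analogue rather than their code; `install_once`, the lock-file name, and the `install` callback are hypothetical. `fcntl` is POSIX-only.

```python
import fcntl
from pathlib import Path
from typing import Callable


def install_once(bin_path: Path, install: Callable[[Path], None]) -> Path:
    """Run install() under an exclusive flock unless bin_path already exists."""
    bin_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = bin_path.parent / ".install.lock"
    with open(lock_path, "w") as lock:
        # Blocks until any other holder releases the lock.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            # Re-check after acquiring: another process may have installed.
            if not bin_path.exists():
                install(bin_path)
            # Fail cleanly instead of surfacing a later FileNotFoundError.
            if not bin_path.exists():
                raise RuntimeError(f"install did not produce {bin_path}")
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return bin_path
```

The existence check sits inside the lock on purpose: checking before acquiring would let two processes both decide to install.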
… region-limited) (#1686)
* test(v1): skip own-runtime prime port-exposure e2e cases (region-limited)
test_task_tools_own_runtime[prime] / test_user_own_runtime[prime] run a tool / user-sim server in its own prime sandbox, which must publish its port back to the host via native `expose`. That is currently region-limited (see PrimeRuntime.public_url), with no host-localhost fallback for a port inside a remote sandbox. The old `skip_if_unexposable` only skipped when the trace error contained "port exposure", so any other failure (e.g. provisioning) hard-failed instead.
Make it an explicit, upfront skip for the prime case (before provisioning), with a TODO to re-enable once prime supports port exposure in all regions (or the runtime publishes the port via an in-sandbox tunnel). subprocess/docker are unaffected (they share the host network).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(v1): refocus prime port-exposure skip on test_multi_turn
The actual failing case is test_multi_turn[*-prime]: its user-sim is colocated in the agent's prime sandbox and host-reachable, so it must publish its port via native expose (region-limited) - but unlike the own-runtime tests it had no skip_if_unexposable guard, so it hard-failed. Add the existing guard to it.
Reverts the previous over-broad change to the own-runtime tests (which already had the guard) + the conftest fixture rewrite; only adds a TODO to the fixture.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Support images in v1 tool responses
* Add e2e taskset for image tool responses
* Add Harbor task multipliers
* Remove Harbor multiplier tests
* Remove TerminalBench config hint
* fix(v1): extend bash tool timeout
* Increase bash command timeout to 3600 seconds
Increase the bash command execution timeout to 3600 seconds (60 minutes).
* Use Prime CLI config for v1 eval
* Gate Prime config by inference URL
* Detect Prime inference hosts
* chore: flatten examples/ into a single environments/ section
Move the v1 example tasksets (examples/tasksets/*) and the compact harness (examples/harnesses/compact) into the flat environments/ directory, alongside the standalone v0 environments; there is no more examples/ tree.
- [tool.uv.sources]: paths examples/tasksets/<x> -> environments/<x>, examples/harnesses/compact -> environments/compact (package names unchanged)
- eval/serve/validate CLIs: the -h example listing now scans environments/ (a single flat list, since tasksets/harnesses are no longer split by dir)
- GUIDE/README/loaders doc references updated
Package names, the `examples` dependency-group (a curated default-install set, referenced by name not path), and default-groups are unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): drop local_examples help hint
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
```bash
uv run eval gsm8k-v1 -n 5 -r 3 \
--max-turns 8 --max-total-tokens 8192 \ # per-rollout budgets
--retries.model.max-retries 3 --retries.runtime.max-retries 3 \ # retry one failed call
--retries.rollout.max-retries 3 --retries.rollout.include ProgramError \ # retry a whole rollout, by error type
--timeout.rollout 600 --timeout.scoring 120 # per-stage wall-clock caps (seconds)
```
🟢 Low v1/GUIDE.md:280
The bash examples on lines 280-285 place inline comments (# per-rollout budgets, etc.) after \ line continuations. In bash, \ continues a line only when it is immediately followed by a newline. Here each \ is followed by a space, so it escapes that space (passing a stray " " argument), the # then starts a comment, and the command ends on that line; the remaining flag lines run as separate commands and fail when the snippet is copy-pasted. Consider moving the comments above the command or removing them from the continuation lines.
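The tokenization issue can be approximated from Python with `shlex`, which follows POSIX-style quoting rules. This is only an illustration of how one such line splits, not a full model of bash (for example, `shlex` does not treat newlines as command separators).

```python
import shlex

# One flag line from the snippet, with the comment after the backslash.
line = r"--max-turns 8 --max-total-tokens 8192 \ # per-rollout budgets"

# comments=True makes shlex drop everything from an unquoted "#" onward.
tokens = shlex.split(line, comments=True)
print(tokens)
```

The last token is a single literal space: the backslash escaped the space that follows it instead of continuing the line, and the comment swallowed the rest.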
```diff
uv run eval gsm8k-v1 -n 5 -r 3 \
- --max-turns 8 --max-total-tokens 8192 \ # per-rollout budgets
- --retries.model.max-retries 3 --retries.runtime.max-retries 3 \ # retry one failed call
- --retries.rollout.max-retries 3 --retries.rollout.include ProgramError \ # retry a whole rollout, by error type
- --timeout.rollout 600 --timeout.scoring 120 # per-stage wall-clock caps (seconds)
+ --max-turns 8 --max-total-tokens 8192 \
+ --retries.model.max-retries 3 --retries.runtime.max-retries 3 \
+ --retries.rollout.max-retries 3 --retries.rollout.include ProgramError \
+ --timeout.rollout 600 --timeout.scoring 120
```
…ment (#1698)
* feat(v1): vf-native Toolset/User class surface + per-server runtime placement
Author tool/user servers as classes (no FastMCP, no separate server.py): a `vf.Toolset` with `@vf.tool` methods + `setup()`, or a `vf.User` with a single `respond()` hook. `@vf.tool` reuses the existing `mark`/`discover_decorated` machinery; a generic `verifiers.v1.toolserver` launcher serializes the class, rebuilds it in a runtime, and serves it over MCP.
Placement (colocated / shared / own runtime) moves onto each server's `config` (`vf.ToolsetConfig` / `vf.UserConfig`), so different servers can run in different runtimes. The default is the server's OWN host (subprocess) runtime: it runs where the eval's deps live and the harness reaches it over the host network (docker --network host) or a tunnel (prime), so a fresh docker/prime sandbox needs nothing installed. The redundant taskset-level tools/user config defaults are removed.
Ports all server-bearing examples to the class surface (glossary, wiki_search, wikispeedia, alphabet_sort, color_codeword, textarena); deepwiki stays on `url`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): config-initialized Toolset/User classes + per-data-kind channels
Reshape the vf-native surface to mirror Taskset/TasksetConfig: a `Toolset`/`User` is a plain class initialized from its config (`cls(config)`), not a pydantic model holding fields. The config (`ToolsetConfig`/`UserConfig` subclass) is the serializable data; the class is behaviour. This removes the pydantic-on-behaviour awkwardness (per-rollout state is now a plain `self.x`, no `PrivateAttr`).
Each kind of data has its own channel, instead of all living on the object:
- genuine config (CLI-tunable knobs: placement/runtime, wikispeedia links_only): a `ToolsetConfig`/`UserConfig` subclass, serialized to the server.
- global state (facts corpus, wiki graph): module-level or built in `setup` from disk/dataset, server-side, never config.
- per-task input (wikispeedia source/target, alphabet_sort follow_ups): read off the rollout's task in `setup(self, task)`; the framework ships the task.
- per-rollout mutable state (turns, path, game): plain attrs set in `setup`.
The launcher rebuilds `cls(config)` and calls `setup(task)`; `server_to_tools` serializes the config + task (refs + JSON). Examples updated accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): single internal launcher; drop raw Tools; config polish
Internals: one `serve(server, task, agent_runtime, for_host)` launcher handles any vf-native server (Toolset OR User): colocated or its own runtime, shared or per rollout, with reachability resolved by consumer (host-driven user vs model-called tool). `serve_tools`/`serve_shared`/`serve_user` are now thin wrappers over it (an `AsyncExitStack` for teardown), replacing three near-duplicate implementations.
Surface:
- Remove the raw `vf.Tools` authoring escape hatch: tools are `vf.Toolset`, users are `vf.User`, only. `Tools` becomes a private `_Launch` descriptor. A remote MCP endpoint is a `vf.Toolset` with `url` on its config (deepwiki). The dead `headers` field is dropped.
- `name` is a class `ClassVar` (an identity, like `deps`), not a config field, so a `--taskset.tools.runtime.type docker` override can't drop the tool prefix.
- Per-server config registered on the taskset config (`tools` / `user` fields), so placement is CLI-tunable (`--taskset.tools.shared false`, `--taskset.user.runtime.type ...`).
- `setup(self, task)` sets plain public instance attrs (no leading underscores). `@vf.tool` no longer takes `priority` (tools are an unordered set).
Fixtures/tests: `echo_multi_v1` → `vf.User`; drop `echo_tool_v1` and the two own-runtime matrix tests (a bare sandbox can't run an unpublished vf-native server; that path is covered by the host-side default in the harness x runtime matrix).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): render one uv-script per server; drop the -m launcher
Unify the launch path on a single rendered PEP 723 uv-script per vf-native server (`server_to_tools` → `_render_script`), `uv run` in any runtime, with no separate host `command` path. On a host (subprocess) runtime the script pins `verifiers` + the taskset package to their local editable checkouts via `[tool.uv.sources]`, so it runs from the dev tree with no publishing; in a sandbox those resolve from PyPI. The script is written to a content-addressed path so uv keys one resolved env per distinct script, shared across rollouts.
Removes `verifiers/v1/toolserver.py`, the `_Launch.command` field, and `sys.executable` plumbing; `_editable_dist` resolves a top-level module to its (distribution name, editable path). Also move `UserConfig` to `user.py` next to `User` (it was in `tools.py` only for import ordering; `tools.py` never used it).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): plain PEP 723 header, no [tool.uv.sources]
The rendered server script is now a vanilla uv-script: `# /// script` with a `dependencies = [...]` header and nothing else. The host/sandbox split moves to how it's launched (`serve_in_runtime`): on a subprocess (host) runtime it runs with the eval's own interpreter (deps already installed editable, header ignored, no fetch, no publishing); in any other runtime it's `uv run`, resolving the header from PyPI.
Drops the `[tool.uv.sources]` editable-path block and `_editable_dist`; restores the name-only `_server_distribution`. `server_to_tools` no longer takes a runtime type.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* poc(v1): render servers as standalone uv-scripts (vendored runtime, no verifiers)
The rendered server script no longer imports `verifiers` or the taskset package.
Instead `server_to_tools` vendors a dependency-light runtime (`verifiers/v1/_serverkit.py`, read as source, never imported at serve time) into the script and inlines the server's own config + class source; it reconstructs `cls(config)` against that runtime and serves. So a tool/user server ships as a self-contained PEP 723 uv-script whose only deps are `mcp` + `pydantic` + `uvicorn` + the class's own declared `deps` (all public PyPI), and it `uv run`s in any runtime (incl. a fresh sandbox) with nothing pre-installed and no publishing. Drops `_server_distribution`/`_ref`.
This requires the server to be self-contained (the boundary contract): it may only touch the runtime, its config, the task, and its declared deps; no taskset module globals or sibling imports. Examples updated accordingly:
- glossary: facts move onto the config (server data, shipped as JSON);
- wiki_search: the corpus + chroma index build moves into `setup` (deletes corpus.py);
- wikispeedia: the SNAP article/link load moves into `setup` (stdlib only);
- color_codeword: the square-rendering helpers move into the user class (deps=["pillow"]);
- textarena: `latest_feedback` + `OUTCOME_FILE` move onto the user class.
Verified: all six render to valid verifiers-free scripts that serve; glossary (1.0) and alphabet_sort (user_completed) pass e2e on the default docker harness.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): launch tool/user servers via a full-verifiers runtime
Replace the rendered, verifiers-free PEP 723 uv-script (a vendored `_serverkit` plus the class source inlined via `inspect.getsource`) with a generic launcher: `python -m verifiers.v1.toolserver` imports the real `Toolset`/`User` class from its installed env module and serves it over MCP.
- Host (`subprocess`) runtime: run with the eval's own interpreter: `verifiers` and the env module are already installed, nothing is fetched.
- Sandbox runtime: upload the env package and `uv pip install` it (pulling git-pinned `verifiers`, now declared as an env-package dependency) before running the launcher. This lifts the self-containment contract — servers may freely `import verifiers`, import siblings, and use module-level globals — and deletes `_serverkit.py` and the render/inline machinery. The task is reconstructed from its real subclass (`VF_TASK_CLS`), so taskset-specific fields validate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): pin sandbox verifiers to the launcher commit + ensure git Point `_VERIFIERS_PIN` at the pushed commit that has the generic launcher, and install a git client in the sandbox before the git-pinned `verifiers` install (slim base images lack one). Verified end-to-end: glossary-v1 tool server in a docker runtime (in-container install of git-pinned verifiers + the env package) and in modal; alphabet-sort-v1 user simulator on subprocess. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(v1): reach a modal-hosted server via modal's own port forwarding A host-side harness/framework couldn't reach a tool/user server hosted in a modal sandbox: modal publishes sandbox ports (not host ones), but the runtime only implemented `expose` (host -> sandbox via prime_tunnel), so `public_url` fell back to localhost and the connection failed. Implement `public_url` on the modal runtime using modal's native forwarding: reserve a fixed internal service port via `encrypted_ports` at `Sandbox.create` and read its public URL back from `sandbox.tunnels()`. A new `Runtime.published_port` hook lets a self-publishing runtime pre-declare that port; `serve` binds it instead of a free port and the server listens on `0.0.0.0` (MCP_HOST) so the tunnel can forward to it. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): relax MCP DNS-rebinding guard for tunnel-hosted servers FastMCP auto-enables DNS-rebinding protection (allowed_hosts=localhost only) when created with the default host, so a server reached via a sandbox tunnel host (e.g. modal's *.modal.host) is rejected with 421 Misdirected Request. When bound to 0.0.0.0 (a self-publishing runtime behind a tunnel), disable the guard — the tunnel is the trust boundary and the client is ours, not a browser. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(v1): upload working-tree verifiers source to sandboxes (drop the git pin) Instead of installing a git-pinned verifiers in a sandbox, upload the developer's working-tree verifiers source (its wheel-build inputs) alongside the env package and `uv pip install` both. The sandbox runs the exact local code, so there's no push, no pin to bump, and no git client needed in the base image; deps resolve from PyPI off the uploaded pyproject. Verified end-to-end from an uncommitted tree: glossary-v1 tool server in docker and in modal, reward 1.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): tidy vf-native example servers - Rename the `name` ClassVar to `TOOL_PREFIX` (the model-facing tool prefix), default "". - Promote fixed server data from config fields / class attrs to module constants (glossary FACTS, color COLOR_RGB, wiki-search DATASET, textarena OUTCOME_FILE, the vision fixture's PNG_DATA). - Drop the now-dead `deps` ClassVar (deps come from each env package's pyproject) and the redundant placement docstrings on tools/user config fields. - Fix stale docstrings referencing the removed render path / server.py / colocated default. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(v1): reach a prime-hosted server via native port exposure Unify modal + prime as self-publishing runtimes: share a fixed `_SERVICE_PORT` returned from `published_port`, so `serve` binds it on 0.0.0.0 and relaxes FastMCP's DNS-rebinding guard (the public sandbox host would otherwise 421). Prime's `public_url` already exposes the port via the SDK (`client.expose` -> `ExposedPort.url`); make modal's service port an internal constant rather than a config knob. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): a shared server's setup gets no task (was silently tasks[0]) A `shared` tool server is built once for the whole eval, but `shared_tools`/`serve_shared` passed `tasks[0]` into its `setup` — so a shared server that read the task silently set up from one representative task and served it to every rollout, contradicting the documented contract (`setup`'s task is "None for a shared server"). Pass `None` instead: `server_to_launch` omits VF_TASK/VF_TASK_CLS when there's no task, the launcher hands `setup` `None`, and a shared server that touches the task now fails loudly rather than silently serving one task's data to the whole eval. The shared example (wiki-search) is unaffected — its setup builds the corpus and never reads the task. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): assert a shared server is launched without a task Belt-and-suspenders for the shared-server contract: `serve` raises an informative ValueError if a `shared` server is launched with a task (it must be task-agnostic), instead of relying on its `setup` happening to fail on None. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): trim the _SERVICE_PORT comment Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): make SERVICE_PORT and TUNNEL_LIMITER public Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): runtime reports is_local; merge expose/public_url; host tunnels caller-side The two runtime network methods were asymmetric: `expose` (reach a HOST port from inside a runtime) was host-side and provider-agnostic — the interception pool even faked a throwaway runtime just to call it — while `public_url` (publish an IN-runtime port) was provider-native. - `Runtime.is_local` (class attr): subprocess/docker True, modal/prime False. - Merge the two into one `Runtime.expose(port)` = publish a port running inside this runtime (modal `tunnels()`, prime `client.expose`); None when local. - `host_endpoint(port, is_local)`: a host-side async context manager that reaches a host port from inside a runtime — localhost when local, else one `prime_tunnel`. The interception pool, rollout, and tool serving call it; the runtime no longer reimplements the tunnel. The pool reads `runtime_is_local(config)` off the runtime class (no throwaway runtime) and owns its server + host tunnel on one AsyncExitStack, instead of one redundant tunnel per remote runtime. Verified e2e: glossary-v1 reward 1.0 on subprocess, docker (harness + tool runtime), and modal tool runtime; modal/prime-as-harness interception (prime_tunnel) untested — prime down. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): servers self-launch via their module; split setup/setup_task Drop the generic `toolserver.py` shim — each server module is self-runnable. 
The framework launches `python -m <cls.__module__>`; the module's `__main__` (or a package `__main__.py`) calls `ServerBase.run()`, which rebuilds the server from the environment (`VF_CONFIG` JSON + `VF_TASK`/`VF_TASK_CLS`, or `cli(config)` for a manual debug run — the config class is read off the `Toolset[Config]` generic) and serves it. This works in any runtime: host (ambient), or a sandbox after `_install_in_sandbox` makes the module importable, reached via `run_background([python, "-m", module])`. Consolidate the launch internals: move the serve loop onto `ServerBase._serve`, inline the former `run_mcp_server` (and drop its stale export), and fold `server_to_launch`/`_Launch` into `serve_in_runtime(server, task, runtime, port)`. Net: `serve_server`, `run_mcp_server`, `server_to_launch`, `_Launch`, and `toolserver.py` are gone. Split the setup hook: `setup(self)` (task-agnostic, runs for every server) + `setup_task(self, task)` (per-rollout, SKIPPED for a shared server). `serve()` warns loudly if a shared server overrides `setup_task` (its per-task logic would never run). Examples migrated; wiki-search's corpus build is now `setup` (shared), wikispeedia/textarena split global vs per-task, the user sims use `setup_task`. Verified e2e on subprocess: glossary 1.0, alphabet-sort user-sim drives multi-turn (stop=user_completed), plus flat-module (`-m glossary_v1`) and package (`-m alphabet_sort_v1` via `__main__.py`) launch. 
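A minimal sketch of the self-launch and `setup`/`setup_task` split described above, using only the standard library. `ToyServer`, `ToyConfig`, `from_env`, and the `shared` config field are hypothetical stand-ins for illustration; the real `ServerBase.run()` and its config classes are not reproduced here, and serving is omitted.

```python
import json
import os
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical config; the field names are assumptions for this sketch.
    greeting: str = "hello"
    shared: bool = False


class ToyServer:
    """Stand-in for a self-launching server (not the verifiers ServerBase)."""

    def __init__(self, config: ToyConfig) -> None:
        self.config = config
        self.calls: list = []

    def setup(self) -> None:
        # Task-agnostic hook: runs for every server.
        self.calls.append("setup")

    def setup_task(self, task: dict) -> None:
        # Per-rollout hook: skipped for a shared server.
        self.calls.append(f"setup_task:{task['id']}")

    @classmethod
    def from_env(cls, env: dict) -> "ToyServer":
        # Rebuild the server from environment variables, as a launcher would.
        config = ToyConfig(**json.loads(env.get("VF_CONFIG", "{}")))
        server = cls(config)
        server.setup()
        task_json = env.get("VF_TASK")
        if task_json is not None and not config.shared:
            server.setup_task(json.loads(task_json))
        return server


if __name__ == "__main__":
    # `python -m <module>` would land here; a real server would now serve.
    print(ToyServer.from_env(dict(os.environ)).calls)
```

Keeping the rebuild logic in a classmethod that takes a plain mapping makes it testable without touching the process environment.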
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): name the ToolsetConfig placement validator descriptively Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(v1): rename _VERIFIERS_BUILD_INPUTS -> VF_BUILD_INPUTS Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): move tool/user/server code into a verifiers.v1.mcp subpackage Split the cramped `tools.py` + `user.py` into `verifiers/v1/mcp/`: - `server.py` — `ServerBase` (the base authoring class + `run`/`_serve`/`setup`/`setup_task`) - `toolset.py` — `Toolset` + `ToolsetConfig` - `user.py` — `User` + `UserConfig` - `launch.py` — host-side launching: `serve`/`serve_tools`/`serve_shared`/`serve_user`/ `connect_user` + the runtime mechanics (`serve_in_runtime`, `_install_in_sandbox`, …) - `__init__.py` — re-exports the public surface No behavior change. Importers updated (`verifiers.v1`, taskset, rollout, env, interception). The dependency graph is a clean DAG (server ← toolset/user ← launch). Verified e2e on subprocess: glossary-v1 1.0, alphabet-sort-v1 user-sim 1.0 (stop=user_completed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): drop prime cleanup/stop tunnel loops (self._tunnels was removed) cleanup()/stop() still iterated self._tunnels after __init__ stopped initializing it (the prime_tunnel-based expose is gone), which would AttributeError on teardown. Removed the loops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): the env owns serving (shared tools + interception), injected into rollouts Shared tool servers and the interception pool are eval-level resources, but each eval runner stood them up itself: run_eval (in-process) entered both, while the env-server worker pool only entered the interception pool and never set up shared tools. 
So a shared server ran per rollout *with* a task through the env-server path (the non-rich CLI default and prime-rl's path) - rebuilding an expensive corpus each rollout, and tripping the shared-vs-task assertion ("shared server was launched with a task"). Make the Environment own its serving resources in one place: - Environment.serving(tasks) enters shared_tools + interception_pool and stashes them; Environment.episode() injects them into every Rollout at construction. - Episode.run / Rollout.run / run_with_retry drop their shared_urls/interception params - no runner threads them through anymore. - Both run_eval and EnvServer build episodes inside `async with env.serving(...)`. LegacyEnvServer overrides serving() to a nullcontext (v0 runs its own rollouts). The bug went unnoticed because the e2e suite only exercised run_eval, never the env-server pool. Add a run_v1_server fixture (run_eval_server, static 1-worker pool) and test_shared_tools_via_env_server (glossary-v1 tools.shared=True through the pool) to cover that path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): put the fixture dir on PYTHONPATH so self-launching servers resolve in subprocesses A self-launching tool/user server runs `python -m <module>` in a fresh subprocess. That inherits PYTHONPATH but not pytest's in-process `pythonpath` ini, so a fixture server module (echo_multi_v1, tool_response_image_v1) failed to import there ("No module named ...") while an installed example package (glossary_v1) resolved fine. Add a `pytest_configure` that puts tests/v1/fixtures on PYTHONPATH for spawned subprocesses. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): exclude .venv/.git from sandbox source uploads _tar_source (uploads the env package to a docker/prime sandbox) only skipped __pycache__, so an env package whose dir contains a .venv would tarball gigabytes (a .venv is many GB) into an in-memory gzip and stream it over `docker exec -i cat` - effectively an infinite hang on the first docker/prime rollout. Skip a denylist of build/VCS/cache dirs (.venv, .git, .pytest_cache, .mypy_cache, .ruff_cache, node_modules, __pycache__) so only real source ships. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): e2e matrix over server runtime x agent runtime + multimodal VLM Restructure the v1 e2e tests around the three runtimes a rollout places things in - the user simulator's, the tool server's, and the agent (harness) runtime: - test_user (merge of the old test_multi_turn + test_user_sim_placement): a vf.User across user_runtime (colocated / own runtime: subprocess/docker/prime) x agent_runtime. - test_tool (merge of test_tool_placement + test_multi_turn_with_tools + test_shared_tools_via_env_server): a vf.Toolset across tool_runtime (colocated / shared / own runtime) x agent_runtime; the shared case runs through the env-server pool (regression guard for serving shared tools once per eval). - echo_tool_v1 fixture: an echo tool that stamps its output with a token the prompt never reveals, so reward 1.0 proves the tool was reachable and ran. - echo_multi_v1 -> echo_user_sim_v1 (clearer name); drop the now-unused harness_supports fixture. - test_tool_response_image uses a vision model (qwen/qwen3-vl-8b-instruct); the default text model has no image route. - tests/v1/fixtures/pyproject.toml: package the fixtures so a sandbox installs just this dir (its own pyproject) instead of climbing to the repo root and tarring the whole tree. 
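The denylist fix for sandbox source uploads described earlier in this message can be sketched as follows. This is an illustration under assumptions, not the repository's `_tar_source`: the function name `tar_source` and the in-place `os.walk` pruning are choices made for this sketch; only the list of skipped directory names comes from the commit message.

```python
import io
import os
import tarfile
from pathlib import Path

# Directory names taken from the commit message's denylist.
SKIP_DIRS = frozenset(
    {".venv", ".git", ".pytest_cache", ".mypy_cache", ".ruff_cache", "node_modules", "__pycache__"}
)


def tar_source(root: Path) -> bytes:
    """Gzip-tar the files under `root` in memory, pruning denylisted dirs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune in place so os.walk never descends into skipped dirs.
            dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
            for name in sorted(filenames):
                path = Path(dirpath) / name
                arcname = path.relative_to(root).as_posix()
                tar.add(str(path), arcname=arcname, recursive=False)
    return buf.getvalue()
```

Pruning `dirnames` before descending means a multi-gigabyte `.venv` is never even listed, rather than being walked and filtered afterwards.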
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): self-describing parametrize ids for the e2e matrix Give every fixture param an explicit id so a case reads as a sentence instead of `[rlm-subprocess]`: - agent runtime -> `in-<rt>-runtime`; harness -> `<name>-harness` - user runtime -> `with-user-colocated` / `with-user-in-<rt>-runtime` - tool runtime -> `with-tool-colocated` / `with-tool-shared` / `with-tool-in-<rt>-runtime` e.g. `test_single_turn[rlm-harness-in-subprocess-runtime]`, `test_tool[in-docker-runtime-with-tool-shared]`. agent_runtime leads the user/tool signatures so the agent's runtime reads first. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): drop the redundant -runtime suffix from parametrize ids `in-subprocess-runtime` -> `in-subprocess`, `with-tool-in-docker-runtime` -> `with-tool-in-docker`, etc. Reads the same, less noise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): clear error when VF_TASK is set without VF_TASK_CLS + ruff format ServerBase.run() read os.environ["VF_TASK_CLS"] directly, so a VF_TASK without its paired VF_TASK_CLS raised a bare KeyError. The framework always sets both together (launch.py), so this only bites a manual/misconfigured launch - raise a descriptive ValueError instead. Also apply `ruff format` (earlier commits were format-clean under `ruff check` but not `ruff format`). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(v1): ruff format interception/pool.py + runtimes/base.py Format-only (line wrapping); these were format-clean under `ruff check` but not `ruff format`. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): example envs put each server in its own self-launching servers/<name>.py Separate server code from taskset code: each env's tool/user server moves out of the taskset module into <env>/servers/<name>.py, a self-launching module ending with `if __name__ == "__main__": <Server>.run()` (framework launches `python -m <env>.servers.<name>`). The taskset module imports the server from .servers and uses it in tools()/user(); shared constants/data the server needs live in the server module. Flat envs (glossary, deepwiki) become packages; package envs drop their __main__.py. - glossary -> servers/facts.py (+ facts.json beside it) - deepwiki -> servers/deepwiki.py - alphabet_sort, color_codeword -> servers/user.py - wiki_search, wikispeedia -> servers/wiki.py (wikispeedia keeps graph.py in the package root) GUIDE.md "Tools and user simulators" rewritten to the current vf-native surface (vf.Toolset / vf.User classes, @vf.tool / respond, setup / setup_task, the servers/<name>.py layout, per-server placement with own-host-runtime default; tools + users can coexist). Verified: all 6 envs' server classes resolve to <env>.servers.<name>; glossary (tool) reward 1.0 on subprocess + docker; alphabet-sort (user sim) reward 1.0 on subprocess + docker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): bridge a shared host tool to the host when the harness runs remotely A `shared` tool on a host (subprocess/docker) runtime yielded a plain `http://127.0.0.1:<port>` URL, because serve_shared called serve() with no agent context so serve() took the `else` branch (`expose() or local`). That's reachable from a host-network harness but DEAD to a harness in a prime/modal sandbox — the per-rollout path bridges via host_endpoint, the shared path had no equivalent and nothing validated it (untested: prime was down). 
Thread the harness runtime's locality into the shared path: Environment.shared_tools passes `runtime_is_local(harness.runtime)` -> serve_shared -> serve(agent_is_local=...), and serve()'s own-runtime/shared branch is unified to `expose(port) or host_endpoint(port, harness_local)`. So a shared host tool now gets one host tunnel (reused by every rollout) when the harness is remote, localhost when it's local, and a remote tool runtime still publishes its own URL. Verified: shared tool + subprocess harness (env-server path) still reward 1.0. The shared + remote harness case mirrors the per-rollout bridge but is still untested (prime infra down). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): address review findings — colocated port clash, connect_user mislabel, config MRO - serve(): a colocated server is reached in-sandbox at localhost, so it now takes a free in-sandbox port instead of the runtime's published_port (a fixed SERVICE_PORT). Two colocated servers sharing one remote sandbox (a colocated tool + user, or two tools) would otherwise both bind SERVICE_PORT and the second's probe would fail. published_port is reserved for actually- exposed ports (a for_host server, or a tool in its own remote runtime) — and since only the one for_host server per rollout ever exposes, modal's single encrypted SERVICE_PORT suffices. - connect_user(): an exception from the harness body (thrown back at the yield) was caught with connected=True and re-wrapped as "connection lost", misdirecting debugging. Track an in_body flag and propagate body exceptions untouched; only genuine transport failures are wrapped. - ServerBase._config_cls(): walk the MRO so a further subclass that doesn't re-parameterize (class B(MyToolset)) inherits its config instead of raising "must parameterize its config". 
- Docs: note _free_port()'s accepted TOCTOU window (covered by the retryable probe) and that the subprocess API_KEY strip also applies to a task's tool/user server (use its own runtime if it needs a key). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style: ruff format textarena_v1 under ruff 0.15.17 CI's ruff-action pins no version so it runs the latest (0.15.17), which formats this file differently than a local 0.15.12 (a blank line + a couple of wraps). Format-only; brings the repo clean under the CI ruff so the Ruff check passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): one reachability resolver (reachable_url) for serve + interception The "which URL is reachable from where" logic was open-coded in three places (serve()'s for_host/colocated/own-or-shared branches, the interception pool, and the per-rollout interception fallback), all over the same two primitives. Lift it into a single resolver that owns the table: reachable_url(service, port, *, consumer) # service/consumer each a Runtime or HOST - same place (colocated, or host->host) -> localhost - service in a sandbox (remote runtime) -> its own expose() (reachable anywhere) - service on the host network, consumer remote -> a host_endpoint tunnel `serve` (tools/users), InterceptionPool, and Rollout._serve_interception now all route through it, so port exposure / tunneling lives in one auditable function. The two primitives (Runtime.expose out, host_endpoint in) are unchanged. No behavior change — verified across the full non-prime e2e matrix (server-runtime x agent-runtime, plus interception via every harness). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(v1): drop redundant top-level docstrings from env server modules The servers/<name>.py modules just restated their class + a "self-launching python -m ..." line; the class docstring + the GUIDE cover it. Removed. 
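The three-row reachability table from the `reachable_url` commit above can be sketched as a pure function. Everything here is a hypothetical model: `Place`, `public_host`, and `tunnel_host` are names invented for this sketch, no tunnel is actually opened, and the real resolver works on runtime objects rather than on this dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    """Hypothetical stand-in for a runtime, or for the host itself."""

    name: str
    is_local: bool  # on the host network (host, subprocess, docker)
    public_host: Optional[str] = None  # set only for a self-publishing sandbox


HOST = Place("host", is_local=True)


def reachable_url(
    service: Place, port: int, *, consumer: Place, tunnel_host: str = "tunnel.example"
) -> str:
    # Same place (colocated) or both on the host network: plain localhost.
    if service == consumer or (service.is_local and consumer.is_local):
        return f"http://127.0.0.1:{port}"
    # Service inside a remote sandbox: it publishes its own URL.
    if not service.is_local:
        if service.public_host is None:
            raise ValueError(f"{service.name} cannot publish port {port}")
        return f"https://{service.public_host}"
    # Service on the host network, consumer remote: a host tunnel is needed.
    return f"https://{tunnel_host}"
```

Centralizing the decision in one function is what makes the table auditable: each branch corresponds to exactly one row.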
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(v1): move codex to the agentic matrix (it no-ops on the echo single-turn task) codex is an autonomous coding agent: on `test_single_turn`'s no-op chat echo it often completes its loop without ever calling the model (0 nodes -> reward 0), flakily (some runs/docker it does reply). A stricter prompt didn't help and a lighter reward can't match zero output. On a task with a concrete action it's reliable, so move it from the `harness` fixture (single-turn) to `agentic_harness` (echo-agentic file write) — verified codex reward 1.0 there (subprocess + docker). rlm/kimi-code still cover an agent CLI on the simple single-turn task. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(v1): servers bind an OS-assigned free port and report it back (drop _free_port) _free_port() probed the HOST's 127.0.0.1 for a free port, then handed it to the server — a TOCTOU race, and outright wrong for a colocated tool in a remote sandbox (host-free != sandbox-free; it could even draw SERVICE_PORT). Instead the server now binds its own socket: MCP_PORT when the framework fixed one (a self-publishing runtime's forwarded port), else port 0 — an OS-assigned free port, guaranteed free in whatever environment the server actually runs in. It writes the bound port to MCP_PORT_FILE before setup; serve_in_runtime reads it back (and returns it). Same pattern the interception server already uses (bind 0 + getsockname). _free_port is gone. Verified: full non-prime e2e matrix (test_user + test_tool, subprocess + docker, every placement, incl the cross-boundary port readback) — 15 passed under -n auto. 
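The bind-then-report pattern in the commit above (port 0, then `getsockname`, then a port file the launcher reads back) can be sketched with the standard library. The helper name `bind_and_report` and the write-then-rename step are assumptions made for this sketch; the commit only states that the bound port is written to `MCP_PORT_FILE`.

```python
import socket
from pathlib import Path
from typing import Optional


def bind_and_report(port_file: Path, fixed_port: Optional[int] = None) -> socket.socket:
    """Bind a listening socket and record the port that was actually bound."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the OS for a free port in the environment the server runs in.
    sock.bind(("127.0.0.1", fixed_port or 0))
    sock.listen()
    bound_port = sock.getsockname()[1]
    # Write to a temporary name, then rename, so a reader never sees a partial file.
    tmp = port_file.with_name(port_file.name + ".part")
    tmp.write_text(str(bound_port))
    tmp.replace(port_file)
    return sock
```

Because the server holds the socket from the moment the port is chosen, there is no window in which another process can take the port, which is the race the removed host-side probe had.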
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): textarena user must be colocated (OUTCOME_FILE handoff needs a shared workspace) TextArenaUser writes the game outcome to OUTCOME_FILE via a local open() and game_reward reads it via runtime.read() on the harness rollout's runtime. That only works if the user shares the harness's runtime/workdir — but this branch flipped the UserConfig default to colocated=False, so the user ran in its own workspace and the outcome file was never where scoring looked → reward always 0.0. Pin colocated=True on the taskset's user config (the docstring already assumed it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(v1): textarena raises in setup if its user isn't colocated Belt-and-suspenders on top of the colocated=True default: if someone overrides `--taskset.user.colocated false`, the OUTCOME_FILE handoff silently breaks (reward always 0), so TextArenaUser.setup now fails loudly with the reason instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
os.rename(f"{tar}.part", tar)
if not (cache / subdir).exists():
    with tarfile.open(tar, "r:gz") as t:
        t.extractall(cache, filter="data")
🟢 Low servers/wiki.py:42
tarfile.extractall(filter="data") raises TypeError on Python versions before 3.11.4 (and before 3.10.12), because the filter keyword argument did not exist yet. This crashes setup at runtime on those interpreters. Consider wrapping with a try/except TypeError fallback.
-         t.extractall(cache, filter="data")
+         try:
+             t.extractall(cache, filter="data")
+         except TypeError:
+             t.extractall(cache)
Evidence trail:
- environments/wikispeedia_v1/wikispeedia_v1/servers/wiki.py line 42: `t.extractall(cache, filter="data")` at REVIEWED_COMMIT
- pyproject.toml line 14: `requires-python = ">=3.11,<3.14"` at REVIEWED_COMMIT
- Python 3.11 docs (https://docs.python.org/uk/3.11/library/tarfile.html): "New in version 3.11.4" for extraction filters
- CPython issue #102950 tracking backport to 3.11: https://github.com/python/cpython/issues/102950
- Python docs recommend `hasattr(tarfile, 'data_filter')` for compatibility checking
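A minimal sketch of the `hasattr(tarfile, 'data_filter')` check mentioned in the evidence trail, as an alternative to catching `TypeError`. The helper name `safe_extract` is hypothetical, and the unfiltered fallback is only reasonable for archives from a trusted source.

```python
import tarfile
from pathlib import Path


def safe_extract(archive: Path, dest: Path) -> None:
    """Extract a .tar.gz, using the "data" filter where the interpreter has it."""
    with tarfile.open(archive, "r:gz") as t:
        if hasattr(tarfile, "data_filter"):
            # Extraction filters are available on this interpreter.
            t.extractall(dest, filter="data")
        else:
            # Older interpreter: no filter support, so extract unfiltered.
            t.extractall(dest)
```

Checking for the feature up front avoids masking an unrelated `TypeError` raised during extraction.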
…arness.py (#1708)

* chore(v1): resolve plugins via __all__ export, split into taskset.py/harness.py

Replace the per-plugin load_taskset/load_harness hook with an __all__ export. The loader imports a plugin module, walks its __all__, and finds the single Taskset/Harness subclass; config and task types are read off that class's Taskset[TaskT, ConfigT] / Harness[ConfigT] generic (most-derived first, so a thin wrapper that re-binds the config wins). Zero or >1 exported subclasses raise an informative error.

Restructure every v1 taskset/harness so __init__.py only re-exports + declares __all__, with the implementation in taskset.py / harness.py. Single-file envs become packages (aux scripts move alongside; hatch build switched to a wheel packages target).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): drop trivial re-export docstrings from plugin __init__.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): rename _plugin_class -> _exported_subclass

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Revert "chore(v1): rename _plugin_class -> _exported_subclass"

This reverts commit c45cdc9. Keep the original `_plugin_class` name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
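The `__all__`-based plugin lookup described in this message can be sketched as below. `find_plugin_class` and the bare `Taskset` class are hypothetical stand-ins, not the repository's `_plugin_class` or its real base classes, and reading the config/task types off the generic is left out.

```python
import types


class Taskset:
    """Hypothetical stand-in for the framework's Taskset base class."""


def find_plugin_class(module: types.ModuleType, base: type) -> type:
    """Return the single `base` subclass a plugin module exports via __all__."""
    names = getattr(module, "__all__", ())
    exported = [getattr(module, name) for name in names]
    matches = [
        obj
        for obj in exported
        if isinstance(obj, type) and issubclass(obj, base) and obj is not base
    ]
    if len(matches) != 1:
        # Zero or more than one exported subclass is an authoring error.
        raise ValueError(
            f"{module.__name__} must export exactly one {base.__name__} "
            f"subclass via __all__, found {len(matches)}"
        )
    return matches[0]
```

Only names listed in `__all__` are considered, so helper classes imported into the module do not count as plugins.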
…1710)

* feat(v1): let the user simulator open the conversation when a task has no prompt

A task may now omit its prompt: Task.instruction is optional (default None). When a task carries no prompt and the taskset defines a user simulator, the interception server seeds the simulator's opening turn — respond("") — into the request before the first model call, so the model answers a user message rather than an empty prompt. The existing post-turn loop then drives the remaining turns unchanged.

- Task.instruction: str | Messages | None (default None).
- dialect.extend accepts a None completion (append only the user turn(s)); used to seed the opening turn before the model has spoken.
- Interception server seeds the opening turn, guarded to num_turns == 0 so a later program request (e.g. after a tool call) never re-seeds.
- DefaultHarness + its program emit no opening user message for a None instruction; resolve_prompt allows a None instruction.
- Validation: a None-instruction task needs a user sim (per-rollout ProgramError); a user-sim taskset needs a SUPPORTS_USER_SIM harness (Environment check, mirroring the existing task-tools check).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(alphabet-sort-v1): demonstrate the user-opens-the-conversation path

Add a `user_initiates` flag (default False). When set, the task carries no prompt (instruction=None) and the user simulator delivers the initial sort prompt as its first turn, then the follow-ups — exercising the framework's new opening-turn path. The simulator becomes a simple queue replay (opening + follow-ups), behaviorally identical to before when user_initiates is False.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(alphabet-sort-v1): hardcode the user simulator driving the conversation

Drop the `user_initiates` flag: alphabet-sort always has no prompt on the task and the simulator drives the whole conversation — it opens with the sort prompt, then injects the follow-ups. The episode's user turns are stored as a single `user_turns` queue the simulator replays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): cache the opening respond("") so a retried first request can't skip it

The opening seed was gated only on `trace.num_turns == 0`. If the first model call failed (502) before `add_turn` recorded a turn, the harness's OpenAI SDK retried with a fresh request — re-entering the still-open gate and calling the user simulator's `respond("")` again. The simulator's queue had already advanced, so the retry injected the wrong user message and skipped the opening prompt.

Cache the opening `respond("")` result (messages + done) on the session and reuse it while no turn has been recorded, so `respond` is invoked exactly once and the opening turn is seeded identically on every retry. The `num_turns == 0` gate still closes the seed once the first turn lands (the tool-call-interleaving case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
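The retry-safe opening seed from the last fix in this message can be sketched as follows. `QueueUser`, `Session`, and `seed_opening` are hypothetical names for illustration; the real interception server and session types are not shown, and messages are plain strings here.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple


class QueueUser:
    """Toy user simulator that replays a queue of turns."""

    def __init__(self, turns: List[str]) -> None:
        self.queue = deque(turns)
        self.calls = 0

    def respond(self, message: str) -> Tuple[List[str], bool]:
        self.calls += 1
        turn = self.queue.popleft()  # advances the queue: not idempotent
        return [turn], not self.queue


@dataclass
class Session:
    user: QueueUser
    num_turns: int = 0
    opening: Optional[Tuple[List[str], bool]] = None

    def seed_opening(self) -> Optional[Tuple[List[str], bool]]:
        # The gate closes once the first turn has been recorded.
        if self.num_turns > 0:
            return None
        # Call respond("") once and reuse the result on every retry.
        if self.opening is None:
            self.opening = self.user.respond("")
        return self.opening
```

A retried first request calls `seed_opening()` again and gets the cached opening turn, so the simulator's queue is not advanced a second time.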
…coring (#1711)
* feat(v1): add typed transient Trace.state shared across tool/user servers + scoring
Add `Trace.state`: a typed, mutable per-rollout `State` (StateT) that tool servers (`@vf.tool`) and the user simulator (`respond`) read+write as `self.state` (synced to the host's authoritative `trace.state` over the interception server per call) and that `@reward`/`@metric`/`finalize` read+write directly off the trace. Distinct from `Trace.info`: `state` is transient runtime scratch (counters, game state, the `done` end-of-trajectory flag), never persisted to disk or sent over the wire; `info` stays the free-form persisted artifact dict.
- state.py: `State` (strict, mutable, reserved `done`), `StateT` (defaults to `State`), and a `state_cls` generic-arg resolver.
- Trace generic over (TaskT, StateT); the `state` field is `exclude=True`. `info` unchanged.
- Taskset[TaskT, ConfigT, StateT]; Toolset[ConfigT, StateT]; User[ConfigT, StateT]: all default StateT to `State`, so an env that doesn't customize state adds no generic boilerplate.
- ServerBase: `self.state` + per-call pull/push sync (`_with_state`) over a new interception `/state` GET/PUT channel, wired into servers via `VF_STATE_URL`/`VF_STATE_SECRET`.
- `vf.User.respond` now returns `Messages` (not `(Messages, bool)`); end a trajectory by setting `self.state.done = True`. The interception loop checks `trace.state.done` (and `RolloutSession.refused` checks it before each model call, so a tool can end it too).
- Migrated user sims (echo / alphabet_sort / color_codeword / textarena); added a counter-tool-v1 fixture and e2e state tests (in-process + env-server pool).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): make Task.instruction required again (still explicitly nullable)
A task must now set `instruction`; omitting it errors. `None` stays valid and is the explicit opt-in for the user-simulator-opens-the-conversation path (a taskset sets `instruction=None` deliberately rather than inheriting it as a default), so #1710's interception/harness/validation logic keyed on `instruction is None` is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): make state.done a built-in stop + sync only stateful servers
- `state.done` end-of-trajectory check moves out of the interception server's `RolloutSession.refused` (which assumed the state schema) into a built-in `Taskset.done` @vf.stop. refused() runs it generically alongside the taskset's own stops, so the transport layer no longer special-cases the signal.
- The per-call state channel is now wired only for servers that use shared state: a Toolset that declares a custom `State` subclass, or any User (it drives turns and ends via state.done). A stateless toolset (base `State`) skips the wrapper, the per-call GET/PUT, and, on a remote runtime, the channel tunnel. Gated by `ServerBase._uses_state` (overridden True in User) in both `_with_state` and the host-side `serve`.
- Make `STATE_TIMEOUT` public with a docstring; tighten the `Trace.state` docstring.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): always sync the state channel (drop _uses_state gating)
Every tool/user server in a rollout syncs `self.state` per call again. The per-call GET/PUT is localhost-cheap in the common case (subprocess/colocated/docker on the host), so gating it on whether the server declares custom state wasn't worth the asymmetry. `done` from a base-`State` tool now works without declaring a State subclass, and the channel wiring is uniform for tools and users.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): move end-of-trajectory fully into user space (no framework done)
The base `vf.State` is now empty: the framework holds no opinion about its contents. A taskset that ends a trajectory from state declares its own flag and a `@vf.stop` over it; the interception server no longer references `state.done` anywhere (the opening-turn and post-turn loops just rely on `refused()` running the taskset's stops). The stop reason is the `@stop` method's name, so it's informative and taskset-controlled.
- state.py: drop the reserved `done` field; State is a blank typed canvas.
- taskset.py: drop the built-in `Taskset.done` stop.
- interception/server.py: drop the opening + post-turn `state.done` checks.
- User sims declare their own state + stop: echo/alphabet_sort/color_codeword use `user_finished`, textarena uses `game_over`.
- Docs (GUIDE + docstrings) show the field+`@stop` pattern.
Verified: alphabet_sort (opening-turn + instruction=None) ends with stop_condition 'user_finished', reward 1.0; test_user/test_tool_state/test_tool green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(v1): warn on last-write-wins state; 400 on mismatched state PUT
- GUIDE + _with_state docstring: the per-call state sync is a whole-object read-modify-write, so concurrent tool calls (several tool_calls in one turn) are last-write-wins and can lose each other's writes. Keep shared-state mutations on the sequential path; taskset + servers must share one State.
- handle_state_put: catch pydantic ValidationError and return 400 with the reason (a server pushing a shape that doesn't fit the trace's State type, usually a StateT mismatch) instead of an unhandled 500.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): also 400 on malformed-JSON state PUT (not just mismatched shape)
Broaden handle_state_put's except to (ValidationError, ValueError) so a JSONDecodeError from request.json() surfaces as a clean 400 too, not a 500.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* style(v1): ruff format interception/server.py
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
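The last-write-wins warning in the docs commit above is easy to reproduce outside the framework. The sketch below is a plain-Python simulation under stated assumptions: `CounterState` and `Host` are hypothetical stand-ins for a shared `State` and the `/state` GET/PUT channel, not the real verifiers types. It shows why a whole-object read-modify-write loses an increment when two calls pull before either pushes.

```python
import copy
from dataclasses import dataclass


@dataclass
class CounterState:
    """Hypothetical shared state with a single counter field."""

    count: int = 0


class Host:
    """Stand-in for the host's authoritative state, exposed as whole-object GET/PUT."""

    def __init__(self):
        self.state = CounterState()

    def get(self) -> CounterState:
        return copy.deepcopy(self.state)  # each caller pulls its own snapshot

    def put(self, new: CounterState) -> None:
        self.state = new  # the whole object is overwritten


host = Host()

# Sequential calls: each pull sees the previous push, so both increments land.
for _ in range(2):
    snapshot = host.get()
    snapshot.count += 1
    host.put(snapshot)
sequential = host.state.count

# Concurrent calls in one turn: both pull before either pushes.
host.state = CounterState()
a, b = host.get(), host.get()
a.count += 1
b.count += 1
host.put(a)
host.put(b)  # overwrites a's write
concurrent = host.state.count
```

Here `sequential` ends at 2 while `concurrent` ends at 1, which is the lost update the commit tells authors to avoid by keeping shared-state mutations on the sequential path.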
#1715)
ServerBase._serve only disabled FastMCP's DNS-rebinding protection when the server bound 0.0.0.0 (a self-publishing modal/prime runtime). A host-bound server (127.0.0.1, a subprocess/docker tool) reached by a REMOTE harness over a host_endpoint tunnel then got 421 Misdirected Request, because the guard 421s the tunnel's Host. This failed the test_tool[in-prime-with-tool-{in-subprocess,in-docker,shared}] CI cells.
Relax the guard unconditionally: these servers are reached only by our own harness over localhost or our tunnels, never a browser.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): add an `init` scaffolding command for new environments
`uv run init <name>` scaffolds a v1 environment package following the `environments/*_v1` layout: a `pyproject.toml`, a package whose `__init__.py` re-exports the plugin via `__all__`, and a runnable `taskset.py` (replace `load_tasks` + the `@reward`). Parsed with pydantic-config like the other v1 commands (`InitConfig`); the v1 sibling of v0's `vf-init`.
- `--add-tool` / `--add-user` / `--add-harness` scaffold a `vf.Toolset` / `vf.User` / `vf.Harness` and wire them in. The harness is exported alongside the taskset, selectable via `--harness.id <name>` (the loader filters `__all__` by base type, so one package can export both).
- `--v0` scaffolds a legacy v0 `load_environment` package (delegates to `verifiers.scripts.init`) for backwards compatibility; rejected with `--add-*`.
Registered as the `init` console script; documented in the README quickstart (alongside `validate`) and the GUIDE authoring section.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): scaffold requires-python >=3.11 to match verifiers core
verifiers requires >=3.11,<3.14; the scaffolded env depends on it, so >=3.10 was inconsistent. Declare >=3.11.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): make the init scaffold a minimal skeleton
Drop the baked-in demo (the WORDS list + the <answer>-regex exact-match reward). load_tasks and the @reward are now stubs that raise NotImplementedError (no task-specific data or scoring opinion in the scaffold), matching v0 vf-init's spirit. Tool/user/harness wiring is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
def _names(name: str) -> tuple[str, str, str, str]:
    """`(dash, pkg, stem, prefix)` derived from a raw name: the hyphenated id, the importable
    package (underscores), the `_v1`-less stem (for tool prefixes), and the CamelCase class
    prefix (e.g. `my-task-v1` -> `my-task-v1`, `my_task_v1`, `my_task`, `MyTask`)."""
    dash = name.strip().strip("/").replace("_", "-").lower()
    pkg = dash.replace("-", "_")
    stem = pkg[:-3] if pkg.endswith("_v1") else pkg
    prefix = "".join(part[:1].upper() + part[1:] for part in stem.split("_") if part)
    if not prefix or not prefix[0].isalpha():
        prefix = f"Env{prefix}"
    return dash, pkg, stem, prefix
🟢 Low cli/init.py:46
When name is whitespace-only (e.g., " ") or only slashes (e.g., "/"), the strip() calls on line 50 return an empty string, so pkg becomes empty. This causes env_dir = Path(config.path) / pkg to resolve to the parent directory itself (e.g., ./environments), and scaffolded files like __init__.py and taskset.py are written there instead of a package subdirectory. Consider validating that pkg is non-empty after processing and raising an error for invalid names.
  def _names(name: str) -> tuple[str, str, str, str]:
      """`(dash, pkg, stem, prefix)` derived from a raw name: the hyphenated id, the importable
      package (underscores), the `_v1`-less stem (for tool prefixes), and the CamelCase class
      prefix (e.g. `my-task-v1` -> `my-task-v1`, `my_task_v1`, `my_task`, `MyTask`)."""
      dash = name.strip().strip("/").replace("_", "-").lower()
+     if not dash:
+         raise ValueError(f"invalid environment name: {name!r}")
      pkg = dash.replace("-", "_")
      stem = pkg[:-3] if pkg.endswith("_v1") else pkg
      prefix = "".join(part[:1].upper() + part[1:] for part in stem.split("_") if part)
      if not prefix or not prefix[0].isalpha():
          prefix = f"Env{prefix}"
      return dash, pkg, stem, prefix
Evidence trail:
verifiers/v1/cli/init.py lines 46-56 (_names function), line 289 (env_dir = Path(config.path) / pkg), line 290 (pkg_dir = env_dir / pkg), line 339 (if not config.name: check on raw name, not processed pkg). Python Path behavior: Path('a') / '' resolves to Path('a').
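The review finding can be checked with a stand-alone copy of the function. The sketch below reproduces the quoted `_names` logic under the local name `names` (a renamed copy for illustration, not an import from verifiers) with the suggested guard applied, and confirms the `pathlib` behavior cited in the evidence trail.

```python
from pathlib import Path


def names(name: str) -> tuple:
    """Copy of the `_names` logic quoted above, with the suggested empty-name guard."""
    dash = name.strip().strip("/").replace("_", "-").lower()
    if not dash:
        raise ValueError(f"invalid environment name: {name!r}")
    pkg = dash.replace("-", "_")
    stem = pkg[:-3] if pkg.endswith("_v1") else pkg
    prefix = "".join(part[:1].upper() + part[1:] for part in stem.split("_") if part)
    if not prefix or not prefix[0].isalpha():
        prefix = f"Env{prefix}"
    return dash, pkg, stem, prefix


# Joining with an empty segment leaves the path unchanged, which is why an empty
# `pkg` would make the scaffold write into the parent directory.
same_dir = Path("environments") / "" == Path("environments")

ok = names("my_task_v1")

# Whitespace-only, slash-only, and empty names are all rejected by the guard.
rejected = []
for bad in ("   ", "/", ""):
    try:
        names(bad)
    except ValueError:
        rejected.append(bad)
```

With the guard in place, a valid name still yields `("my-task-v1", "my_task_v1", "my_task", "MyTask")`, and all three degenerate inputs raise before any directory is computed.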
* feat(v1): default the harness runtime to subprocess
The harness `runtime` defaulted to `DockerConfig()`, so every eval/train run
without an explicit `--harness.runtime.type` tried to start a container even
though most tasksets just run a local process. Flip the default to
`SubprocessConfig()` — the common, dependency-free case — and let tasksets that
genuinely need a container opt in via `--harness.runtime.type docker` (or
prime/modal). Tasksets carrying a per-task image or `NEEDS_CONTAINER` already
raise a clear error against the subprocess runtime, so the container-requiring
paths stay guarded.
Tool servers and the user simulator already default to subprocess; this aligns
the harness with them.
- Drop the now-redundant `runtime = { type = "subprocess" }` from the example
configs (alphabet_sort, textarena, wordle, gsm8k_rlm); docker stays explicit
where it's required (terminal-bench-2, harbor).
- `validate` keeps its docker default (a model-free gold check often needs the
task's declared container); reword its docs now that they can't say "like eval".
- Update README/GUIDE quickstart + runtime tables to mark subprocess as default.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* style: wrap runtimes import in harness.py (ruff format)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
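The default flip described in the commit above follows a common pattern: a dependency-free default plus a guard for tasks that cannot run on it. The sketch below is a rough illustration only. `SubprocessConfig` and `DockerConfig` are named in the commit, but their fields here, along with `HarnessConfig` and `check_runtime`, are assumptions made for this example and not the verifiers definitions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubprocessConfig:
    type: str = "subprocess"


@dataclass
class DockerConfig:
    type: str = "docker"
    image: Optional[str] = None


@dataclass
class HarnessConfig:
    # Default to the dependency-free runtime; containers are opt-in.
    runtime: object = field(default_factory=SubprocessConfig)


def check_runtime(config: HarnessConfig, needs_container: bool) -> None:
    """Reject a container-requiring task when the runtime is a plain subprocess."""
    if needs_container and config.runtime.type == "subprocess":
        raise RuntimeError(
            "this task needs a container; pass --harness.runtime.type docker"
        )


default_config = HarnessConfig()
docker_config = HarnessConfig(runtime=DockerConfig(image="example:latest"))

check_runtime(default_config, needs_container=False)  # local task: fine
check_runtime(docker_config, needs_container=True)    # explicit opt-in: fine

try:
    check_runtime(default_config, needs_container=True)
    guarded = False
except RuntimeError:
    guarded = True
```

The guard is what keeps the new default safe: a task that needs a container fails loudly on the subprocess runtime instead of running in the wrong environment.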
…ing (#1717)
* fix v1 train client prefix bridging
* feat(v1): token-based prefix reuse in the message graph
Refine prepare_turn's message-hash prefix by token identity at commit: reuse a stored prefix node only when its tokens match what this turn rendered (longest common token prefix of the concatenated prefix vs prompt_ids), forking at the first divergence so retokenization drift surfaces as a branch instead of silent mis-attribution. Bridge path keeps the prior verbatim (matches fully, stays linear); eval path has no token ids (falls back to message hash).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(v1): message-level vs renderer-level branching + leaf->root token invariant
Two branching test cases asserting the graph invariant (leaf->root concat == the engine's prompt_ids + completion_ids): message-level fork via compaction (hash divergence, tokenless), and a renderer-level break (prior <think> dropped on re-render) that forks only under the train client (token ids) and is invisible to the eval relay. Document both + the invariant in ARCHITECTURE.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(v1): drop the branching unit tests (keep the token-reuse impl + ARCHITECTURE design notes)
Remove the message-level / renderer-level branching unit tests and their helpers from test_graph.py, and the test/validation paragraph from the ARCHITECTURE branching section; branching is exercised end-to-end instead. The graph token-based reuse + the conceptual ARCHITECTURE notes (branch types + the leaf->root token invariant) stay.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(v1): explain why ToolMessage.name exists (GPT-OSS Harmony + bridge)
Most renderers key a tool result off tool_call_id, but GPT-OSS Harmony renders the function name (functions.<name>, else functions.unknown → broken token parity). The bridge sharpens it: it renders only the tail, so the issuing assistant's tool call is in the reused prefix and can't be recovered from the tail; the dialect recovers the name once from the full prompt and carries it on ToolMessage so later bridge tails have it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(v1): this PR adds no tests (drop the bridge tests from the diff)
Restore test_graph.py to the merge-base and delete test_train_client.py so the PR carries no test additions; branching/bridge are exercised end-to-end instead. The bridge + token-reuse implementation is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* perf(v1): avoid O(context) per-token scans in turn commit
The per-turn token-attribution path ran two O(context) per-token Python builds every turn, synchronous on the env-server worker event loop, so at multiplex=128 they serialize across rollouts (head-of-line blocking). At 500+ turns near the context cap this is seconds of blocking per rollout.
- _commit_turn: replace the full `stored` concatenation + per-token `while` LCP with a node-wise C-level slice compare (short-circuits at the first divergent node): ~8.6x faster (7.0 -> 0.8ms @128k), no full copy.
- previous_token_ids: nested per-token comprehension -> per-node extend (~3.4x faster).
- train client: build the (O(context)) previous-turn ids only after the cheap bridge guards pass, so non-bridgeable turns don't pay for it.
Behavior-identical: ruff clean, test_graph.py passes, and a full fix-git train re-run yields the same graph (1 branch, invariant lossless).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): drop add_turn wrapper for prepare_turn/commit
Remove the `add_turn` convenience wrapper; every caller now uses the explicit two-step `prepare_turn(trace, prompt).commit(response)`, so there is one obvious way to build the graph. The v0 legacy bridge and the graph tests are migrated; docstring references updated.
Also add a graph test for the renderer-level break: two turns with the same message sequence (identical hashes) but a retokenized prior assistant turn fork by token identity (2 branches), while matching tokens stay linear (1 branch), and each branch's leaf->root concat equals its own prompt_ids + completion_ids.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: eligotts <78387377+eligotts@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
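The node-wise comparison mentioned in the perf commit above can be sketched in a few lines. This is an illustration under assumptions, not the `_commit_turn` implementation: `matched_prefix_nodes` is a hypothetical helper, nodes are modeled as plain lists of token ids, and the fork point is taken at node granularity.

```python
def matched_prefix_nodes(nodes: list, prompt_ids: list) -> int:
    """Return how many leading nodes match `prompt_ids` token for token."""
    offset = 0
    for index, tokens in enumerate(nodes):
        end = offset + len(tokens)
        # One list-slice comparison per node (done in C), instead of a
        # per-token Python loop over a concatenated copy of every node.
        if prompt_ids[offset:end] != tokens:
            return index  # fork here: this is the first node that diverges
        offset = end
    return len(nodes)


stored = [[1, 2, 3], [4, 5], [6, 7, 8]]

# Tokens rendered this turn match every stored node: the chain stays linear.
linear = matched_prefix_nodes(stored, [1, 2, 3, 4, 5, 6, 7, 8, 9])

# Retokenization drift in the second node: only the first node is reusable.
drifted = matched_prefix_nodes(stored, [1, 2, 3, 4, 99, 6, 7, 8])

# The leaf->root invariant: concatenating the chain reproduces
# prompt_ids + completion_ids for the turn at the leaf.
prompt_ids, completion_ids = [1, 2, 3, 4, 5], [6, 7, 8]
invariant_holds = sum(stored, []) == prompt_ids + completion_ids
```

Because the loop returns at the first mismatching node, a divergence early in a long context costs only the slices up to that point, and no full copy of the stored prefix is built.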
* test(v1): fix test_legacy run_v1 kwarg (runtime -> agent_runtime)
_eval_config (and every test_e2e caller) takes agent_runtime; test_legacy passed runtime=, raising TypeError when the e2e tests run.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* ci: run v1 tests in a separate -n auto -vv step
Split tests/v1 out of the main test step into its own step run with pytest-xdist (-n auto) and -vv, and exclude it from the main step (--ignore=tests/v1). Coverage is appended across the two steps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(v1): package counter_tool_v1 fixture for sandbox installs
test_tool_state launches the counter toolset inside a docker/prime runtime via `python -m counter_tool_v1`, which uploads + installs the fixtures package. The module was missing from the fixtures pyproject `include`, so the sandbox wheel omitted it and the tool server died with "No module named counter_tool_v1" (surfacing as "server did not report its port"). Subprocess was unaffected (host PYTHONPATH). Add it to `include`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): surface the server log when a sandbox tool server never reports its port
The port-file timeout raised a bare "did not report its port", hiding why the server died (e.g. an import error in the sandbox venv). Mirror the probe-failure path and append the server log tail, turning an opaque 180s timeout into an actionable error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(v1): make log_tail a public module-level helper
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
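The port-timeout fix above amounts to attaching the tail of the server log to the timeout error. The sketch below shows that idea with the standard library only. The name `log_tail` comes from the commit, but its signature here, the `wait_for_port` helper, and the file layout are assumptions for illustration.

```python
import tempfile
import time
from pathlib import Path


def log_tail(path: Path, lines: int = 20) -> str:
    """Return the last `lines` lines of a log file, or a placeholder if unreadable."""
    try:
        text = path.read_text(errors="replace")
    except OSError:
        return "<no server log>"
    return "\n".join(text.splitlines()[-lines:])


def wait_for_port(port_file: Path, log_file: Path, timeout: float) -> int:
    """Poll `port_file` until it holds a port number; on timeout, include the log tail."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_file.exists() and port_file.read_text().strip():
            return int(port_file.read_text())
        time.sleep(0.01)
    raise TimeoutError(
        f"server did not report its port within {timeout}s; "
        f"log tail:\n{log_tail(log_file)}"
    )


# Simulate a server that died on import before writing its port file.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "server.log"
    log.write_text("Traceback ...\nModuleNotFoundError: No module named 'counter_tool_v1'\n")
    try:
        wait_for_port(Path(tmp) / "port", log, timeout=0.05)
    except TimeoutError as exc:
        message = str(exc)
```

The error message now carries the import failure from the log, so the cause is visible without opening the sandbox.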
…v1 envs (#1721)
* test(envs): run env smoke tests through the unified eval CLI (v0/v1 dispatch)
The env tests assumed the v0 contract (vf.load_environment + vf-eval + hub tags/README), so every v1 plugin failed. Dispatch on the env style instead:
- Eval via the `eval` CLI: a `_v1` taskset through `--taskset.id`/`--harness.id default`, a v0 env through the legacy `--id` bridge. Capped (-n 1 -r 2 --max-turns 4 --sampling.max-tokens 512 --rich false) so CI stays quick; `-r 2` because a taskset with @group_reward(s) needs >=2 rollouts.
- Load check dispatches too: v0 -> load_environment, v1 taskset -> taskset_class, the compact harness -> harness_class.
- Metadata: `tags` + README are a v0 hub convention, so they're only required of v0 envs; v1 plugins (the `_v1` examples + `compact`) are exempt.
- Skip what can't run in plain CI: the SWE/container v1 tasksets (r2e_gym_v1, scaleswe_v1, swelego_v1, terminal_bench_2_v1); `compact` (a harness, not an evaluatable taskset); self_reward (group-only rubric the v0 bridge can't score per-rollout).
- Run the test-envs job with `-n auto`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* style: ruff format test_envs.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(envs): keep test_envs.py v0-only, add tests/v1/test_envs.py for v1
The v0 env smoke tests (vf.load_environment + vf-eval, tags/README metadata) don't fit v1 plugins, so filter `get_environments()` to v0 envs only; the `_v1` tasksets and the `compact` harness are excluded.
Add tests/v1/test_envs.py: smoke-eval every `_v1` taskset in environments/ through the `eval` CLI (--taskset.id <id> --harness.id default) for one short capped rollout (-n 1 -r 2 --max-turns 4 --sampling.max-tokens 512), and require success. `compact` is excluded (a harness, not a taskset); the SWE/container tasksets (r2e_gym_v1, scaleswe_v1, swelego_v1, terminal_bench_2_v1) skip because they need a docker/prime runtime and are covered by the dedicated v1 e2e tests.
This supersedes the earlier in-place v0/v1 dispatch in test_envs.py (and the test-envs -n auto change), keeping the v0 test as it was.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>