[Harbor 4/4] architecture docs, tutorial, and the GAIA example #6

Open
varunursekar wants to merge 1 commit into harbor-3-compiler from harbor-4-docs

@varunursekar varunursekar commented Jun 24, 2026

Draft · Stack 4 of 4 — targets harbor-3-compiler. Additive, low-risk.

  • docs/harbor/architecture.md — what it is, the compiled-task topology, the two modes, the component map, and the leaderboard-integrity model.
  • docs/harbor/tutorial.md — build + run end to end (both modes, the agent-side protocol); README Harbor section.
  • examples/gaia-optimization — a Mode-B example optimizing a GaiaAgent (thin Terminus2 subclass with an editable prompt) on gaia/gaia via a nested harbor run on Modal.

Start here for the big picture, then dive into [1/4]–[3/4].

Stack: [1/4] core → [2/4] sidecar → [3/4] compiler → this.

🤖 Generated with Claude Code

Greptile Summary

This PR is the final stack entry (4/4) for the Harbor integration, adding architecture docs, a tutorial, and a runnable Mode-B example (gaia-optimization) that optimizes a GaiaAgent prompt against real GAIA tasks via a nested harbor run on Modal.

  • Documentation (docs/harbor/architecture.md, docs/harbor/tutorial.md): covers the compiled-task topology, the two evaluation modes (A = vero scores, B = nested Harbor run scores), the trust boundary / leaderboard-integrity model, and the full CLI walkthrough end to end.
  • GAIA example (examples/gaia-optimization): a thin Terminus2 subclass that redirects the prompt-template path to an editable prompts/ directory, a build.yaml wiring up Mode B on Modal, and the copied prompt templates that form the optimization surface.
  • README update: adds a Harbor integration section with a quick-start snippet and links to the new docs.

Confidence Score: 4/5

Entirely additive — new docs and an example package with no changes to vero core; safe to merge.

The changes are documentation and an example that adds no new runtime paths to vero itself. The two issues found are minor: typos in the XML prompt template (which is the optimization surface — an optimizer would fix them during a run anyway) and a potential @staticmethod vs instance-method mismatch on version() in GaiaAgent that could surface only if Harbor calls GaiaAgent.version() as a class-level static.

src/gaia_agent/agent.py and src/gaia_agent/prompts/terminus-xml-plain.txt have the two flagged issues; all other files are clean.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/README.md | Adds a Harbor integration section with install snippet and links to docs/examples; purely additive and accurate. |
| vero/docs/harbor/architecture.md | New architecture doc covering the compiled-task topology, two modes (A/B), trust boundary, and component map; thorough and internally consistent. |
| vero/docs/harbor/tutorial.md | New tutorial doc showing both modes end to end, the agent-side protocol, and how to inspect runs; matches the architecture doc. |
| vero/examples/gaia-optimization/README.md | Clear example README covering prerequisites, run instructions, caveats, and attribution for the copied prompt files. |
| vero/examples/gaia-optimization/build.yaml | Mode-B build config with sensible budget (3 train evals), correct split visibility tiers, and well-commented placeholder task IDs. |
| vero/examples/gaia-optimization/pyproject.toml | Minimal package config; force-include correctly bundles the editable prompts directory into the wheel so agent.py's file-relative path resolves after install. |
| vero/examples/gaia-optimization/src/gaia_agent/agent.py | Thin Terminus2 subclass redirecting the prompt-template path; accesses private _parser_name and has a version()/name() static/instance inconsistency worth aligning with the base class. |
| vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-json-plain.txt | JSON prompt template copied from Harbor's terminus_2; correctly uses double-braced literals and single-braced format variables; no issues. |
| vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-xml-plain.txt | XML prompt template with two typos ("apprpriate" and "In is always possible") that will appear verbatim in every agent call. |
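
The note on double-braced literals and single-braced format variables matches the escaping rules of Python's `str.format`. Whether the agent renders its templates with `str.format` is an assumption here, since the PR does not show the rendering call. A minimal sketch with a hypothetical one-line template (not the real prompt file):

```python
# Hypothetical template fragment, not the actual terminus-json-plain.txt.
# Doubled braces become literal JSON braces; single braces are format variables.
TEMPLATE = 'Respond with JSON like {{"analysis": "...", "task": "{instruction}"}}'


def render(template: str, **variables: str) -> str:
    """Fill single-braced variables; doubled braces collapse to literal braces."""
    return template.format(**variables)


rendered = render(TEMPLATE, instruction="find the capital of France")
print(rendered)
```

Under that assumption, an unescaped literal brace in an edited template would make `str.format` raise `KeyError` or `ValueError` at render time, so an optimizer editing these files would need to preserve the doubling.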

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Dev as Developer
    participant VeroHarbor as vero harbor CLI
    participant Main as main container (optimizer)
    participant Sidecar as eval-sidecar (vero harbor serve)
    participant Modal as Modal (nested harbor run)
    participant Verifier as tests/test.sh (shared verifier)

    Dev->>VeroHarbor: vero harbor build -c build.yaml -o /tmp/task
    VeroHarbor-->>Dev: Harbor task dir (compose + Dockerfiles + instruction.md)

    Dev->>Main: harbor run -p /tmp/task -a claude-code -e docker
    activate Main
    activate Sidecar
    Note over Sidecar: vero harbor serve starts, writes per-trial admin token (root:600)

    Main->>Main: optimizer edits prompts/, commits
    Main->>Sidecar: "POST /eval?split=train"
    Sidecar->>Sidecar: git fetch commit (file://, hooks disabled)
    Sidecar->>Modal: harbor run GaiaAgent on train tasks
    Modal-->>Sidecar: per-task verifier rewards
    Sidecar-->>Main: aggregate score + remaining budget

    Note over Main: repeat edits + evals within budget

    Main->>Verifier: trial end — tests/test.sh runs
    Verifier->>Sidecar: POST /finalize (admin token)
    Sidecar->>Sidecar: select best train commit
    Sidecar->>Modal: harbor run on hidden validation tasks
    Modal-->>Sidecar: accuracy
    Sidecar-->>Verifier: reward.json
    deactivate Sidecar
    deactivate Main
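
The optimizer-side half of the loop above can be sketched in a few lines of Python. This is an illustration only: the sidecar address, the `commit` and `score` field names, and the tie-breaking rule are assumptions, and only the `POST /eval?split=train` call and the "select best train commit" step come from the diagram.

```python
import urllib.parse
import urllib.request

# Assumption: hostname and port are illustrative; the compiled task
# defines the real sidecar address.
SIDECAR_URL = "http://eval-sidecar:8000"


def build_eval_request(split: str, base_url: str = SIDECAR_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /eval?split=... request from the diagram."""
    query = urllib.parse.urlencode({"split": split})
    return urllib.request.Request(f"{base_url}/eval?{query}", data=b"", method="POST")


def best_commit(history: list[dict]) -> str:
    """Return the commit with the highest score (first one wins on ties).

    The "commit" and "score" keys are hypothetical field names.
    """
    return max(history, key=lambda entry: entry["score"])["commit"]


request = build_eval_request("train")
print(request.get_method(), request.full_url)

history = [
    {"commit": "a1b2c3", "score": 0.40},
    {"commit": "d4e5f6", "score": 0.55},
]
print(best_commit(history))
```

Inside the main container the request would be sent with `urllib.request.urlopen(request)`; the sketch stops short of that so it runs without a sidecar.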

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-xml-plain.txt:26-28
Two typos in these lines: "apprpriate" should be "appropriate", and on the following line "In is always possible" should be "It is always possible". Since this file is the LLM prompt template served verbatim to the inner GAIA agent, these errors appear in every agent turn and could subtly degrade instruction-following.

```suggestion
The `duration` attribute of <keystrokes> specifies the number of seconds to wait for the command to complete (default: 1.0) before the next command will be executed. On immediate tasks (e.g., cd, ls, echo, cat) set a duration of 0.1 seconds. On commands (e.g., gcc, find, rustc) set a duration of 1.0 seconds. On slow commands (e.g., make, python3 [long running script], wget [file]) set an appropriate duration as you determine necessary.

It is better to set a smaller duration than a longer duration. It is always possible to wait again if the prior output has not finished, by running <keystrokes duration="10.0"></keystrokes> on subsequent requests to wait longer. Never wait longer than 60 seconds; prefer to poll to see intermediate result status.
```

### Issue 2 of 2
vero/examples/gaia-optimization/src/gaia_agent/agent.py:30-31
`version()` signature mismatch with `name()`

`name()` is declared as a `@staticmethod` but `version()` is an instance method. If `Terminus2` defines `version()` as a `@staticmethod` (the typical pattern when `name()` is also static), then calling `GaiaAgent.version()` on the class rather than on an instance will fail: the subclass's instance method replaces the base static in attribute lookup, so the call raises a `TypeError` for the missing `self` argument. This is worth aligning with however `Terminus2` declares `version()`.

Reviews (1): Last reviewed commit: "Harbor: architecture docs, tutorial, and..."

Greptile also left 1 inline comment on this PR.

- docs/harbor/architecture.md — what the integration is, the compiled-task topology,
  the two evaluation modes, the component map, and the leaderboard-integrity model.
- docs/harbor/tutorial.md — build and run an optimization task end to end (both modes,
  the agent-side protocol), and a Harbor section in the README.
- examples/gaia-optimization — a Mode-B example optimizing a GaiaAgent (a thin Terminus2
  subclass with an editable prompt) on gaia/gaia via a nested harbor run on Modal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@varunursekar varunursekar requested a review from a team June 24, 2026 18:18
@varunursekar varunursekar marked this pull request as ready for review June 24, 2026 18:22
Comment on lines +30 to +31
def version(self) -> str:
    return "0.1.0"

P2 version() signature mismatch with name()

name() is declared as a @staticmethod but version() is an instance method. If Terminus2 defines version() as a @staticmethod (the typical pattern when name() is also static), then calling GaiaAgent.version() on the class rather than on an instance will fail: the subclass's instance method replaces the base static in attribute lookup, so the call raises a TypeError for the missing self argument. This is worth aligning with however Terminus2 declares version().
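
A minimal sketch of the mismatch in plain Python, using a stand-in `Base` class because the PR does not show how `Terminus2` declares `version()` (the static declaration below is an assumption). It shows what a class-level call does when a subclass swaps a static method for an instance method:

```python
class Base:
    """Stand-in for Terminus2; its real declarations are not shown in this PR."""

    @staticmethod
    def name() -> str:
        return "base"

    @staticmethod
    def version() -> str:
        return "base-version"


class Mismatched(Base):
    @staticmethod
    def name() -> str:
        return "gaia"

    def version(self) -> str:  # instance method, as in the flagged lines
        return "0.1.0"


class Aligned(Base):
    @staticmethod
    def version() -> str:  # matches the (assumed) static base declaration
        return "0.1.0"


print(Mismatched().version())  # instance call works
try:
    Mismatched.version()  # class-level call has no self to bind
except TypeError as exc:
    print("TypeError:", exc)
print(Aligned.version())  # works on the class and on instances
```

In this sketch the class-level call fails with a `TypeError` rather than falling back to the base implementation, so matching the base class's decorator is the safer choice.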

