Skip to content

feat(agent): deliver client tools to Claude over the internal MCP channel#4985

Open
mmabrouk wants to merge 8 commits into
big-agentsfrom
feat/claude-client-tools
Open

feat(agent): deliver client tools to Claude over the internal MCP channel#4985
mmabrouk wants to merge 8 commits into
big-agentsfrom
feat/claude-client-tools

Conversation

@mmabrouk

@mmabrouk mmabrouk commented Jun 30, 2026

Copy link
Copy Markdown
Member

What this does

Client tools are browser-fulfilled tools like request_connection: the model calls the tool, the run pauses, a human answers in the Agenta UI, and the next turn resumes with the stored answer. Before this PR they worked on the Pi harness only. On Claude the tool was filtered out of the MCP advertisement, so the model never saw it and the run silently delivered nothing.

After this PR, Claude sees client tools over the internal agenta-tools MCP server and pauses correctly when it calls one. Both harnesses now pause through one shared seam instead of two divergent code paths.

How the pause works on Claude

Claude calls the tool over loopback HTTP (JSON-RPC tools/call). Answering with an error would settle the call, so the runner instead emits no result and destroys the socket (MCP_PAUSED). Claude cannot settle the call, the turn ends paused, and the browser answer resumes it next turn. The MCP SDK does not retry POSTs, and a duplicate-POST regression test pins that assumption.

This is a recut, not a rebase

The original branch predates the approval-boundary merge, which rewrote every seam this PR touches (responder, permissions, pausing). The branch was rebuilt commit-by-commit against the new architecture:

Reviewed before reaching you

Codex (xhigh) plus an internal deep-review pass ran against the branch; both blockers they found are fixed here:

  1. Claude settings never rendered rules for client tools (Codex). Without a rule, Claude's own permission prompt fired before tools/call, bypassing the new pause path. Fixed in claude_settings.py: client tools render allow unless explicitly denied — the runner-side gate is the authoritative one, and its pause is the ask flow for a client tool.
  2. The ACP correlation index never matched live (internal review). It recorded Claude's mcp__agenta-tools__<name> titles but was queried with bare names, so pause suppression (F-024) and widget attachment silently fell back to a UUID no ACP frame carries. Fixed: titles are normalized on record, and lookups consume the matched id so a duplicate identical call correlates to its own frame. Both fixes carry regression tests that would have caught the bugs.

Also in the fix pass: MCP_PARKED renamed to MCP_PAUSED, the ClientToolRelay type moved to a neutral module, and over-claiming comments corrected.

What to review (inline comments mark each spot)

Read in this order: the seam (client-tools.ts), the FIFO store (responder.ts), the pause mechanism (tool-mcp-http.ts), the gate change (run-plan.ts), the settings change (claude_settings.py). Five inline comments walk you through the reasoning at each spot.

Gate change you already ruled on

#5047's remote-tools gate exempted client tools because they never rode the MCP channel. This PR makes them ride it, so the exemption would reopen the silent drop on remote sandboxes. The gate now counts all custom tools for non-Pi remote runs; Pi stays exempt (its extension delivers via the file relay). The old "allows claude x daytona x client-only" test now asserts refusal.

Tests

Runner: 516 passed (42 files), typecheck clean. Python SDK: 515 passed in the agents suite, wire contract unchanged (goldens untouched).

Known follow-ups (deliberate, not forgotten)

  • Four cosmetic changes (two comment fixes, one extra positive gate test, a type-dedup in relay.ts) are held back because a concurrent lane (pi-builtin-gating) has uncommitted work in those files; they land as a tiny follow-up once that lane commits.
  • Batched tools/call with a paused sibling call is latent (Claude does not batch today) — documented, not fixed.
  • An in-flight gateway call keeps executing server-side after a pause (socket is destroyed, execution is not) — pre-existing window, comment now says so honestly; threading an AbortSignal into dispatch is a tracked follow-up.
  • Live QA (one local-Claude request_connection round trip) still pending; the internal reviewer recommends it before merge and I agree.

https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT

@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jun 30, 2026
@vercel

vercel Bot commented Jun 30, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment Jul 4, 2026 6:20pm

Request Review

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds a shared client-tool "park/resume" relay used by both Pi and Claude tool-delivery paths, extends the internal loopback MCP channel to advertise and park browser-fulfilled client tools without producing JSON-RPC results, separates approval decisions from a new FIFO client-tool output store in the responder, rejects Daytona runs for non-Pi harnesses carrying custom tools, consolidates schema/validation helpers into a shared module, adds idle-backoff relay polling, and updates design documentation and tests accordingly.

Changes

Client-tool relay, MCP delivery, and Daytona gating

Layer / File(s) Summary
Shared tool-spec schema helpers
services/agent/src/tools/spec-schema.ts, services/agent/src/tools/dispatch.ts, services/agent/src/extensions/agenta.ts, services/agent/src/tools/public-spec.ts, services/agent/tests/unit/spec-schema.test.ts
Centralizes input-schema access, required-field detection, and argument assertion in a new module; dependent files import these helpers instead of duplicating them; publicToolSpec(s) renamed to advertisedToolSpec(s).
Responder client-tool output store
services/agent/src/protocol.ts, services/agent/src/responder.ts, services/agent/tests/unit/responder.test.ts
Adds RenderHint "connect" variant and ClientToolRelay/ClientToolOutputs types; splits approval decisions from a FIFO client-tool output store keyed by cold-replay anchors; updates HITLResponder constructor and resolution logic.
Shared client-tool relay seam
services/agent/src/engines/sandbox_agent/client-tools.ts, services/agent/tests/unit/client-tools.test.ts
New module providing tool-call correlation indexing, emitClientToolInteraction, and buildClientToolRelay used by both Pi and Claude delivery paths.
Internal MCP channel parking
services/agent/src/tools/mcp-bridge.ts, services/agent/src/tools/tool-mcp-http.ts, services/agent/tests/unit/tool-bridge.test.ts
Adds options-based relay/signal/log wiring; tools/call parks client-tool requests, aborts in-flight HTTP responses, and tears down on engine abort; tools/list advertises all specs including client.
Sandbox-agent driver wiring
services/agent/src/engines/sandbox_agent.ts, services/agent/src/engines/sandbox_agent/mcp.ts, services/agent/src/engines/sandbox_agent/permissions.ts, services/agent/src/engines/sandbox_agent/pi-assets.ts, services/agent/tests/unit/sandbox-agent-orchestration.test.ts, services/agent/tests/unit/sandbox-agent-pi-assets.test.ts
Wires toolCallIndex, deferred clientToolRelay, and mcpAbort into the driver; threads relay/signal through session MCP servers; simplifies permissions to user-approval only; switches Pi env to advertisedToolSpecs.
Daytona tool-delivery gating
services/agent/src/engines/sandbox_agent/run-plan.ts, services/agent/tests/unit/sandbox-agent-run-plan.test.ts
Adds DAYTONA_TOOL_DELIVERY_UNSUPPORTED_MESSAGE and an up-front rejection for non-Pi Daytona runs with custom tools.
Relay idle-backoff polling
services/agent/src/tools/relay.ts, services/agent/tests/unit/tool-relay-permission.test.ts
Adds relayPollDelayMs exponential backoff capped at RELAY_POLL_MAX_MS, relocates relay types as re-exports from responder.ts.
Design documentation
docs/design/agent-workflows/documentation/..., docs/design/agent-workflows/interfaces/...
Updates docs describing the client-tool park/resume seam, internal MCP channel contract, and Daytona rejection behavior.

Sequence Diagram(s)

sequenceDiagram
  participant Claude
  participant InternalMcpServer
  participant ClientToolRelay
  participant Responder
  participant Browser

  Claude->>InternalMcpServer: tools/call (client tool)
  InternalMcpServer->>ClientToolRelay: onClientTool(request)
  ClientToolRelay->>Responder: onClientTool(request)
  Responder-->>ClientToolRelay: "park" (no stored output)
  ClientToolRelay->>ClientToolRelay: emitClientToolInteraction
  InternalMcpServer->>InternalMcpServer: abort HTTP response (MCP_PARKED)
  Browser->>Responder: deliver browser tool output (next turn)
  Claude->>InternalMcpServer: re-call tools/call
  InternalMcpServer->>ClientToolRelay: onClientTool(request)
  ClientToolRelay->>Responder: takeClientOutput(coldReplayKey)
  Responder-->>ClientToolRelay: resolved output
  ClientToolRelay-->>InternalMcpServer: output
  InternalMcpServer-->>Claude: JSON-RPC result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Possibly related PRs

  • Agenta-AI/agenta#4838: Both hard-gate non-Pi tool delivery support at run-build time, overlapping in the run-time tooling-delivery gating path.
  • Agenta-AI/agenta#4847: Extends the same internal agenta-tools loopback MCP channel to add cross-turn client-tool parking/resume via /tools/call.
  • Agenta-AI/agenta#4934: Implements the frontend counterpart of the same client-tool park/resume round-trip contract introduced here.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 77.14% which is sufficient. The required threshold is 60.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly summarizes the main change: delivering client tools to Claude over the internal MCP channel.
Description check ✅ Passed The description is directly related and accurately explains the client-tool delivery and pause/resume changes.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/claude-client-tools

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@dosubot dosubot Bot added enhancement New feature or request refactoring A code change that neither fixes a bug nor adds a feature labels Jun 30, 2026
@mmabrouk mmabrouk added the needs-review Agent updated; awaiting Mahmoud's review label Jun 30, 2026
@mmabrouk

Copy link
Copy Markdown
Member Author

🤖 The AI agent says:

What to review — two distinct things:

  1. The mechanism (load-bearing): the Claude client-tool park in the MCP tools/call handler — on park it emits NO JSON-RPC result and the request listener aborts the in-flight request (res.destroy(), no body), backed by an engine AbortSignal. The premise is "no-result-before-finish": any normal MCP result would let Claude settle the call and clobber the pending connect widget before the turn ends paused. Please sanity-check that reasoning and the abort path (tools/tool-mcp-http.ts + the deferred relay wiring in engines/sandbox_agent.ts). Unit-tested, but the live Claude round-trip is NOT yet run.

  2. The blocker (please advise): this lane targets services/agent/, which big-agents has since renamed to services/runner/ ([chore] Clean up big-agents platform Part I #4968), plus big-agents stripped the [hitl-resume]/RUN FAILED debug logs from 3 of the files I rewrote. So this PR will not merge cleanlybut pull says the lane is rebasable: false. I did NOT force a rebase (shared workspace + directory rename = scramble risk). Question: do you want me to do the services/agentservices/runner path-migration (mechanical: move + drop the same 3 debug-log blocks), or will you rebase it?

Also flagged in the description: live QA is pending the sidecar deploy, and the inventory index-row table cell couldn't be committed (GitButler hunk-lock against the sibling unassigned README.md files) — the detailed inventory page is updated.

@mmabrouk

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@mmabrouk

Copy link
Copy Markdown
Member Author

🤖 The AI agent says:

Status: implemented, 400 tests green, but not merge-ready yet — please do not merge. Two things gate it.

🔸 Decision needed — base drift / directory rename. big-agents is ~57 commits ahead of this lane's base and renamed services/agent/ to services/runner/ (#4968). This lane is built on services/agent/ paths, so it does not rebase cleanly (file-location conflicts plus content conflicts in 3 files where debug logs were also stripped). Options:

  • (a) migrate this lane's changes onto services/runner/ and resolve the 3 conflicts, or
  • (b) rebuild the changes fresh on current big-agents.

The work and tests are done, so (a) is mostly mechanical. Recommendation: (a). Which do you want?

Codex (xhigh) review — 3 fixes before merge:

  1. (P1) Abort is not wired end-to-end. The MCP tools/call handler calls runResolvedTool without the server abort signal, and teardown closes the MCP server without aborting mcpAbort, so an in-flight non-client call can keep running after a park or teardown and then settle against a destroyed response. (tool-mcp-http.ts:205/378, sandbox_agent.ts:716)
  2. (P2) Batch handling runs side-effecting calls even when one parks. Promise.all over a JSON-RPC batch can execute a sibling gateway/callback tool before the batch is aborted. Short-circuit before launching the others, or reject mixed batches. (tool-mcp-http.ts:313)
  3. (P2) Resume-store collision. A client tool whose real browser output is { approved: true } gets misrouted into the permission store rather than the client-output store. (responder.ts:425/488)

Verified OK by Codex: the non-Pi Daytona guard scope, the empty-schema accessor fix, and the Pi path still routing through the shared seam.

Plan: fix the 3 findings during the services/runner migration, then run live QA (the Pi and Claude request_connection round-trips), then mark this ready for review.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
services/agent/src/responder.ts (1)

388-436: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Duplicate full-history scan: extractApprovalDecisions and extractClientToolOutputs each rebuild the same indexes.

Both functions independently call buildCallShapeIndex(request) and re-iterate toolResultBlocks(request) over the entire message history. Per sandbox_agent.ts, they're always invoked back-to-back (extractApprovalDecisions(request) then extractClientToolOutputs(request)), so every turn pays for two full passes over request.messages (which grows with conversation length) instead of one.

Consider merging them into a single pass that builds callShapeById once and populates both the decisions map and the outputs map from one iteration over toolResultBlocks(request).

♻️ Sketch of a combined extraction
-export function extractApprovalDecisions(
-  request: AgentRunRequest,
-): Map<string, unknown> {
-  const decisions = new Map<string, unknown>();
-  const callShapeById = buildCallShapeIndex(request);
-  ...
-  for (const block of toolResultBlocks(request)) {
-    const decision = approvalDecisionOf(block);
-    if (decision === undefined) continue;
-    const argsKey = coldReplayKey(block, callShapeById);
-    if (argsKey) decisions.set(argsKey, decision);
-  }
-  return decisions;
-}
-
-export function extractClientToolOutputs(
-  request: AgentRunRequest,
-): Map<string, unknown[]> {
-  const outputs = new Map<string, unknown[]>();
-  const callShapeById = buildCallShapeIndex(request);
-  for (const block of toolResultBlocks(request)) {
-    if (approvalDecisionOf(block) !== undefined) continue;
-    const argsKey = coldReplayKey(block, callShapeById);
-    if (!argsKey) continue;
-    const list = outputs.get(argsKey) ?? [];
-    list.push(block.output);
-    outputs.set(argsKey, list);
-  }
-  return outputs;
-}
+export function extractHitlStores(request: AgentRunRequest): {
+  decisions: Map<string, unknown>;
+  clientOutputs: Map<string, unknown[]>;
+} {
+  const decisions = new Map<string, unknown>();
+  const clientOutputs = new Map<string, unknown[]>();
+  const callShapeById = buildCallShapeIndex(request);
+  for (const block of toolResultBlocks(request)) {
+    const argsKey = coldReplayKey(block, callShapeById);
+    if (!argsKey) continue;
+    const decision = approvalDecisionOf(block);
+    if (decision !== undefined) {
+      decisions.set(argsKey, decision);
+      continue;
+    }
+    const list = clientOutputs.get(argsKey) ?? [];
+    list.push(block.output);
+    clientOutputs.set(argsKey, list);
+  }
+  return { decisions, clientOutputs };
+}
services/agent/tests/unit/sandbox-agent-orchestration.test.ts (1)

357-372: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate union member in the cast.

Lines 369 repeat the identical { onClientTool?: unknown; onPark?: unknown } member twice in the union type cast for relayArgs[6]. This is redundant and looks like a copy/paste artifact; it doesn't break the test but should be deduplicated.

🧹 Proposed fix to remove the duplicate union member
     const relay = relayArgs[6] as
       | { onClientTool?: unknown; onPark?: unknown }
-      | { onClientTool?: unknown; onPark?: unknown }
       | undefined;

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7b014c75-499d-49a9-b4ef-6162ec928023

📥 Commits

Reviewing files that changed from the base of the PR and between c382a73 and 097d14e.

📒 Files selected for processing (28)
  • docs/design/agent-workflows/documentation/ground-truth.md
  • docs/design/agent-workflows/documentation/tools.md
  • docs/design/agent-workflows/interfaces/cross-service/runner-to-mcp-server.md
  • docs/design/agent-workflows/interfaces/in-service/tool-models-and-resolution.md
  • docs/design/agent-workflows/interfaces/public-edge/agent-config-schema.md
  • services/agent/src/engines/sandbox_agent.ts
  • services/agent/src/engines/sandbox_agent/client-tools.ts
  • services/agent/src/engines/sandbox_agent/mcp.ts
  • services/agent/src/engines/sandbox_agent/permissions.ts
  • services/agent/src/engines/sandbox_agent/pi-assets.ts
  • services/agent/src/engines/sandbox_agent/run-plan.ts
  • services/agent/src/extensions/agenta.ts
  • services/agent/src/protocol.ts
  • services/agent/src/responder.ts
  • services/agent/src/tools/dispatch.ts
  • services/agent/src/tools/mcp-bridge.ts
  • services/agent/src/tools/public-spec.ts
  • services/agent/src/tools/relay.ts
  • services/agent/src/tools/spec-schema.ts
  • services/agent/src/tools/tool-mcp-http.ts
  • services/agent/tests/unit/client-tools.test.ts
  • services/agent/tests/unit/responder.test.ts
  • services/agent/tests/unit/sandbox-agent-orchestration.test.ts
  • services/agent/tests/unit/sandbox-agent-pi-assets.test.ts
  • services/agent/tests/unit/sandbox-agent-run-plan.test.ts
  • services/agent/tests/unit/spec-schema.test.ts
  • services/agent/tests/unit/tool-bridge.test.ts
  • services/agent/tests/unit/tool-relay-permission.test.ts

Comment on lines +62 to +75
const argsKey = parkedCallKey(name, u.rawInput);
if (argsKey && !byArgsKey.has(argsKey)) byArgsKey.set(argsKey, toolCallId);
if (name && !byName.has(name)) byName.set(name, toolCallId);
},
lookup(toolName, input) {
const argsKey = parkedCallKey(toolName, input);
if (argsKey) {
const hit = byArgsKey.get(argsKey);
if (hit) return hit;
}
if (toolName) {
const hit = byName.get(toolName);
if (hit) return hit;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Avoid correlating by bare tool name after an exact-args miss.

Line 72 falls back to the first call with the same name even when parkedCallKey(toolName, input) missed. That defeats the fail-closed correlation contract and can attach the client_tool interaction to the wrong ACP tool-call bubble for repeated/different request_connection calls. Keep the name fallback only for unambiguous records that lacked raw input, and return undefined on explicit-args mismatches.

Suggested direction
 export function createToolCallCorrelationIndex(): ToolCallCorrelationIndex {
   const byArgsKey = new Map<string, string>();
-  const byName = new Map<string, string>();
+  const byNameFallback = new Map<string, { id: string; ambiguous: boolean }>();
+  const namesWithExplicitArgs = new Set<string>();
   return {
@@
-      const argsKey = parkedCallKey(name, u.rawInput);
+      const hasExplicitArgs = u.rawInput !== undefined && u.rawInput !== null;
+      if (name && hasExplicitArgs) namesWithExplicitArgs.add(name);
+      const argsKey = parkedCallKey(name, u.rawInput);
       if (argsKey && !byArgsKey.has(argsKey)) byArgsKey.set(argsKey, toolCallId);
-      if (name && !byName.has(name)) byName.set(name, toolCallId);
+      if (name) {
+        const existing = byNameFallback.get(name);
+        if (!existing) byNameFallback.set(name, { id: toolCallId, ambiguous: false });
+        else existing.ambiguous = true;
+      }
@@
       if (toolName) {
-        const hit = byName.get(toolName);
-        if (hit) return hit;
+        if (namesWithExplicitArgs.has(toolName)) return undefined;
+        const fallback = byNameFallback.get(toolName);
+        if (fallback && !fallback.ambiguous) return fallback.id;
       }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const argsKey = parkedCallKey(name, u.rawInput);
if (argsKey && !byArgsKey.has(argsKey)) byArgsKey.set(argsKey, toolCallId);
if (name && !byName.has(name)) byName.set(name, toolCallId);
},
lookup(toolName, input) {
const argsKey = parkedCallKey(toolName, input);
if (argsKey) {
const hit = byArgsKey.get(argsKey);
if (hit) return hit;
}
if (toolName) {
const hit = byName.get(toolName);
if (hit) return hit;
}
const byArgsKey = new Map<string, string>();
const byNameFallback = new Map<string, { id: string; ambiguous: boolean }>();
const namesWithExplicitArgs = new Set<string>();
return {
...
const hasExplicitArgs = u.rawInput !== undefined && u.rawInput !== null;
if (name && hasExplicitArgs) namesWithExplicitArgs.add(name);
const argsKey = parkedCallKey(name, u.rawInput);
if (argsKey && !byArgsKey.has(argsKey)) byArgsKey.set(argsKey, toolCallId);
if (name) {
const existing = byNameFallback.get(name);
if (!existing) byNameFallback.set(name, { id: toolCallId, ambiguous: false });
else existing.ambiguous = true;
}
},
lookup(toolName, input) {
const argsKey = parkedCallKey(toolName, input);
if (argsKey) {
const hit = byArgsKey.get(argsKey);
if (hit) return hit;
}
if (toolName) {
if (namesWithExplicitArgs.has(toolName)) return undefined;
const fallback = byNameFallback.get(toolName);
if (fallback && !fallback.ambiguous) return fallback.id;
}

Comment thread services/agent/src/responder.ts Outdated
Comment on lines +483 to +499
/**
* A parked call reply. Permission responses use `{ approved: boolean }`; client tools carry their
* real structured `output`.
* An approval reply uses an `{ approved: boolean }` envelope (the Vercel adapter's
* `_approval_response_blocks` shape), unique to a permission response. Returns `"allow"`/`"deny"`
* for one, or `undefined` for any other tool_result (a real browser/client output).
*/
function parkedCallResultOf(block: ContentBlock): { found: boolean; output?: unknown } {
if (!block || block.type !== "tool_result") return { found: false };
function approvalDecisionOf(block: ContentBlock): PermissionDecision | undefined {
if (!block || block.type !== "tool_result") return undefined;
const output = block.output;
if (
output &&
typeof output === "object" &&
!Array.isArray(output) &&
typeof (output as { approved?: unknown }).approved === "boolean"
) {
return {
found: true,
output: (output as { approved: boolean }).approved ? "allow" : "deny",
};
return (output as { approved: boolean }).approved ? "allow" : "deny";
}
return { found: true, output };
return undefined;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Approval vs. client-output discriminator relies on a fragile value-shape heuristic.

approvalDecisionOf classifies a tool_result as an approval reply purely by checking whether output is an object with a boolean approved field. This is the exact class of problem the FIFO split was built to fix for the literal "allow"/"deny" string case (per the docstring on ClientToolOutputs, Line 144-147), but it isn't fixed here: a genuine client-tool output that happens to include an approved: boolean field (e.g. a future request_connection/commit_revision-style tool, or any new client tool whose result schema includes that key) will be:

  • excluded from extractClientToolOutputs (Line 425: if (approvalDecisionOf(block) !== undefined) continue;),
  • stored only in decisions (consulted exclusively by onPermission),
  • never found by onClientTool/takeClientOutput (which only reads clientOutputs),

causing that call to permanently re-park (or deny, headless) even though the browser already produced a real result. Since the only signal available here is the raw output value (no explicit kind discriminator travels with the tool_result block), this collision can't be ruled out for tools whose output schema isn't controlled by this code.

Consider carrying an explicit discriminator on the wire (e.g., a ContentBlock flag distinguishing an approval-envelope reply from a generic tool output) rather than inferring it from the output's shape.

Comment on lines +314 to +321
const responses = await Promise.all(
parsed.map((m) =>
handle(m, specByName, specs, relayDir, clientToolRelay, log),
),
);
// A parked client tool in the batch aborts the whole request (no result for any).
if (responses.some((r) => r === MCP_PARKED)) {
abortParked(res);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not start side-effecting batch tools before checking for a parked client call.

Promise.all starts every batch item concurrently. If one client tool parks, Lines 319-321 abort the whole response, but any callback/code tool already started may have mutated state and then receive no JSON-RPC result, making a retry duplicate the side effect. Handle client-tool calls first and short-circuit on MCP_PARKED before dispatching executable tools, or reject mixed batches.

Suggested direction
-          const responses = await Promise.all(
-            parsed.map((m) =>
-              handle(m, specByName, specs, relayDir, clientToolRelay, log),
-            ),
-          );
-          // A parked client tool in the batch aborts the whole request (no result for any).
-          if (responses.some((r) => r === MCP_PARKED)) {
-            abortParked(res);
-            return;
-          }
+          const messages = parsed as any[];
+          const clientMessages: any[] = [];
+          const otherMessages: any[] = [];
+          for (const m of messages) {
+            const spec =
+              m?.method === "tools/call" ? specByName.get(m?.params?.name) : undefined;
+            ((spec?.kind ?? "callback") === "client"
+              ? clientMessages
+              : otherMessages
+            ).push(m);
+          }
+
+          const responses: unknown[] = [];
+          for (const m of [...clientMessages, ...otherMessages]) {
+            const response = await handle(
+              m,
+              specByName,
+              specs,
+              relayDir,
+              clientToolRelay,
+              log,
+            );
+            if (response === MCP_PARKED) {
+              abortParked(res);
+              return;
+            }
+            responses.push(response);
+          }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const responses = await Promise.all(
parsed.map((m) =>
handle(m, specByName, specs, relayDir, clientToolRelay, log),
),
);
// A parked client tool in the batch aborts the whole request (no result for any).
if (responses.some((r) => r === MCP_PARKED)) {
abortParked(res);
const messages = parsed as any[];
const clientMessages: any[] = [];
const otherMessages: any[] = [];
for (const m of messages) {
const spec =
m?.method === "tools/call" ? specByName.get(m?.params?.name) : undefined;
((spec?.kind ?? "callback") === "client"
? clientMessages
: otherMessages
).push(m);
}
const responses: unknown[] = [];
for (const m of [...clientMessages, ...otherMessages]) {
const response = await handle(
m,
specByName,
specs,
relayDir,
clientToolRelay,
log,
);
if (response === MCP_PARKED) {
abortParked(res);
return;
}
responses.push(response);
}

Comment on lines +24 to +43
RELAY_POLL_MAX_MS,
RELAY_POLL_MS,
relayPollDelayMs,
resolvePermission,
startToolRelay,
type RelayResponse,
} from "../../src/tools/relay.ts";
import type { ResolvedToolSpec } from "../../src/protocol.ts";

describe("relayPollDelayMs (idle backoff)", () => {
it("polls at the base rate while busy, then backs off geometrically up to the cap", () => {
// No idle polls -> base rate.
assert.equal(relayPollDelayMs(0), RELAY_POLL_MS);
assert.equal(relayPollDelayMs(4), RELAY_POLL_MS, "still base before the grow threshold");
// After the threshold the delay grows but never exceeds the cap.
assert.ok(relayPollDelayMs(5) > RELAY_POLL_MS, "grows once idle");
assert.ok(relayPollDelayMs(5) <= RELAY_POLL_MAX_MS);
assert.equal(relayPollDelayMs(100), RELAY_POLL_MAX_MS, "saturates at the cap");
// Monotonic non-decreasing.
assert.ok(relayPollDelayMs(6) >= relayPollDelayMs(5));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Derive the threshold assertions from the exported threshold constant.

The test assumes the default grow threshold is 5, but RELAY_POLL_IDLE_GROW_AFTER is environment-configurable. Import it and derive the before/at/after checks so the test matches the configured module state.

Suggested test fix
   RELAY_POLL_MAX_MS,
   RELAY_POLL_MS,
+  RELAY_POLL_IDLE_GROW_AFTER,
   relayPollDelayMs,
@@
   it("polls at the base rate while busy, then backs off geometrically up to the cap", () => {
+    const beforeGrow = Math.max(0, RELAY_POLL_IDLE_GROW_AFTER - 1);
+    const growAt = RELAY_POLL_IDLE_GROW_AFTER;
     // No idle polls -> base rate.
     assert.equal(relayPollDelayMs(0), RELAY_POLL_MS);
-    assert.equal(relayPollDelayMs(4), RELAY_POLL_MS, "still base before the grow threshold");
+    assert.equal(relayPollDelayMs(beforeGrow), RELAY_POLL_MS, "still base before the grow threshold");
     // After the threshold the delay grows but never exceeds the cap.
-    assert.ok(relayPollDelayMs(5) > RELAY_POLL_MS, "grows once idle");
-    assert.ok(relayPollDelayMs(5) <= RELAY_POLL_MAX_MS);
+    assert.ok(relayPollDelayMs(growAt) > RELAY_POLL_MS, "grows once idle");
+    assert.ok(relayPollDelayMs(growAt) <= RELAY_POLL_MAX_MS);
     assert.equal(relayPollDelayMs(100), RELAY_POLL_MAX_MS, "saturates at the cap");
     // Monotonic non-decreasing.
-    assert.ok(relayPollDelayMs(6) >= relayPollDelayMs(5));
+    assert.ok(relayPollDelayMs(growAt + 1) >= relayPollDelayMs(growAt));
   });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
RELAY_POLL_MAX_MS,
RELAY_POLL_MS,
relayPollDelayMs,
resolvePermission,
startToolRelay,
type RelayResponse,
} from "../../src/tools/relay.ts";
import type { ResolvedToolSpec } from "../../src/protocol.ts";
describe("relayPollDelayMs (idle backoff)", () => {
it("polls at the base rate while busy, then backs off geometrically up to the cap", () => {
// No idle polls -> base rate.
assert.equal(relayPollDelayMs(0), RELAY_POLL_MS);
assert.equal(relayPollDelayMs(4), RELAY_POLL_MS, "still base before the grow threshold");
// After the threshold the delay grows but never exceeds the cap.
assert.ok(relayPollDelayMs(5) > RELAY_POLL_MS, "grows once idle");
assert.ok(relayPollDelayMs(5) <= RELAY_POLL_MAX_MS);
assert.equal(relayPollDelayMs(100), RELAY_POLL_MAX_MS, "saturates at the cap");
// Monotonic non-decreasing.
assert.ok(relayPollDelayMs(6) >= relayPollDelayMs(5));
RELAY_POLL_MAX_MS,
RELAY_POLL_MS,
RELAY_POLL_IDLE_GROW_AFTER,
relayPollDelayMs,
resolvePermission,
startToolRelay,
type RelayResponse,
} from "../../src/tools/relay.ts";
import type { ResolvedToolSpec } from "../../src/protocol.ts";
describe("relayPollDelayMs (idle backoff)", () => {
it("polls at the base rate while busy, then backs off geometrically up to the cap", () => {
const beforeGrow = Math.max(0, RELAY_POLL_IDLE_GROW_AFTER - 1);
const growAt = RELAY_POLL_IDLE_GROW_AFTER;
// No idle polls -> base rate.
assert.equal(relayPollDelayMs(0), RELAY_POLL_MS);
assert.equal(relayPollDelayMs(beforeGrow), RELAY_POLL_MS, "still base before the grow threshold");
// After the threshold the delay grows but never exceeds the cap.
assert.ok(relayPollDelayMs(growAt) > RELAY_POLL_MS, "grows once idle");
assert.ok(relayPollDelayMs(growAt) <= RELAY_POLL_MAX_MS);
assert.equal(relayPollDelayMs(100), RELAY_POLL_MAX_MS, "saturates at the cap");
// Monotonic non-decreasing.
assert.ok(relayPollDelayMs(growAt + 1) >= relayPollDelayMs(growAt));

mmabrouk added a commit that referenced this pull request Jul 4, 2026
…honest skip log

Address review on #5047: the F1 gate now keys on sandbox !== "local" (isRemoteSandbox)
so an unknown or future remote provider (E2B et al.) fails closed with the same loud
error instead of silently re-opening the zero-tools drop; reword the message to say
"remote sandbox". Replace the dead isPi-guarded file-relay log in mcp.ts (unreachable
past the isPi early-return, per Copilot) with an honest "skipped the loopback
advertisement" log that never claims a delivery that is not happening. Add a
fail-closed unit test for an unknown provider (sandbox=e2b). Client-tool exemption
stays; #4985 owns tightening it when client tools move onto the MCP channel.

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
@mmabrouk mmabrouk mentioned this pull request Jul 4, 2026
12 tasks
mmabrouk added 5 commits July 4, 2026 19:22
…widgets

TS-only RenderHint member a client tool (request_connection) stamps so the
frontend renders the OAuth/API-key connect dialog when the call pauses.
wire.py does not pin RenderHint, so no wire change.

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…isement

Collapse the byte-identical camel/snake input_schema accessor + required-field
walk (dispatch.ts, relay.ts, extensions/agenta.ts) into tools/spec-schema.ts.
Rename PublicToolSpec -> AdvertisedToolSpec (advertisedToolSpecs) to say what
it is: the advertisement shape, client tools included.

Fix: the internal MCP channel's tools/list read only camelCase s.inputSchema,
so every snake-case platform-catalog tool (request_connection, commit_revision)
advertised an EMPTY schema to Claude. It now reads through specInputSchema.

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
New engines/sandbox_agent/client-tools.ts owns the client-tool pause once for
both delivery paths: buildClientToolRelay (gate descriptor -> responder verdict
-> latch/markPausedToolCall/emit/recordPendingInteraction), the single
client_tool interaction payload (emitClientToolInteraction), and the ACP
tool-call correlation index Claude's MCP delivery will use. The engine's inline
onClientTool at startToolRelay now consumes the seam (behavior-preserving for
Pi).

responder.ts splits the replayed-conversation store: extractApprovalDecisions
keeps ONLY {approved} envelopes; new extractClientToolOutputs collects raw
browser outputs as a FIFO list per name+args key. Fixes two live bugs: a client
output literally "allow"/"deny" collided with a permission decision in the
shared map, and two identical client calls overwrote each other's stored
output. Peek-vs-take semantics preserved (ACP gate peeks, relay consumes).

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…annel

Claude never saw client tools (request_connection): the internal agenta-tools
channel filtered them out of tools/list, so the model could not call them and
the browser widget never rendered — a silent drop, Pi-only feature. Now:

- mcp-bridge/tool-mcp-http advertise client tools when a clientToolRelay is
  wired (local Claude) and PAUSE a client tools/call through the shared seam:
  the handler returns the MCP_PARKED sentinel, the listener destroys the socket
  with no JSON-RPC body (a result would let Claude settle the call and clobber
  the pending widget), and the turn ends paused. Required args are validated
  first so an under-specified call is a normal MCP tool error (model retries).
- engine: deferred clientToolRelay ref (the MCP server is built before the
  responder exists), ACP tool-call correlation index recorded from the event
  stream (the paused widget attaches to Claude's real tool bubble and its late
  frames are suppressed), and an mcpAbort AbortController fired from the pause
  controller's destroy path + the finally so no in-flight tools/call settles
  after the turn ends.
- run-plan (#5047 gate tightened, as its comment assigned to #4985): a non-Pi
  remote-sandbox run now refuses on ANY custom tool (toolSpecs), client kind
  included — client tools ride the MCP channel now, so on a remote sandbox
  they are exactly as undeliverable as gateway tools. Flipped the
  claude x daytona x client-only test to assert refusal; research.md §4
  updated to match.

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
The relay loop polled host.list(relayDir) every 300 ms for the whole turn — on
Daytona a remote ls exec ~3x/s, now also for client-only runs that spend the
turn waiting on a browser-fulfilled pause with no other tool traffic. After 5
consecutive idle polls the delay grows geometrically to a 1.5 s cap
(AGENTA_AGENT_TOOLS_RELAY_POLLING_MAX / _IDLE_GROW_AFTER) and resets on any new
request file, so a real tool call is still picked up promptly.

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…c docs

Sync the five docs that described the old behavior (client tools filtered out
of the internal channel's tools/list): ground-truth + tools.md now describe the
shared pause seam and both delivery paths; runner-to-mcp-server.md documents
the tools/list advertisement (specInputSchema), the tools/call pause
(no-result + abort), the tightened REMOTE_TOOLS gate, and the new owned-by
files; the two interface pages update the client-kind comment.

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
mmabrouk added 2 commits July 4, 2026 20:15
Correctness fixes from the two review rounds (Codex xhigh + internal) on the
client-tools recut:

- Correlation index: normalize the ACP title's mcp__<server>__ prefix on
  record() so a bare-name lookup() hits (suppression + widget attachment were
  inert live: Claude titles internal-MCP tools mcp__agenta-tools__<name>).
- Consume-on-match: a lookup() consumes its matched id (per-key FIFO, one
  shared entry per call), so a duplicate identical call correlates to its OWN
  ACP id instead of re-homing the first, already-settled one.
- Drop the ACP kind fallback in record(): kind is a category, not a name.
- Single correlation owner: buildClientToolRelay resolves the id once;
  emitClientToolInteraction no longer takes an index.
- Rename MCP_PARKED -> MCP_PAUSED (pause/pendingApproval vocabulary).
- Move the ClientToolRelay contract to a pure type module
  (tools/client-tool-relay.ts); relay.ts keeps a compat re-export for now
  (its own hunks are entangled with a co-session's claimed relay.ts edits).
- Comment honesty: the seam owns the RELAY/MCP emit while acp-interactions
  owns the ACP-gate emit; the MCP abort destroys sockets, not executions
  (signal-threading into dispatch is a known follow-up).
- New tests: prefixed-title lookup, FIFO consumption, duplicate same-id POST
  after a pause is also aborted and never answered.

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…ss denied)

_rules_from_tool_specs skipped kind == "client" from the days when client
tools never crossed the agenta-tools channel. On this branch they DO: the
runner advertises them over the internal MCP server and pauses tools/call for
the browser. Without a rule, Claude's own permission gate fires first and the
call falls to the ACP path, bypassing the new pause path.

Client tools now render mcp__agenta-tools__<name> rules: deny when the
effective permission is deny, otherwise allow — including for explicit "ask"
and for unset. The runner-side seam is the authoritative gate for a client
tool: pausing for the browser IS the ask flow, so a Claude-side ask would only
duplicate that gate in a worse place. Executable (callback/code) behavior is
unchanged.

Tests: allow-by-default, allow-on-ask, deny-on-deny, plain-dict client renders
its rule; non-client assertions untouched.

Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
@mmabrouk mmabrouk force-pushed the feat/claude-client-tools branch from 097d14e to 51f0e3f Compare July 4, 2026 18:19
@mmabrouk mmabrouk changed the title feat(agent): deliver client tools to Claude + consolidate the client-tool seam feat(agent): deliver client tools to Claude over the internal MCP channel Jul 4, 2026

@mmabrouk mmabrouk left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review guide (Claude). Five inline comments below walk the load-bearing spots in reading order. Everything else in the diff is mechanical fallout (renames, extracted helpers, doc sync). The PR body has the full recut story and the follow-up list.

* `agenta-tools`, contains no `__`; a server name that did would truncate ambiguously).
*/
function bareToolName(title: string): string {
return title.replace(/^mcp__.+?__/, "");

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. The correlation index, and the bug the internal review caught.

This index maps a tool call back to the ACP frame Claude emitted for it, so the pause can suppress that frame's failure noise (F-024) and the connect widget attaches to the right bubble. The original recut recorded Claude's raw titles (mcp__agenta-tools__request_connection) but looked up bare names — every lookup missed, silently, because a UUID fallback exists. This strip-on-record fix plus consume-on-match (a matched id is retired, so a duplicate identical call correlates to its own frame, not the first one's) are both pinned by tests that feed the prefixed titles Claude actually sends.


/** A client-tool fulfillment output for this exact call, consumed on first take. */
/** The next FIFO client-tool output for this exact call, consumed on take. */
takeClientOutput(gate: GateDescriptor): { found: boolean; output?: unknown } {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. The FIFO client-output store.

Browser answers for client tools now live in their own per-key FIFO, separate from approval decisions. Two live bugs die here: an answer that is literally the string "allow" was misread as a permission decision, and two identical calls overwrote each other's answer. peek (used by the ACP permission gate) never advances the cursor; take (used when the tool call is actually fulfilled) does. The store is rebuilt from history each request, so the cursor resets per turn and the answer survives to the resume turn — the internal review specifically checked this and found it sound.

* interaction (`onClientTool`) and the handler then ends the turn (`onPause` -> the engine's
* pause controller), so the turn ends `paused`.
*/
const MCP_PAUSED = Symbol("mcp-paused");

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3. Pause by socket destroy.

A JSON-RPC error would settle the tool call, so the handler answers a pausing client tool with nothing: no result, socket destroyed. Claude cannot settle the call and the turn ends paused. This depends on the MCP SDK not retrying POSTs, which Codex verified against the SDK source and a duplicate-POST test now pins at the handler level. Known honest limitation (comment says so): destroying the socket does not cancel a server-side execution already in flight; that AbortSignal threading is a tracked follow-up.

// client tools now ride the same internal MCP channel on local Claude (advertised + paused in
// `tools/call`), so on a remote sandbox they are exactly as undeliverable as gateway tools —
// the model would never see or be able to call them.
if (!isPi && isRemoteSandbox && toolSpecs.length > 0) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4. The gate tightening you pre-approved on #5047.

Client tools now ride the MCP channel, and that channel is unreachable from a remote sandbox, so the client-tool exemption would have reopened the silent drop. The gate now counts all custom tools for non-Pi remote runs. Pi stays exempt: its in-sandbox extension delivers client tools over the file relay on Daytona today. The old claude-x-daytona-x-client-only test flipped to assert refusal.

spec.permission, spec.read_only, permission_default
)
rule = f"mcp__{INTERNAL_TOOL_MCP_SERVER}__{spec.name}"
if spec.kind == "client":

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5. The Codex blocker: Claude needs permission rules for client tools.

Before this fix, client tools got no mcp__agenta-tools__<name> rule, so Claude raised its own generic permission prompt before tools/call and the new pause path never ran. Now they render allow unless explicitly denied. The reasoning for allow-even-on-ask: the runner-side gate is authoritative, and for a client tool its pause IS the ask — the browser answers it. A Claude-side ask would duplicate the same gate in a worse place. Deny still renders deny, so a policy-blocked tool is refused before a round trip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request needs-review Agent updated; awaiting Mahmoud's review refactoring A code change that neither fixes a bug nor adds a feature size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant