feat(agent): deliver client tools to Claude over the internal MCP channel#4985
feat(agent): deliver client tools to Claude over the internal MCP channel#4985mmabrouk wants to merge 8 commits into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughWalkthroughThis PR adds a shared client-tool "park/resume" relay used by both Pi and Claude tool-delivery paths, extends the internal loopback MCP channel to advertise and park browser-fulfilled ChangesClient-tool relay, MCP delivery, and Daytona gating
Sequence Diagram(s)sequenceDiagram
participant Claude
participant InternalMcpServer
participant ClientToolRelay
participant Responder
participant Browser
Claude->>InternalMcpServer: tools/call (client tool)
InternalMcpServer->>ClientToolRelay: onClientTool(request)
ClientToolRelay->>Responder: onClientTool(request)
Responder-->>ClientToolRelay: "park" (no stored output)
ClientToolRelay->>ClientToolRelay: emitClientToolInteraction
InternalMcpServer->>InternalMcpServer: abort HTTP response (MCP_PARKED)
Browser->>Responder: deliver browser tool output (next turn)
Claude->>InternalMcpServer: re-call tools/call
InternalMcpServer->>ClientToolRelay: onClientTool(request)
ClientToolRelay->>Responder: takeClientOutput(coldReplayKey)
Responder-->>ClientToolRelay: resolved output
ClientToolRelay-->>InternalMcpServer: output
InternalMcpServer-->>Claude: JSON-RPC result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
🤖 The AI agent says: What to review — two distinct things:
Also flagged in the description: live QA is pending the sidecar deploy, and the inventory index-row table cell couldn't be committed (GitButler hunk-lock against the sibling unassigned |
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
🤖 The AI agent says: Status: implemented, 400 tests green, but not merge-ready yet — please do not merge. Two things gate it. 🔸 Decision needed — base drift / directory rename.
The work and tests are done, so (a) is mostly mechanical. Recommendation: (a). Which do you want? Codex (xhigh) review — 3 fixes before merge:
Verified OK by Codex: the non-Pi Daytona guard scope, the empty-schema accessor fix, and the Pi path still routing through the shared seam. Plan: fix the 3 findings during the |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
services/agent/src/responder.ts (1)
388-436: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winDuplicate full-history scan:
extractApprovalDecisionsandextractClientToolOutputseach rebuild the same indexes.Both functions independently call
buildCallShapeIndex(request)and re-iteratetoolResultBlocks(request)over the entire message history. Persandbox_agent.ts, they're always invoked back-to-back (extractApprovalDecisions(request)thenextractClientToolOutputs(request)), so every turn pays for two full passes overrequest.messages(which grows with conversation length) instead of one.Consider merging them into a single pass that builds
callShapeByIdonce and populates both thedecisionsmap and theoutputsmap from one iteration overtoolResultBlocks(request).♻️ Sketch of a combined extraction
-export function extractApprovalDecisions( - request: AgentRunRequest, -): Map<string, unknown> { - const decisions = new Map<string, unknown>(); - const callShapeById = buildCallShapeIndex(request); - ... - for (const block of toolResultBlocks(request)) { - const decision = approvalDecisionOf(block); - if (decision === undefined) continue; - const argsKey = coldReplayKey(block, callShapeById); - if (argsKey) decisions.set(argsKey, decision); - } - return decisions; -} - -export function extractClientToolOutputs( - request: AgentRunRequest, -): Map<string, unknown[]> { - const outputs = new Map<string, unknown[]>(); - const callShapeById = buildCallShapeIndex(request); - for (const block of toolResultBlocks(request)) { - if (approvalDecisionOf(block) !== undefined) continue; - const argsKey = coldReplayKey(block, callShapeById); - if (!argsKey) continue; - const list = outputs.get(argsKey) ?? []; - list.push(block.output); - outputs.set(argsKey, list); - } - return outputs; -} +export function extractHitlStores(request: AgentRunRequest): { + decisions: Map<string, unknown>; + clientOutputs: Map<string, unknown[]>; +} { + const decisions = new Map<string, unknown>(); + const clientOutputs = new Map<string, unknown[]>(); + const callShapeById = buildCallShapeIndex(request); + for (const block of toolResultBlocks(request)) { + const argsKey = coldReplayKey(block, callShapeById); + if (!argsKey) continue; + const decision = approvalDecisionOf(block); + if (decision !== undefined) { + decisions.set(argsKey, decision); + continue; + } + const list = clientOutputs.get(argsKey) ?? []; + list.push(block.output); + clientOutputs.set(argsKey, list); + } + return { decisions, clientOutputs }; +}services/agent/tests/unit/sandbox-agent-orchestration.test.ts (1)
357-372: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueDuplicate union member in the cast.
Lines 369 repeat the identical
{ onClientTool?: unknown; onPark?: unknown }member twice in the union type cast forrelayArgs[6]. This is redundant and looks like a copy/paste artifact; it doesn't break the test but should be deduplicated.🧹 Proposed fix to remove the duplicate union member
const relay = relayArgs[6] as | { onClientTool?: unknown; onPark?: unknown } - | { onClientTool?: unknown; onPark?: unknown } | undefined;
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 7b014c75-499d-49a9-b4ef-6162ec928023
📒 Files selected for processing (28)
docs/design/agent-workflows/documentation/ground-truth.mddocs/design/agent-workflows/documentation/tools.mddocs/design/agent-workflows/interfaces/cross-service/runner-to-mcp-server.mddocs/design/agent-workflows/interfaces/in-service/tool-models-and-resolution.mddocs/design/agent-workflows/interfaces/public-edge/agent-config-schema.mdservices/agent/src/engines/sandbox_agent.tsservices/agent/src/engines/sandbox_agent/client-tools.tsservices/agent/src/engines/sandbox_agent/mcp.tsservices/agent/src/engines/sandbox_agent/permissions.tsservices/agent/src/engines/sandbox_agent/pi-assets.tsservices/agent/src/engines/sandbox_agent/run-plan.tsservices/agent/src/extensions/agenta.tsservices/agent/src/protocol.tsservices/agent/src/responder.tsservices/agent/src/tools/dispatch.tsservices/agent/src/tools/mcp-bridge.tsservices/agent/src/tools/public-spec.tsservices/agent/src/tools/relay.tsservices/agent/src/tools/spec-schema.tsservices/agent/src/tools/tool-mcp-http.tsservices/agent/tests/unit/client-tools.test.tsservices/agent/tests/unit/responder.test.tsservices/agent/tests/unit/sandbox-agent-orchestration.test.tsservices/agent/tests/unit/sandbox-agent-pi-assets.test.tsservices/agent/tests/unit/sandbox-agent-run-plan.test.tsservices/agent/tests/unit/spec-schema.test.tsservices/agent/tests/unit/tool-bridge.test.tsservices/agent/tests/unit/tool-relay-permission.test.ts
| const argsKey = parkedCallKey(name, u.rawInput); | ||
| if (argsKey && !byArgsKey.has(argsKey)) byArgsKey.set(argsKey, toolCallId); | ||
| if (name && !byName.has(name)) byName.set(name, toolCallId); | ||
| }, | ||
| lookup(toolName, input) { | ||
| const argsKey = parkedCallKey(toolName, input); | ||
| if (argsKey) { | ||
| const hit = byArgsKey.get(argsKey); | ||
| if (hit) return hit; | ||
| } | ||
| if (toolName) { | ||
| const hit = byName.get(toolName); | ||
| if (hit) return hit; | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Avoid correlating by bare tool name after an exact-args miss.
Line 72 falls back to the first call with the same name even when parkedCallKey(toolName, input) missed. That defeats the fail-closed correlation contract and can attach the client_tool interaction to the wrong ACP tool-call bubble for repeated/different request_connection calls. Keep the name fallback only for unambiguous records that lacked raw input, and return undefined on explicit-args mismatches.
Suggested direction
export function createToolCallCorrelationIndex(): ToolCallCorrelationIndex {
const byArgsKey = new Map<string, string>();
- const byName = new Map<string, string>();
+ const byNameFallback = new Map<string, { id: string; ambiguous: boolean }>();
+ const namesWithExplicitArgs = new Set<string>();
return {
@@
- const argsKey = parkedCallKey(name, u.rawInput);
+ const hasExplicitArgs = u.rawInput !== undefined && u.rawInput !== null;
+ if (name && hasExplicitArgs) namesWithExplicitArgs.add(name);
+ const argsKey = parkedCallKey(name, u.rawInput);
if (argsKey && !byArgsKey.has(argsKey)) byArgsKey.set(argsKey, toolCallId);
- if (name && !byName.has(name)) byName.set(name, toolCallId);
+ if (name) {
+ const existing = byNameFallback.get(name);
+ if (!existing) byNameFallback.set(name, { id: toolCallId, ambiguous: false });
+ else existing.ambiguous = true;
+ }
@@
if (toolName) {
- const hit = byName.get(toolName);
- if (hit) return hit;
+ if (namesWithExplicitArgs.has(toolName)) return undefined;
+ const fallback = byNameFallback.get(toolName);
+ if (fallback && !fallback.ambiguous) return fallback.id;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const argsKey = parkedCallKey(name, u.rawInput); | |
| if (argsKey && !byArgsKey.has(argsKey)) byArgsKey.set(argsKey, toolCallId); | |
| if (name && !byName.has(name)) byName.set(name, toolCallId); | |
| }, | |
| lookup(toolName, input) { | |
| const argsKey = parkedCallKey(toolName, input); | |
| if (argsKey) { | |
| const hit = byArgsKey.get(argsKey); | |
| if (hit) return hit; | |
| } | |
| if (toolName) { | |
| const hit = byName.get(toolName); | |
| if (hit) return hit; | |
| } | |
| const byArgsKey = new Map<string, string>(); | |
| const byNameFallback = new Map<string, { id: string; ambiguous: boolean }>(); | |
| const namesWithExplicitArgs = new Set<string>(); | |
| return { | |
| ... | |
| const hasExplicitArgs = u.rawInput !== undefined && u.rawInput !== null; | |
| if (name && hasExplicitArgs) namesWithExplicitArgs.add(name); | |
| const argsKey = parkedCallKey(name, u.rawInput); | |
| if (argsKey && !byArgsKey.has(argsKey)) byArgsKey.set(argsKey, toolCallId); | |
| if (name) { | |
| const existing = byNameFallback.get(name); | |
| if (!existing) byNameFallback.set(name, { id: toolCallId, ambiguous: false }); | |
| else existing.ambiguous = true; | |
| } | |
| }, | |
| lookup(toolName, input) { | |
| const argsKey = parkedCallKey(toolName, input); | |
| if (argsKey) { | |
| const hit = byArgsKey.get(argsKey); | |
| if (hit) return hit; | |
| } | |
| if (toolName) { | |
| if (namesWithExplicitArgs.has(toolName)) return undefined; | |
| const fallback = byNameFallback.get(toolName); | |
| if (fallback && !fallback.ambiguous) return fallback.id; | |
| } |
| /** | ||
| * A parked call reply. Permission responses use `{ approved: boolean }`; client tools carry their | ||
| * real structured `output`. | ||
| * An approval reply uses an `{ approved: boolean }` envelope (the Vercel adapter's | ||
| * `_approval_response_blocks` shape), unique to a permission response. Returns `"allow"`/`"deny"` | ||
| * for one, or `undefined` for any other tool_result (a real browser/client output). | ||
| */ | ||
| function parkedCallResultOf(block: ContentBlock): { found: boolean; output?: unknown } { | ||
| if (!block || block.type !== "tool_result") return { found: false }; | ||
| function approvalDecisionOf(block: ContentBlock): PermissionDecision | undefined { | ||
| if (!block || block.type !== "tool_result") return undefined; | ||
| const output = block.output; | ||
| if ( | ||
| output && | ||
| typeof output === "object" && | ||
| !Array.isArray(output) && | ||
| typeof (output as { approved?: unknown }).approved === "boolean" | ||
| ) { | ||
| return { | ||
| found: true, | ||
| output: (output as { approved: boolean }).approved ? "allow" : "deny", | ||
| }; | ||
| return (output as { approved: boolean }).approved ? "allow" : "deny"; | ||
| } | ||
| return { found: true, output }; | ||
| return undefined; |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Approval vs. client-output discriminator relies on a fragile value-shape heuristic.
approvalDecisionOf classifies a tool_result as an approval reply purely by checking whether output is an object with a boolean approved field. This is the exact class of problem the FIFO split was built to fix for the literal "allow"/"deny" string case (per the docstring on ClientToolOutputs, Line 144-147), but it isn't fixed here: a genuine client-tool output that happens to include an approved: boolean field (e.g. a future request_connection/commit_revision-style tool, or any new client tool whose result schema includes that key) will be:
- excluded from
extractClientToolOutputs(Line 425:if (approvalDecisionOf(block) !== undefined) continue;), - stored only in
decisions(consulted exclusively byonPermission), - never found by
onClientTool/takeClientOutput(which only readsclientOutputs),
causing that call to permanently re-park (or deny, headless) even though the browser already produced a real result. Since the only signal available here is the raw output value (no explicit kind discriminator travels with the tool_result block), this collision can't be ruled out for tools whose output schema isn't controlled by this code.
Consider carrying an explicit discriminator on the wire (e.g., a ContentBlock flag distinguishing an approval-envelope reply from a generic tool output) rather than inferring it from the output's shape.
| const responses = await Promise.all( | ||
| parsed.map((m) => | ||
| handle(m, specByName, specs, relayDir, clientToolRelay, log), | ||
| ), | ||
| ); | ||
| // A parked client tool in the batch aborts the whole request (no result for any). | ||
| if (responses.some((r) => r === MCP_PARKED)) { | ||
| abortParked(res); |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Do not start side-effecting batch tools before checking for a parked client call.
Promise.all starts every batch item concurrently. If one client tool parks, Lines 319-321 abort the whole response, but any callback/code tool already started may have mutated state and then receive no JSON-RPC result, making a retry duplicate the side effect. Handle client-tool calls first and short-circuit on MCP_PARKED before dispatching executable tools, or reject mixed batches.
Suggested direction
- const responses = await Promise.all(
- parsed.map((m) =>
- handle(m, specByName, specs, relayDir, clientToolRelay, log),
- ),
- );
- // A parked client tool in the batch aborts the whole request (no result for any).
- if (responses.some((r) => r === MCP_PARKED)) {
- abortParked(res);
- return;
- }
+ const messages = parsed as any[];
+ const clientMessages: any[] = [];
+ const otherMessages: any[] = [];
+ for (const m of messages) {
+ const spec =
+ m?.method === "tools/call" ? specByName.get(m?.params?.name) : undefined;
+ ((spec?.kind ?? "callback") === "client"
+ ? clientMessages
+ : otherMessages
+ ).push(m);
+ }
+
+ const responses: unknown[] = [];
+ for (const m of [...clientMessages, ...otherMessages]) {
+ const response = await handle(
+ m,
+ specByName,
+ specs,
+ relayDir,
+ clientToolRelay,
+ log,
+ );
+ if (response === MCP_PARKED) {
+ abortParked(res);
+ return;
+ }
+ responses.push(response);
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const responses = await Promise.all( | |
| parsed.map((m) => | |
| handle(m, specByName, specs, relayDir, clientToolRelay, log), | |
| ), | |
| ); | |
| // A parked client tool in the batch aborts the whole request (no result for any). | |
| if (responses.some((r) => r === MCP_PARKED)) { | |
| abortParked(res); | |
| const messages = parsed as any[]; | |
| const clientMessages: any[] = []; | |
| const otherMessages: any[] = []; | |
| for (const m of messages) { | |
| const spec = | |
| m?.method === "tools/call" ? specByName.get(m?.params?.name) : undefined; | |
| ((spec?.kind ?? "callback") === "client" | |
| ? clientMessages | |
| : otherMessages | |
| ).push(m); | |
| } | |
| const responses: unknown[] = []; | |
| for (const m of [...clientMessages, ...otherMessages]) { | |
| const response = await handle( | |
| m, | |
| specByName, | |
| specs, | |
| relayDir, | |
| clientToolRelay, | |
| log, | |
| ); | |
| if (response === MCP_PARKED) { | |
| abortParked(res); | |
| return; | |
| } | |
| responses.push(response); | |
| } |
| RELAY_POLL_MAX_MS, | ||
| RELAY_POLL_MS, | ||
| relayPollDelayMs, | ||
| resolvePermission, | ||
| startToolRelay, | ||
| type RelayResponse, | ||
| } from "../../src/tools/relay.ts"; | ||
| import type { ResolvedToolSpec } from "../../src/protocol.ts"; | ||
|
|
||
| describe("relayPollDelayMs (idle backoff)", () => { | ||
| it("polls at the base rate while busy, then backs off geometrically up to the cap", () => { | ||
| // No idle polls -> base rate. | ||
| assert.equal(relayPollDelayMs(0), RELAY_POLL_MS); | ||
| assert.equal(relayPollDelayMs(4), RELAY_POLL_MS, "still base before the grow threshold"); | ||
| // After the threshold the delay grows but never exceeds the cap. | ||
| assert.ok(relayPollDelayMs(5) > RELAY_POLL_MS, "grows once idle"); | ||
| assert.ok(relayPollDelayMs(5) <= RELAY_POLL_MAX_MS); | ||
| assert.equal(relayPollDelayMs(100), RELAY_POLL_MAX_MS, "saturates at the cap"); | ||
| // Monotonic non-decreasing. | ||
| assert.ok(relayPollDelayMs(6) >= relayPollDelayMs(5)); |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Derive the threshold assertions from the exported threshold constant.
The test assumes the default grow threshold is 5, but RELAY_POLL_IDLE_GROW_AFTER is environment-configurable. Import it and derive the before/at/after checks so the test matches the configured module state.
Suggested test fix
RELAY_POLL_MAX_MS,
RELAY_POLL_MS,
+ RELAY_POLL_IDLE_GROW_AFTER,
relayPollDelayMs,
@@
it("polls at the base rate while busy, then backs off geometrically up to the cap", () => {
+ const beforeGrow = Math.max(0, RELAY_POLL_IDLE_GROW_AFTER - 1);
+ const growAt = RELAY_POLL_IDLE_GROW_AFTER;
// No idle polls -> base rate.
assert.equal(relayPollDelayMs(0), RELAY_POLL_MS);
- assert.equal(relayPollDelayMs(4), RELAY_POLL_MS, "still base before the grow threshold");
+ assert.equal(relayPollDelayMs(beforeGrow), RELAY_POLL_MS, "still base before the grow threshold");
// After the threshold the delay grows but never exceeds the cap.
- assert.ok(relayPollDelayMs(5) > RELAY_POLL_MS, "grows once idle");
- assert.ok(relayPollDelayMs(5) <= RELAY_POLL_MAX_MS);
+ assert.ok(relayPollDelayMs(growAt) > RELAY_POLL_MS, "grows once idle");
+ assert.ok(relayPollDelayMs(growAt) <= RELAY_POLL_MAX_MS);
assert.equal(relayPollDelayMs(100), RELAY_POLL_MAX_MS, "saturates at the cap");
// Monotonic non-decreasing.
- assert.ok(relayPollDelayMs(6) >= relayPollDelayMs(5));
+ assert.ok(relayPollDelayMs(growAt + 1) >= relayPollDelayMs(growAt));
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| RELAY_POLL_MAX_MS, | |
| RELAY_POLL_MS, | |
| relayPollDelayMs, | |
| resolvePermission, | |
| startToolRelay, | |
| type RelayResponse, | |
| } from "../../src/tools/relay.ts"; | |
| import type { ResolvedToolSpec } from "../../src/protocol.ts"; | |
| describe("relayPollDelayMs (idle backoff)", () => { | |
| it("polls at the base rate while busy, then backs off geometrically up to the cap", () => { | |
| // No idle polls -> base rate. | |
| assert.equal(relayPollDelayMs(0), RELAY_POLL_MS); | |
| assert.equal(relayPollDelayMs(4), RELAY_POLL_MS, "still base before the grow threshold"); | |
| // After the threshold the delay grows but never exceeds the cap. | |
| assert.ok(relayPollDelayMs(5) > RELAY_POLL_MS, "grows once idle"); | |
| assert.ok(relayPollDelayMs(5) <= RELAY_POLL_MAX_MS); | |
| assert.equal(relayPollDelayMs(100), RELAY_POLL_MAX_MS, "saturates at the cap"); | |
| // Monotonic non-decreasing. | |
| assert.ok(relayPollDelayMs(6) >= relayPollDelayMs(5)); | |
| RELAY_POLL_MAX_MS, | |
| RELAY_POLL_MS, | |
| RELAY_POLL_IDLE_GROW_AFTER, | |
| relayPollDelayMs, | |
| resolvePermission, | |
| startToolRelay, | |
| type RelayResponse, | |
| } from "../../src/tools/relay.ts"; | |
| import type { ResolvedToolSpec } from "../../src/protocol.ts"; | |
| describe("relayPollDelayMs (idle backoff)", () => { | |
| it("polls at the base rate while busy, then backs off geometrically up to the cap", () => { | |
| const beforeGrow = Math.max(0, RELAY_POLL_IDLE_GROW_AFTER - 1); | |
| const growAt = RELAY_POLL_IDLE_GROW_AFTER; | |
| // No idle polls -> base rate. | |
| assert.equal(relayPollDelayMs(0), RELAY_POLL_MS); | |
| assert.equal(relayPollDelayMs(beforeGrow), RELAY_POLL_MS, "still base before the grow threshold"); | |
| // After the threshold the delay grows but never exceeds the cap. | |
| assert.ok(relayPollDelayMs(growAt) > RELAY_POLL_MS, "grows once idle"); | |
| assert.ok(relayPollDelayMs(growAt) <= RELAY_POLL_MAX_MS); | |
| assert.equal(relayPollDelayMs(100), RELAY_POLL_MAX_MS, "saturates at the cap"); | |
| // Monotonic non-decreasing. | |
| assert.ok(relayPollDelayMs(growAt + 1) >= relayPollDelayMs(growAt)); |
…honest skip log Address review on #5047: the F1 gate now keys on sandbox !== "local" (isRemoteSandbox) so an unknown or future remote provider (E2B et al.) fails closed with the same loud error instead of silently re-opening the zero-tools drop; reword the message to say "remote sandbox". Replace the dead isPi-guarded file-relay log in mcp.ts (unreachable past the isPi early-return, per Copilot) with an honest "skipped the loopback advertisement" log that never claims a delivery that is not happening. Add a fail-closed unit test for an unknown provider (sandbox=e2b). Client-tool exemption stays; #4985 owns tightening it when client tools move onto the MCP channel. Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…widgets TS-only RenderHint member a client tool (request_connection) stamps so the frontend renders the OAuth/API-key connect dialog when the call pauses. wire.py does not pin RenderHint, so no wire change. Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…isement Collapse the byte-identical camel/snake input_schema accessor + required-field walk (dispatch.ts, relay.ts, extensions/agenta.ts) into tools/spec-schema.ts. Rename PublicToolSpec -> AdvertisedToolSpec (advertisedToolSpecs) to say what it is: the advertisement shape, client tools included. Fix: the internal MCP channel's tools/list read only camelCase s.inputSchema, so every snake-case platform-catalog tool (request_connection, commit_revision) advertised an EMPTY schema to Claude. It now reads through specInputSchema. Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
New engines/sandbox_agent/client-tools.ts owns the client-tool pause once for
both delivery paths: buildClientToolRelay (gate descriptor -> responder verdict
-> latch/markPausedToolCall/emit/recordPendingInteraction), the single
client_tool interaction payload (emitClientToolInteraction), and the ACP
tool-call correlation index Claude's MCP delivery will use. The engine's inline
onClientTool at startToolRelay now consumes the seam (behavior-preserving for
Pi).
responder.ts splits the replayed-conversation store: extractApprovalDecisions
keeps ONLY {approved} envelopes; new extractClientToolOutputs collects raw
browser outputs as a FIFO list per name+args key. Fixes two live bugs: a client
output literally "allow"/"deny" collided with a permission decision in the
shared map, and two identical client calls overwrote each other's stored
output. Peek-vs-take semantics preserved (ACP gate peeks, relay consumes).
Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…annel Claude never saw client tools (request_connection): the internal agenta-tools channel filtered them out of tools/list, so the model could not call them and the browser widget never rendered — a silent drop, Pi-only feature. Now: - mcp-bridge/tool-mcp-http advertise client tools when a clientToolRelay is wired (local Claude) and PAUSE a client tools/call through the shared seam: the handler returns the MCP_PARKED sentinel, the listener destroys the socket with no JSON-RPC body (a result would let Claude settle the call and clobber the pending widget), and the turn ends paused. Required args are validated first so an under-specified call is a normal MCP tool error (model retries). - engine: deferred clientToolRelay ref (the MCP server is built before the responder exists), ACP tool-call correlation index recorded from the event stream (the paused widget attaches to Claude's real tool bubble and its late frames are suppressed), and an mcpAbort AbortController fired from the pause controller's destroy path + the finally so no in-flight tools/call settles after the turn ends. - run-plan (#5047 gate tightened, as its comment assigned to #4985): a non-Pi remote-sandbox run now refuses on ANY custom tool (toolSpecs), client kind included — client tools ride the MCP channel now, so on a remote sandbox they are exactly as undeliverable as gateway tools. Flipped the claude x daytona x client-only test to assert refusal; research.md §4 updated to match. Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
The relay loop polled host.list(relayDir) every 300 ms for the whole turn — on Daytona a remote ls exec ~3x/s, now also for client-only runs that spend the turn waiting on a browser-fulfilled pause with no other tool traffic. After 5 consecutive idle polls the delay grows geometrically to a 1.5 s cap (AGENTA_AGENT_TOOLS_RELAY_POLLING_MAX / _IDLE_GROW_AFTER) and resets on any new request file, so a real tool call is still picked up promptly. Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…c docs Sync the five docs that described the old behavior (client tools filtered out of the internal channel's tools/list): ground-truth + tools.md now describe the shared pause seam and both delivery paths; runner-to-mcp-server.md documents the tools/list advertisement (specInputSchema), the tools/call pause (no-result + abort), the tightened REMOTE_TOOLS gate, and the new owned-by files; the two interface pages update the client-kind comment. Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
Correctness fixes from the two review rounds (Codex xhigh + internal) on the client-tools recut: - Correlation index: normalize the ACP title's mcp__<server>__ prefix on record() so a bare-name lookup() hits (suppression + widget attachment were inert live: Claude titles internal-MCP tools mcp__agenta-tools__<name>). - Consume-on-match: a lookup() consumes its matched id (per-key FIFO, one shared entry per call), so a duplicate identical call correlates to its OWN ACP id instead of re-homing the first, already-settled one. - Drop the ACP kind fallback in record(): kind is a category, not a name. - Single correlation owner: buildClientToolRelay resolves the id once; emitClientToolInteraction no longer takes an index. - Rename MCP_PARKED -> MCP_PAUSED (pause/pendingApproval vocabulary). - Move the ClientToolRelay contract to a pure type module (tools/client-tool-relay.ts); relay.ts keeps a compat re-export for now (its own hunks are entangled with a co-session's claimed relay.ts edits). - Comment honesty: the seam owns the RELAY/MCP emit while acp-interactions owns the ACP-gate emit; the MCP abort destroys sockets, not executions (signal-threading into dispatch is a known follow-up). - New tests: prefixed-title lookup, FIFO consumption, duplicate same-id POST after a pause is also aborted and never answered. Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
…ss denied) _rules_from_tool_specs skipped kind == "client" from the days when client tools never crossed the agenta-tools channel. On this branch they DO: the runner advertises them over the internal MCP server and pauses tools/call for the browser. Without a rule, Claude's own permission gate fires first and the call falls to the ACP path, bypassing the new pause path. Client tools now render mcp__agenta-tools__<name> rules: deny when the effective permission is deny, otherwise allow — including for explicit "ask" and for unset. The runner-side seam is the authoritative gate for a client tool: pausing for the browser IS the ask flow, so a Claude-side ask would only duplicate that gate in a worse place. Executable (callback/code) behavior is unchanged. Tests: allow-by-default, allow-on-ask, deny-on-deny, plain-dict client renders its rule; non-client assertions untouched. Claude-Session: https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT
097d14e to
51f0e3f
Compare
mmabrouk
left a comment
There was a problem hiding this comment.
Review guide (Claude). Five inline comments below walk the load-bearing spots in reading order. Everything else in the diff is mechanical fallout (renames, extracted helpers, doc sync). The PR body has the full recut story and the follow-up list.
| * `agenta-tools`, contains no `__`; a server name that did would truncate ambiguously). | ||
| */ | ||
| function bareToolName(title: string): string { | ||
| return title.replace(/^mcp__.+?__/, ""); |
There was a problem hiding this comment.
1. The correlation index, and the bug the internal review caught.
This index maps a tool call back to the ACP frame Claude emitted for it, so the pause can suppress that frame's failure noise (F-024) and the connect widget attaches to the right bubble. The original recut recorded Claude's raw titles (mcp__agenta-tools__request_connection) but looked up bare names — every lookup missed, silently, because a UUID fallback exists. This strip-on-record fix plus consume-on-match (a matched id is retired, so a duplicate identical call correlates to its own frame, not the first one's) are both pinned by tests that feed the prefixed titles Claude actually sends.
|
|
||
| /** A client-tool fulfillment output for this exact call, consumed on first take. */ | ||
| /** The next FIFO client-tool output for this exact call, consumed on take. */ | ||
| takeClientOutput(gate: GateDescriptor): { found: boolean; output?: unknown } { |
There was a problem hiding this comment.
2. The FIFO client-output store.
Browser answers for client tools now live in their own per-key FIFO, separate from approval decisions. Two live bugs die here: an answer that is literally the string "allow" was misread as a permission decision, and two identical calls overwrote each other's answer. peek (used by the ACP permission gate) never advances the cursor; take (used when the tool call is actually fulfilled) does. The store is rebuilt from history each request, so the cursor resets per turn and the answer survives to the resume turn — the internal review specifically checked this and found it sound.
| * interaction (`onClientTool`) and the handler then ends the turn (`onPause` -> the engine's | ||
| * pause controller), so the turn ends `paused`. | ||
| */ | ||
| const MCP_PAUSED = Symbol("mcp-paused"); |
There was a problem hiding this comment.
3. Pause by socket destroy.
A JSON-RPC error would settle the tool call, so the handler answers a pausing client tool with nothing: no result, socket destroyed. Claude cannot settle the call and the turn ends paused. This depends on the MCP SDK not retrying POSTs, which Codex verified against the SDK source and a duplicate-POST test now pins at the handler level. Known honest limitation (comment says so): destroying the socket does not cancel a server-side execution already in flight; that AbortSignal threading is a tracked follow-up.
| // client tools now ride the same internal MCP channel on local Claude (advertised + paused in | ||
| // `tools/call`), so on a remote sandbox they are exactly as undeliverable as gateway tools — | ||
| // the model would never see or be able to call them. | ||
| if (!isPi && isRemoteSandbox && toolSpecs.length > 0) { |
There was a problem hiding this comment.
4. The gate tightening you pre-approved on #5047.
Client tools now ride the MCP channel, and that channel is unreachable from a remote sandbox, so the client-tool exemption would have reopened the silent drop. The gate now counts all custom tools for non-Pi remote runs. Pi stays exempt: its in-sandbox extension delivers client tools over the file relay on Daytona today. The old claude-x-daytona-x-client-only test flipped to assert refusal.
| spec.permission, spec.read_only, permission_default | ||
| ) | ||
| rule = f"mcp__{INTERNAL_TOOL_MCP_SERVER}__{spec.name}" | ||
| if spec.kind == "client": |
There was a problem hiding this comment.
5. The Codex blocker: Claude needs permission rules for client tools.
Before this fix, client tools got no mcp__agenta-tools__<name> rule, so Claude raised its own generic permission prompt before tools/call and the new pause path never ran. Now they render allow unless explicitly denied. The reasoning for allow-even-on-ask: the runner-side gate is authoritative, and for a client tool its pause IS the ask — the browser answers it. A Claude-side ask would duplicate the same gate in a worse place. Deny still renders deny, so a policy-blocked tool is refused before a round trip.
What this does
Client tools are browser-fulfilled tools like
request_connection: the model calls the tool, the run pauses, a human answers in the Agenta UI, and the next turn resumes with the stored answer. Before this PR they worked on the Pi harness only. On Claude the tool was filtered out of the MCP advertisement, so the model never saw it and the run silently delivered nothing.After this PR, Claude sees client tools over the internal
agenta-toolsMCP server and pauses correctly when it calls one. Both harnesses now pause through one shared seam instead of two divergent code paths.How the pause works on Claude
Claude calls the tool over loopback HTTP (JSON-RPC
tools/call). Answering with an error would settle the call, so the runner instead emits no result and destroys the socket (MCP_PAUSED). Claude cannot settle the call, the turn ends paused, and the browser answer resumes it next turn. The MCP SDK does not retry POSTs, and a duplicate-POST regression test pins that assumption.This is a recut, not a rebase
The original branch predates the approval-boundary merge, which rewrote every seam this PR touches (responder, permissions, pausing). The branch was rebuilt commit-by-commit against the new architecture:
ConversationDecisions.Reviewed before reaching you
Codex (xhigh) plus an internal deep-review pass ran against the branch; both blockers they found are fixed here:
tools/call, bypassing the new pause path. Fixed inclaude_settings.py: client tools renderallowunless explicitly denied — the runner-side gate is the authoritative one, and its pause is the ask flow for a client tool.mcp__agenta-tools__<name>titles but was queried with bare names, so pause suppression (F-024) and widget attachment silently fell back to a UUID no ACP frame carries. Fixed: titles are normalized on record, and lookups consume the matched id so a duplicate identical call correlates to its own frame. Both fixes carry regression tests that would have caught the bugs.Also in the fix pass:
MCP_PARKEDrenamed toMCP_PAUSED, theClientToolRelaytype moved to a neutral module, and over-claiming comments corrected.What to review (inline comments mark each spot)
Read in this order: the seam (
client-tools.ts), the FIFO store (responder.ts), the pause mechanism (tool-mcp-http.ts), the gate change (run-plan.ts), the settings change (claude_settings.py). Five inline comments walk you through the reasoning at each spot.Gate change you already ruled on
#5047's remote-tools gate exempted client tools because they never rode the MCP channel. This PR makes them ride it, so the exemption would reopen the silent drop on remote sandboxes. The gate now counts all custom tools for non-Pi remote runs; Pi stays exempt (its extension delivers via the file relay). The old "allows claude x daytona x client-only" test now asserts refusal.
Tests
Runner: 516 passed (42 files), typecheck clean. Python SDK: 515 passed in the agents suite, wire contract unchanged (goldens untouched).
Known follow-ups (deliberate, not forgotten)
relay.ts) are held back because a concurrent lane (pi-builtin-gating) has uncommitted work in those files; they land as a tiny follow-up once that lane commits.tools/callwith a paused sibling call is latent (Claude does not batch today) — documented, not fixed.request_connectionround trip) still pending; the internal reviewer recommends it before merge and I agree.https://claude.ai/code/session_01HhBEUFbXETNjYdcz71AGrT