Skip to content
190 changes: 129 additions & 61 deletions docs/toolhive/concepts/observability.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -31,56 +31,75 @@ Here's what a trace looks like when a client calls a tool in the GitHub MCP
server (some fields omitted for brevity):

```text
Span: mcp.tools/call (150ms)
├── service.name: toolhive-mcp-proxy
Span: tools/call create_issue (150ms)
├── service.name: thv-github
├── service.version: v0.1.9
├── http.duration_ms: 150.3
├── http.host: localhost:14972
├── http.method: POST
├── http.request_content_length: 256
├── http.response_content_length: 1024
├── http.status_code: 202
├── http.url: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
├── http.user_agent: claude-code/1.0.53
├── mcp.method: tools/call
├── mcp.request.id: 5
├── http.request.method: POST
├── http.request.body.size: 256
├── http.response.status_code: 202
├── http.response.body.size: 1024
├── url.full: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
├── url.path: /messages
├── url.scheme: https
├── server.address: localhost:14972
├── user_agent.original: claude-code/1.0.53
├── mcp.method.name: tools/call
├── mcp.server.name: github
├── mcp.tool.arguments: owner=stacklok, repo=toolhive, pullNumber=1131
├── mcp.tool.name: create_issue
└── mcp.transport: stdio
├── mcp.session.id: abc123
├── rpc.system.name: jsonrpc
├── jsonrpc.protocol.version: 2.0
├── jsonrpc.request.id: 5
├── gen_ai.tool.name: create_issue
├── gen_ai.operation.name: execute_tool
├── gen_ai.tool.call.arguments: owner=stacklok, repo=toolhive, pullNumber=1131
├── network.transport: tcp
└── network.protocol.name: http
```

### MCP-specific traces

ToolHive automatically captures traces for all MCP operations, including:

- **Tool calls** (`mcp.tools/call`) - When AI assistants use tools
- **Resource access** (`mcp.resources/read`) - When servers read files or data
- **Prompt operations** (`mcp.prompts/get`) - When servers retrieve prompts
- **Connection events** (`mcp.initialize`) - When clients connect to servers
- **Tool calls** (`tools/call`) - When AI assistants use tools
- **Resource access** (`resources/read`) - When servers read files or data
- **Prompt operations** (`prompts/get`) - When servers retrieve prompts
- **Connection events** (`initialize`) - When clients connect to servers

### Trace attributes

Each trace includes detailed context across three layers:
Each trace includes detailed context across several layers:

#### Service information

```text
service.name: toolhive-mcp-proxy
service.name: thv-github
service.version: v0.1.9
host.name: my-machine
```

#### HTTP layer information
#### HTTP layer

```text
http.duration_ms: 150.3
http.host: localhost:14972
http.method: POST
http.request_content_length: 256
http.response_content_length: 1024
http.status_code: 202
http.url: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
http.user_agent: claude-code/1.0.53
http.request.method: POST
http.request.body.size: 256
http.response.status_code: 202
http.response.body.size: 1024
url.full: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
url.path: /messages
url.scheme: https
url.query: session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
server.address: localhost:14972
user_agent.original: claude-code/1.0.53
```

#### Network layer

```text
network.transport: tcp
network.protocol.name: http
network.protocol.version: 1.1
client.address: 127.0.0.1
client.port: 52431
```

#### MCP protocol details
Expand All @@ -89,31 +108,41 @@ Details about the MCP operation being performed (some fields are specific to
each operation):

```text
mcp.client.name: claude-code
mcp.method: tools/call
mcp.request.id: 123
mcp.method.name: tools/call
mcp.server.name: github
mcp.tool.arguments: owner=stacklok, repo=toolhive, path=pkg/telemetry/middleware.go, start_index=130, max_length=1000
mcp.tool.name: get_file_contents
mcp.transport: stdio
rpc.service: mcp
rpc.system: jsonrpc
mcp.session.id: abc123
mcp.protocol.version: 2025-03-26
rpc.system.name: jsonrpc
jsonrpc.protocol.version: 2.0
jsonrpc.request.id: 123
```

#### Method-specific attributes

- **`mcp.tools/call`** traces include:
- `mcp.tool.name` - The name of the tool being called
- `mcp.tool.arguments` - Sanitized tool arguments (sensitive values redacted)
- **`tools/call`** traces include:
- `gen_ai.tool.name` - The name of the tool being called
- `gen_ai.operation.name` - Set to `execute_tool`
- `gen_ai.tool.call.arguments` - Sanitized tool arguments (sensitive values
redacted)

- **`mcp.resources/read`** traces include:
- **`resources/read`** traces include:
- `mcp.resource.uri` - The URI of the resource being accessed

- **`mcp.prompts/get`** traces include:
- `mcp.prompt.name` - The name of the prompt being retrieved
- **`prompts/get`** traces include:
- `gen_ai.prompt.name` - The name of the prompt being retrieved

- **`initialize`** traces include:
- `mcp.client.name` - The name of the connecting client (always emitted)
- `mcp.protocol.version` - The MCP protocol version negotiated

- **`mcp.initialize`** traces include:
- `mcp.client.name` - The name of the connecting client
:::note[Legacy attribute names]

By default, ToolHive emits both the new OpenTelemetry semantic convention
attribute names shown above and legacy attribute names (e.g., `http.method`,
`mcp.method`, `mcp.tool.name`) for backward compatibility with existing
dashboards. You can control this with the `--otel-use-legacy-attributes` flag.

:::

## Metrics collection

Expand Down Expand Up @@ -171,6 +200,45 @@ toolhive_mcp_tool_calls_total{server="github",status="success",tool="list_pull_r
toolhive_mcp_tool_calls_total{server="github",status="success",tool="search_issues"} 2
```

### MCP semantic convention metrics

In addition to the ToolHive-prefixed metrics above, ToolHive emits metrics that
follow the
[OpenTelemetry MCP semantic conventions](https://github.com/open-telemetry/semantic-conventions):

| Metric | Type | Description |
| ------------------------------- | --------- | ---------------------------------------- |
| `mcp.server.operation.duration` | Histogram | Duration of MCP server operations |
| `mcp.client.operation.duration` | Histogram | Duration of MCP client operations (vMCP) |

These metric names follow the OpenTelemetry MCP semantic conventions in OTLP
exports and use the same labels as the ToolHive-prefixed metrics. When exposed
via the Prometheus `/metrics` endpoint, their names are converted to
Prometheus-safe form by replacing dots (`.`) with underscores (`_`):

- `mcp.server.operation.duration` → `mcp_server_operation_duration`
- `mcp.client.operation.duration` → `mcp_client_operation_duration`

### vMCP metrics

When using Virtual MCP Server (vMCP), additional metrics are available for
monitoring backend operations, workflow executions, and optimizer performance.
For details, see the
[vMCP telemetry guide](../guides-vmcp/telemetry-and-metrics.mdx).

## Trace context propagation

ToolHive supports two methods of trace context propagation:

- **HTTP headers**: Standard W3C Trace Context (`traceparent` and `tracestate`
headers) and W3C Baggage propagation
- **MCP `_meta` field**: Trace context embedded in MCP request parameters via
the `params._meta` field, following the MCP specification

When both are present, the MCP `_meta` trace context takes priority. This
enables proper trace correlation across MCP server boundaries, even when MCP
clients inject trace context into the request payload rather than HTTP headers.

## Export options

ToolHive supports multiple export formats to integrate with your existing
Expand Down Expand Up @@ -217,13 +285,13 @@ ToolHive automatically protects sensitive information in traces:
For example, a tool call with sensitive arguments:

```text
mcp.tool.arguments: password=secret123, api_key=abc456, title=Bug report
gen_ai.tool.call.arguments: password=secret123, api_key=abc456, title=Bug report
```

Is sanitized in the trace as:
ToolHive sanitizes this in the trace as:

```text
mcp.tool.arguments: password=[REDACTED], api_key=[REDACTED], title=Bug report
gen_ai.tool.call.arguments: password=[REDACTED], api_key=[REDACTED], title=Bug report
```

## Monitoring examples
Expand Down Expand Up @@ -254,15 +322,15 @@ When a client calls the `create_issue` tool:
**Generated trace**:

```text
Span: mcp.tools/call
├── mcp.method: tools/call
├── mcp.request.id: req_456
├── mcp.tool.name: create_issue
├── mcp.tool.arguments: title=Bug report, body=Found an issue with the API
Span: tools/call create_issue
├── mcp.method.name: tools/call
├── jsonrpc.request.id: req_456
├── gen_ai.tool.name: create_issue
├── gen_ai.tool.call.arguments: title=Bug report, body=Found an issue with...
├── mcp.server.name: github
├── mcp.transport: sse
├── http.method: POST
├── http.status_code: 200
├── network.transport: tcp
├── http.request.method: POST
├── http.response.status_code: 200
└── duration: 850ms
```

Expand All @@ -281,10 +349,10 @@ Failed requests generate error traces and metrics:
**Error trace**:

```text
Span: mcp.tools/call
├── mcp.method: tools/call
├── mcp.tool.name: invalid_tool
├── http.status_code: 400
Span: tools/call invalid_tool
├── mcp.method.name: tools/call
├── gen_ai.tool.name: invalid_tool
├── http.response.status_code: 400
├── span.status: ERROR
├── span.status_message: Tool not found
└── duration: 12ms
Expand Down
32 changes: 17 additions & 15 deletions docs/toolhive/guides-cli/telemetry-and-metrics.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -50,9 +50,9 @@ scheme or path (e.g., use `api.honeycomb.io` or `api.honeycomb.io:443`, not

:::

By default, the service name is set to `toolhive-mcp-proxy`, and the sampling
rate is `0.1` (10%). You can customize these settings with additional
[configuration options](#configuration-options).
By default, the service name is set to `thv-<server-name>` (e.g., `thv-fetch`),
and the sampling rate is `0.1` (10%). You can customize these settings with
additional [configuration options](#configuration-options).

:::tip[Recommendation]

Expand Down Expand Up @@ -218,21 +218,23 @@ thv run [--otel-endpoint <URL>] [--otel-service-name <NAME>] \
[--otel-sampling-rate <RATE>] [--otel-headers <KEY=VALUE>] \
[--otel-custom-attributes <KEY=VALUE>] [--otel-env-vars <VAR1,VAR2>] \
[--otel-insecure] [--otel-enable-prometheus-metrics-path] \
[--otel-use-legacy-attributes=<true|false>] \
<SERVER>
```

| Flag | Description | Default |
| --------------------------------------- | ------------------------------------------------------------- | -------------------- |
| `--otel-endpoint` | OTLP endpoint (e.g., `api.honeycomb.io`) | None |
| `--otel-metrics-enabled` | Enable OTLP metrics export (when OTLP endpoint is configured) | `true` |
| `--otel-tracing-enabled` | Enable distributed tracing (when OTLP endpoint is configured) | `true` |
| `--otel-service-name` | Service name for telemetry | `toolhive-mcp-proxy` |
| `--otel-sampling-rate` | Trace sampling rate (0.0-1.0) | `0.1` (10%) |
| `--otel-headers` | Authentication headers in `key=value` format | None |
| `--otel-custom-attributes` | Custom resource attributes in `key=value` format | None |
| `--otel-env-vars` | List of environment variables to include in telemetry spans | None |
| `--otel-insecure` | Connect using HTTP instead of HTTPS | `false` |
| `--otel-enable-prometheus-metrics-path` | Enable `/metrics` endpoint | `false` |
| Flag | Description | Default |
| --------------------------------------- | ------------------------------------------------------------------- | -------------- |
| `--otel-endpoint` | OTLP endpoint (e.g., `api.honeycomb.io`) | None |
| `--otel-metrics-enabled` | Enable OTLP metrics export (when OTLP endpoint is configured) | `true` |
| `--otel-tracing-enabled` | Enable distributed tracing (when OTLP endpoint is configured) | `true` |
| `--otel-service-name` | Service name for telemetry | `thv-<SERVER>` |
| `--otel-sampling-rate` | Trace sampling rate (0.0-1.0) | `0.1` (10%) |
| `--otel-headers` | Authentication headers in `key=value` format | None |
| `--otel-custom-attributes` | Custom resource attributes in `key=value` format | None |
| `--otel-env-vars` | List of environment variables to include in telemetry spans | None |
| `--otel-insecure` | Connect using HTTP instead of HTTPS | `false` |
| `--otel-enable-prometheus-metrics-path` | Enable `/metrics` endpoint | `false` |
| `--otel-use-legacy-attributes` | Emit legacy attribute names alongside new OTel semantic conventions | `true` |

### Global configuration

Expand Down
55 changes: 54 additions & 1 deletion docs/toolhive/guides-vmcp/telemetry-and-metrics.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ spec:
serviceName: 'my-vmcp'
insecure: true
tracingEnabled: true
samplingRate: '0.1'
samplingRate: '0.05'
metricsEnabled: true
enablePrometheusMetricsPath: true
# highlight-end
Expand Down Expand Up @@ -114,6 +114,59 @@ vMCP supports two methods for collecting metrics:
- **Pull via Prometheus**: Set `enablePrometheusMetricsPath: true` to expose a
`/metrics` endpoint on the vMCP service port (4483) for Prometheus to scrape

### Backend metrics

These metrics track requests to individual MCP server backends:

| Metric | Type | Description |
| ----------------------------------------- | --------- | ----------------------------------------------------------------------------- |
| `toolhive_vmcp_backends_discovered` | Gauge | Number of backends discovered |
| `toolhive_vmcp_backend_requests` | Counter | Total requests per backend |
| `toolhive_vmcp_backend_errors` | Counter | Total errors per backend |
| `toolhive_vmcp_backend_requests_duration` | Histogram | Duration of backend requests |
| `mcp.client.operation.duration` | Histogram | MCP client operation duration (`mcp_client_operation_duration` on `/metrics`) |

### Workflow metrics

These metrics track workflow execution across backends:

| Metric | Type | Description |
| ----------------------------------- | --------- | ------------------------------- |
| `toolhive_vmcp_workflow_executions` | Counter | Total workflow executions |
| `toolhive_vmcp_workflow_errors` | Counter | Total workflow execution errors |
| `toolhive_vmcp_workflow_duration` | Histogram | Duration of workflow executions |

### Optimizer metrics

When the vMCP optimizer is enabled, these metrics track tool-finding and
tool-calling performance:

| Metric | Type | Description |
| ----------------------------------------------- | --------- | --------------------------------------- |
| `toolhive_vmcp_optimizer_find_tool_requests` | Counter | Total FindTool calls |
| `toolhive_vmcp_optimizer_find_tool_errors` | Counter | Total FindTool errors |
| `toolhive_vmcp_optimizer_find_tool_duration` | Histogram | Duration of FindTool calls |
| `toolhive_vmcp_optimizer_find_tool_results` | Histogram | Number of tools returned per call |
| `toolhive_vmcp_optimizer_token_savings_percent` | Histogram | Token savings percentage per call |
| `toolhive_vmcp_optimizer_call_tool_requests` | Counter | Total CallTool calls |
| `toolhive_vmcp_optimizer_call_tool_errors` | Counter | Total CallTool errors |
| `toolhive_vmcp_optimizer_call_tool_not_found` | Counter | CallTool calls where tool was not found |
| `toolhive_vmcp_optimizer_call_tool_duration` | Histogram | Duration of CallTool calls |

## Distributed tracing

vMCP creates client-side spans for backend operations with the following span
names:

- `tools/call <tool_name>` - Tool calls to backends
- `resources/read` - Resource reads from backends
- `prompts/get <prompt_name>` - Prompt retrieval from backends
- `list_capabilities` - Backend capability discovery

Each span includes attributes for the target backend (`target.workload_id`,
`target.workload_name`, `target.base_url`) and the relevant MCP attributes
(`mcp.method.name`, `gen_ai.tool.name`, `mcp.resource.uri`).

## Related information

- [Observability concepts](../concepts/observability.mdx) - Overview of
Expand Down