Merged
21 changes: 21 additions & 0 deletions .changeset/otel-linked-trace-mode.md
@@ -0,0 +1,21 @@
---
'@workflow/core': minor
'workflow': minor
'@workflow/world-vercel': minor
'@workflow/utils': minor
---

Add `WORKFLOW_TRACE_MODE` with a new `linked` default: each workflow/step invocation span is now its own trace root with span links to the delivery and run-origin contexts, instead of one trace spanning the entire run. world-vercel now explicitly injects W3C `traceparent`/`tracestate`/`baggage` headers on outgoing workflow-server requests.

Span names are also friendlier: workflow and step spans now use the short function name (e.g. `workflow.execute processOrder`, `step.execute chargeCard`, `workflow.start processOrder`) instead of the uppercase prefixes and full machine names (`WORKFLOW_V2 workflow//./src/jobs/order//processOrder`). The full name remains available in the `workflow.name` / `step.name` span attributes, and new `workflowDisplayName` / `stepDisplayName` helpers are exported from `@workflow/utils`.

Behavioral changes to telemetry under the new default (set `WORKFLOW_TRACE_MODE=continuous` to restore the previous trace shape exactly; the span-name change applies in both modes):

- A run no longer shares one trace ID: the trace of the request that called `start()` no longer contains the workflow's execution spans — navigate via span links or the `workflow.run.id` attribute instead.
- Sampling decisions are made independently per invocation root (previously one parent-based decision covered the whole run), and the number of root spans/traces increases to one per invocation.
- `workflow.execute`/`step.execute` invocation spans (formerly `WORKFLOW_V2`/`STEP`) become parentless roots, which changes parent/child-based queries and service-map edges.
- Re-enqueued queue messages forward the original run-origin trace carrier unchanged, rather than each invocation's current context.
- Queries or dashboards matching the old `WORKFLOW_V2 ...`/`STEP ...` span names must switch to the new names.
- The queue-delivered `workflow.execute` span kind changed from `internal` to `consumer`, matching the queue-delivered `step.execute` span (this applies in both modes).

Existing attributes and baggage keys are unchanged, and everything remains a no-op when no OpenTelemetry SDK is registered.
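
For saved queries that still match the old span names, the rename can be sketched as a small mapping function. This is an illustration only: `shortName` and `migrateSpanName` are hypothetical helpers, not the `workflowDisplayName` / `stepDisplayName` exports, and the sketch assumes machine names end in `//<functionName>` as in the `processOrder` example above (the step machine-name format is assumed to follow the same pattern).

```typescript
// Hypothetical helper: takes the segment after the last "//" as the short name.
// Falls back to the input when no "//" separator is present.
function shortName(machineName: string): string {
  const idx = machineName.lastIndexOf("//");
  return idx === -1 ? machineName : machineName.slice(idx + 2);
}

// Hypothetical helper: rewrites a legacy span name to the new naming scheme.
// Names without a known legacy prefix are returned unchanged.
function migrateSpanName(oldName: string): string {
  const prefixes: Array<[string, string]> = [
    ["WORKFLOW_V2 ", "workflow.execute"],
    ["STEP ", "step.execute"],
  ];
  for (let i = 0; i < prefixes.length; i++) {
    const legacy = prefixes[i][0];
    const current = prefixes[i][1];
    if (oldName.indexOf(legacy) === 0) {
      return current + " " + shortName(oldName.slice(legacy.length));
    }
  }
  return oldName;
}
```

Running existing dashboard filters through a mapping like this is one way to find every query that needs updating before the release is rolled out.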
3 changes: 3 additions & 0 deletions docs/content/docs/v5/observability/index.mdx
@@ -67,6 +67,9 @@ When deployed to Vercel, workflow data is [encrypted end-to-end](/docs/how-it-wo
## More Observability Features

<Cards>
<Card href="/docs/observability/tracing" title="Tracing">
Distributed tracing with OpenTelemetry for workflow runs, steps, and queue deliveries.
</Card>
<Card href="/docs/observability/attributes" title="Attributes">
Attach experimental metadata to workflow runs for observability.
</Card>
5 changes: 4 additions & 1 deletion docs/content/docs/v5/observability/meta.json
@@ -1,4 +1,7 @@
 {
   "title": "Observability",
-  "pages": ["attributes"]
+  "pages": [
+    "tracing",
+    "attributes"
+  ]
 }
106 changes: 106 additions & 0 deletions docs/content/docs/v5/observability/tracing.mdx
@@ -0,0 +1,106 @@
---
title: Tracing
description: Distributed tracing with OpenTelemetry for workflow runs, steps, and queue deliveries.
type: guide
summary: Trace workflow execution end to end with OpenTelemetry.
prerequisites:
- /docs/foundations/workflows-and-steps
related:
- /docs/observability
- /docs/observability/attributes
- /docs/how-it-works/event-sourcing
---

The Workflow SDK is instrumented with [OpenTelemetry](https://opentelemetry.io) out of the box. It emits spans for workflow starts, every workflow and step invocation, and the HTTP calls it makes to the workflow backend — and it propagates trace context across queue deliveries so a run remains traceable end to end.

The SDK depends only on the OpenTelemetry **API**, never on an SDK or exporter. If your application does not register an OpenTelemetry SDK, all tracing code is a silent no-op with negligible overhead and no behavior change.

## Enabling tracing

Register any OpenTelemetry Node SDK in your application. On Vercel with Next.js, the simplest setup is [`@vercel/otel`](https://vercel.com/docs/observability/otel-overview) in `instrumentation.ts`:

```typescript title="instrumentation.ts" lineNumbers
import { registerOTel } from "@vercel/otel"

export function register() {
  registerOTel({ serviceName: "my-app" })
}
```

No workflow-specific configuration is required. As soon as a tracer provider and propagator are registered, the SDK's spans, context propagation, and span links activate automatically.

## Spans

| Span name | Kind | Emitted when |
| --- | --- | --- |
| `workflow.start <name>` | internal | `start()` is called in your application code |
| `workflow.execute <name>` | consumer (root) | a queue delivery invokes the workflow — replay, orchestration, and inline steps run under it |
| `step.execute <name>` | internal (inline) / consumer + root (queue-delivered) | a step function executes |
| `http <method>` | client | the SDK calls the workflow backend (event reads/writes) |

`<name>` is the short function name (for example `processOrder`); the full machine name, including the source module, is available in the `workflow.name` / `step.name` attributes.

## Key attributes

| Attribute | Description |
| --- | --- |
| `workflow.run.id` | The run ID (`wrun_...`). Present on every workflow and step span — the primary key for finding all spans of a run. |
| `workflow.name` | The full machine name of the workflow function, including the source module. |
| `workflow.trace.mode` | The active trace mode (`linked` or `continuous`). |
| `workflow.trace.propagated` | Whether the invocation received trace context from the queue message. |
| `workflow.queue.overhead_ms` | Time between the message being enqueued and the handler starting — queue dwell plus any cold start. |

## Trace shape: one trace per invocation

A single workflow run can span hours or days across many separate function invocations: every step completion, `sleep()` wake-up, and retry is a new queue delivery. Stitching all of that into one trace produces giant, slow-loading traces that most tracing backends truncate.

Instead, the SDK creates **one bounded trace per invocation**. Each `workflow.execute` (or queue-delivered `step.execute`) span starts a new trace root and attaches two **span links**:

- a link to the **enqueue site** — the span that queued the message which triggered this invocation, and
- a link to the **run origin** — the trace in which `start()` was originally called.

A span link is OpenTelemetry's relationship for "causally related, but in a different trace." It is the standard pattern for asynchronous messaging, where producing and consuming a message can be separated by arbitrary time.

```mermaid
flowchart LR
O["start() request trace"]
A["invocation 1"]
B["invocation 2"]
C["invocation 3 ..."]
A -. "link" .-> O
B -. "link" .-> O
C -. "link" .-> O
B -. "link" .-> A
C -. "link" .-> B

style O fill:#a78bfa,stroke:#8b5cf6,color:#000
```

Each invocation links back to the trace that enqueued it and to the run origin.

To see a whole run, query by attribute rather than by trace ID — for example `workflow.run.id = wrun_...` in your tracing backend — or follow the span links between invocation traces.
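
As a rough sketch of what "query by attribute" means, the snippet below reassembles a run from spans that live in different traces. The `ExportedSpan` shape and both function names are assumptions made for illustration; a real tracing backend would run the equivalent filter server-side.

```typescript
// Minimal span shape for this sketch; real exporters expose more fields.
interface ExportedSpan {
  traceId: string;
  name: string;
  startTimeMs: number;
  attributes: Record<string, string>;
}

// Collects every span of one run across traces, ordered by start time.
// filter() returns a new array, so the input is not mutated by sort().
function spansForRun(spans: ExportedSpan[], runId: string): ExportedSpan[] {
  return spans
    .filter((s) => s.attributes["workflow.run.id"] === runId)
    .sort((a, b) => a.startTimeMs - b.startTimeMs);
}

// Counts the distinct trace IDs among the given spans.
function traceCount(spans: ExportedSpan[]): number {
  const seen: Record<string, boolean> = {};
  let count = 0;
  for (let i = 0; i < spans.length; i++) {
    if (!seen[spans[i].traceId]) {
      seen[spans[i].traceId] = true;
      count++;
    }
  }
  return count;
}
```

In `linked` mode, `traceCount(spansForRun(all, runId))` is expected to be greater than one for any run with more than one invocation, which is why a trace-ID lookup alone no longer shows the whole run.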

## Trace modes

The `WORKFLOW_TRACE_MODE` environment variable controls the shape:

| Mode | Behavior |
| --- | --- |
| `linked` (default) | Each invocation is its own trace root with span links to the enqueue site and the run origin. Traces stay small; sampling is decided per invocation. |
| `continuous` | The run-origin context becomes the **parent** of every invocation, so the entire run shares one trace ID. |

<Callout type="warn">
This is a behavior change from v4, which always used `continuous`-style tracing. If you have dashboards or queries that assume one trace ID per run, either update them to use `workflow.run.id` and span links, or set `WORKFLOW_TRACE_MODE=continuous` to restore the previous shape. Note that in `linked` mode each invocation root makes its own sampling decision, and the number of root spans increases to one per invocation.
</Callout>
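
The difference between the two modes can be modeled as a choice of parent and links for each invocation span. The sketch below is a simplified model, not the SDK's implementation: `Ctx`, `SpanPlan`, `resolveTraceMode`, and `planInvocationSpan` are hypothetical names, and the fallback to `linked` for unset or unrecognized values is an assumption based on `linked` being the documented default.

```typescript
type TraceMode = "linked" | "continuous";

// Simplified stand-in for an OpenTelemetry span context.
interface Ctx {
  traceId: string;
  spanId: string;
}

interface SpanPlan {
  parent: Ctx | undefined; // undefined means the span starts a new trace
  links: Ctx[];
}

// Assumption: anything other than "continuous" resolves to the default.
function resolveTraceMode(raw: string | undefined): TraceMode {
  return raw === "continuous" ? "continuous" : "linked";
}

function planInvocationSpan(
  mode: TraceMode,
  enqueueSite: Ctx | undefined,
  runOrigin: Ctx | undefined
): SpanPlan {
  if (mode === "continuous") {
    // The run origin is the parent, so every invocation shares its trace ID.
    // Links in this mode are not described in the text and are left empty here.
    return { parent: runOrigin, links: [] };
  }
  // linked: a new trace root, linked to whichever contexts were propagated.
  const links: Ctx[] = [];
  if (enqueueSite) links.push(enqueueSite);
  if (runOrigin) links.push(runOrigin);
  return { parent: undefined, links: links };
}
```

Modeling the plan as plain data makes the sampling consequence visible: with `parent` undefined, a parent-based sampler has nothing to inherit from, so each invocation root gets its own decision.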

## Context propagation

When tracing is enabled, the SDK propagates [W3C Trace Context](https://www.w3.org/TR/trace-context/) on its outbound calls:

- **Backend requests** carry `traceparent`, `tracestate`, and `baggage` headers, so backend spans can join your trace.
- **Queue messages** carry the run-origin trace context in the message payload, and the queue re-delivers the producer's context to the workflow handler, where it becomes the enqueue-site span link.
- **Baggage** carries `workflow.run_id` and `workflow.name` entries during workflow execution, allowing downstream services you call from steps to tag their own telemetry with the run ID.
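
For readers unfamiliar with the header format, here is a minimal sketch of parsing a version `00` `traceparent` value as defined by W3C Trace Context. `parseTraceparent` is a hypothetical helper written for this page; in practice the registered OpenTelemetry propagator does this work, and the sketch deliberately ignores future header versions.

```typescript
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Parses "00-<trace-id>-<parent-id>-<flags>"; returns undefined if invalid.
function parseTraceparent(header: string): TraceParent | undefined {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!m) return undefined;
  const traceId = m[1];
  const parentId = m[2];
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  const sampled = (parseInt(m[3], 16) & 0x01) === 1;
  return { traceId: traceId, parentId: parentId, sampled: sampled };
}
```

The `traceId` and `parentId` recovered this way are exactly what a span link needs to point at a span in another trace.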

<Callout>
Baggage entries set by your application are propagated as a `baggage` HTTP header on the SDK's backend requests, like any other OpenTelemetry-instrumented HTTP call. Avoid placing sensitive values in baggage.
</Callout>
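
A downstream service that is not instrumented with OpenTelemetry could still read the run ID from the `baggage` header by hand. The following is a minimal sketch under that assumption: `parseBaggage` is a hypothetical helper, it ignores member properties (the part after `;`), and the `wrun_123` value in the usage note is a made-up example.

```typescript
// Parses a W3C baggage header into a key/value record.
// Malformed members (no "=" or an empty key) are skipped.
function parseBaggage(header: string): Record<string, string> {
  const entries: Record<string, string> = {};
  const members = header.split(",");
  for (let i = 0; i < members.length; i++) {
    const keyValue = members[i].split(";")[0]; // drop optional properties
    const eq = keyValue.indexOf("=");
    if (eq <= 0) continue;
    const key = keyValue.slice(0, eq).trim();
    if (!key) continue;
    const rawValue = keyValue.slice(eq + 1).trim();
    try {
      entries[key] = decodeURIComponent(rawValue); // values are percent-encoded
    } catch (_err) {
      entries[key] = rawValue; // keep the raw value if decoding fails
    }
  }
  return entries;
}
```

For example, `parseBaggage("workflow.run_id=wrun_123,workflow.name=processOrder")["workflow.run_id"]` yields the run ID, which the service can attach to its own logs or metrics.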
3 changes: 3 additions & 0 deletions packages/core/package.json
@@ -108,6 +108,9 @@
},
"devDependencies": {
"@opentelemetry/api": "1.9.0",
"@opentelemetry/context-async-hooks": "1.30.1",
"@opentelemetry/core": "1.30.1",
"@opentelemetry/sdk-trace-base": "1.30.1",
"@types/debug": "4.1.12",
"@types/node": "catalog:",
"@types/seedrandom": "3.0.8",