feat(supervisor): compute workload manager #3114

Draft
nicktrn wants to merge 18 commits into main from feat/compute-workload-manager

Collaborator

@nicktrn nicktrn commented Feb 23, 2026

Adds the ComputeWorkloadManager for routing task execution through the compute gateway, including full checkpoint/restore support.

Changes

Compute workload manager (apps/supervisor/src/workloadManager/compute.ts)

  • Routes VM create, snapshot, delete, and restore through the compute gateway API
  • Wide event logging on create with full timing and context
  • Configurable gateway timeout, auth token, image digest stripping
  • Restore sends name, env override metadata, CPU and memory so the agent can inject them before the VM resumes

Supervisor wiring (apps/supervisor/src/index.ts)

  • Compute mode activated when gateway URL is configured
  • Restore branch derives a unique runnerId per restore cycle, matching iceman's convention
  • Suspend/restore gated behind snapshots enabled flag

Workload server (apps/supervisor/src/workloadServer/index.ts)

  • Suspend handler triggers a compute snapshot (fire-and-forget) when in compute mode with snapshots enabled
  • Snapshot-complete callback endpoint receives the snapshot ID and calls submitSuspendCompletion

Env validation (apps/supervisor/src/env.ts)

  • Compute gateway URL, auth token, and timeout settings
  • Snapshots enabled flag (defaults off — compute mode can run without checkpoints)
  • Metadata URL required when snapshots enabled (validated at startup)
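The conditional requirement in the list above could be sketched roughly as follows. This is a dependency-free illustration, not the schema in apps/supervisor/src/env.ts (which the review comments suggest uses a zod superRefine). The variable names COMPUTE_GATEWAY_URL, COMPUTE_GATEWAY_AUTH_TOKEN, COMPUTE_SNAPSHOTS_ENABLED and TRIGGER_METADATA_URL appear elsewhere in this thread; `ComputeEnv`, `parseComputeEnv` and the `"true"` string parsing are assumptions.

```typescript
// Hypothetical sketch of the startup check; names other than the env vars are
// invented for illustration.
type ComputeEnv = {
  COMPUTE_GATEWAY_URL: string | undefined;
  COMPUTE_GATEWAY_AUTH_TOKEN: string | undefined;
  COMPUTE_SNAPSHOTS_ENABLED: boolean;
  TRIGGER_METADATA_URL: string | undefined;
};

function parseComputeEnv(raw: Record<string, string | undefined>): ComputeEnv {
  const env: ComputeEnv = {
    COMPUTE_GATEWAY_URL: raw["COMPUTE_GATEWAY_URL"],
    COMPUTE_GATEWAY_AUTH_TOKEN: raw["COMPUTE_GATEWAY_AUTH_TOKEN"],
    // Assumption: the flag is the literal string "true"; anything else is off.
    COMPUTE_SNAPSHOTS_ENABLED: raw["COMPUTE_SNAPSHOTS_ENABLED"] === "true",
    TRIGGER_METADATA_URL: raw["TRIGGER_METADATA_URL"],
  };

  // Fail at startup rather than at the first suspend/restore attempt.
  if (env.COMPUTE_SNAPSHOTS_ENABLED && !env.TRIGGER_METADATA_URL) {
    throw new Error(
      "TRIGGER_METADATA_URL is required when COMPUTE_SNAPSHOTS_ENABLED is true"
    );
  }

  return env;
}
```

With this shape, a config that only sets the gateway URL parses with snapshots off, while enabling snapshots without a metadata URL throws before the supervisor starts handling runs.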

Add a third WorkloadManager implementation that creates sandboxes via
the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch
with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which
takes priority over Kubernetes and Docker providers.

The fetch() call had no timeout, causing infinite hangs when the gateway
accepted requests but never returned responses. Adds AbortSignal.timeout
(30s) and consolidates all logging into a single structured event per
create() call with timing, status, and error context.
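A minimal sketch of the timeout fix described in this commit, assuming the request goes through an injectable fetch-like function so it can be exercised without a gateway. `FetchLike` and `postWithTimeout` are hypothetical names; only `AbortSignal.timeout` and the 30s default come from the commit message.

```typescript
// Hypothetical helper: bounds a gateway POST with AbortSignal.timeout.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal }
) => Promise<{ ok: boolean; status: number }>;

async function postWithTimeout(
  fetchImpl: FetchLike,
  url: string,
  body: unknown,
  timeoutMs = 30_000
): Promise<{ ok: boolean; status?: number; error?: string }> {
  try {
    const response = await fetchImpl(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
      // Aborts the request if no response has arrived after timeoutMs.
      signal: AbortSignal.timeout(timeoutMs),
    });
    return { ok: response.ok, status: response.status };
  } catch (error) {
    // Timeouts and network failures both land here instead of hanging.
    return { ok: false, error: error instanceof Error ? error.message : String(error) };
  }
}
```

The global `fetch` should satisfy `FetchLike`, so a call such as `postWithTimeout(fetch, url, payload)` is one way to use it; the injected parameter is only there to make the behaviour easy to test.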
Emit a single canonical log line in a finally block instead of scattered
log calls at each early return. Adds business context (envId, envType,
orgId, projectId, deploymentVersion, machine) and instanceName to the
event. Always emits at info level with ok=true/false for queryability.
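The pattern in this commit can be sketched as below, under the assumption that the event is a plain object built up during the call and emitted exactly once. `WideEvent`, `runWithWideEvent` and the `emit` callback are hypothetical; the real create() method and logger are not shown in this thread.

```typescript
// Hypothetical sketch of "one canonical log line per call".
type WideEvent = Record<string, unknown> & { ok: boolean; durationMs?: number };

async function runWithWideEvent(
  context: Record<string, unknown>,
  work: (event: WideEvent) => Promise<void>,
  emit: (event: WideEvent) => void
): Promise<void> {
  const startedAt = Date.now();
  // Start pessimistic: ok only flips to true if the work completes.
  const event: WideEvent = { ...context, ok: false };
  try {
    await work(event); // may return early; the finally block still runs
    event.ok = true;
  } catch (error) {
    // Assumption for this sketch: errors are recorded on the event, not rethrown.
    event["error"] = error instanceof Error ? error.message : String(error);
  } finally {
    event.durationMs = Date.now() - startedAt;
    emit(event); // the single emission point, always at the same level
  }
}
```

Because the emit call sits in `finally`, every early return and every thrown error produces the same event shape, which is what makes `ok=true/false` queryable.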
Pass business context (runId, envId, orgId, projectId, machine, etc.)
as metadata on CreateSandboxRequest instead of relying on env vars.
This enables wide event logging in the compute stack without parsing
env or leaking secrets.

Passes machine preset cpu and memory as top-level fields on the
CreateSandboxRequest so the compute stack can use them for admission
control and resource allocation.
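Taken together, the two commits above suggest a request body along these lines. The exact field names and units are assumptions, since the thread only says that business context travels in `metadata` and that cpu and memory are top-level fields. `CreateInstanceRequest`, `CreateOpts` and `buildCreateRequest` are hypothetical; the digest stripping and `name: runnerId` follow other parts of this PR.

```typescript
// Hypothetical request shape, for illustration only.
type CreateInstanceRequest = {
  name: string;
  image: string;
  cpu: number;
  memory: number; // unit is an assumption; the PR does not state it here
  metadata: Record<string, string>;
};

type CreateOpts = {
  runnerId: string;
  image: string;
  machine: { name: string; cpu: number; memory: number };
  runId: string;
  envId: string;
  orgId: string;
  projectId: string;
};

function buildCreateRequest(opts: CreateOpts): CreateInstanceRequest {
  // Strip the digest so the image resolves by tag: "repo:tag@sha256:..." -> "repo:tag".
  const image = opts.image.split("@")[0] ?? opts.image;

  return {
    name: opts.runnerId,
    image,
    // Top-level so the compute stack can use them for admission control.
    cpu: opts.machine.cpu,
    memory: opts.machine.memory,
    // Business context only; no env vars or secrets are copied in.
    metadata: {
      runId: opts.runId,
      envId: opts.envId,
      orgId: opts.orgId,
      projectId: opts.projectId,
      machine: opts.machine.name,
    },
  };
}
```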
Thread timing context from queue consumer through to the compute
workload manager's wide event:

- dequeueResponseMs: platform dequeue HTTP round-trip
- pollingIntervalMs: which polling interval was active (idle vs active)
- warmStartCheckMs: warm start check duration

All fields are optional to avoid breaking existing consumers.

- Fix instance creation URL from /api/sandboxes to /api/instances
- Pass name: runnerId when creating compute instances
- Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager
- Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer
- Handle suspend requests in compute mode via fire-and-forget snapshot with callback
- Handle restore in compute mode by calling gateway restore API directly
- Wire computeManager into WorkloadServer for compute mode suspend/restore

…re request

Restore calls now send a request body with the runner name, env override metadata,
cpu, and memory so the agent can inject them before the VM resumes. The runner
fetches these overrides from TRIGGER_METADATA_URL at restore time.

runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix},
matching iceman's pattern.

Gates snapshot/restore behaviour independently of compute mode.
When disabled, VMs won't receive the metadata URL and suspend/restore
are no-ops. Defaults to off so compute mode can be used without snapshots.

changeset-bot bot commented Feb 23, 2026

🦋 Changeset detected

Latest commit: 7ed9221

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 28 packages
Name Type
trigger.dev Patch
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@trigger.dev/build Patch
@trigger.dev/core Patch
@trigger.dev/python Patch
@trigger.dev/react-hooks Patch
@trigger.dev/redis-worker Patch
@trigger.dev/rsc Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
@internal/sdk-compat-tests Patch

Contributor

coderabbitai bot commented Feb 23, 2026

Walkthrough

Adds a ComputeWorkloadManager and integrates it into the supervisor and workload server to support remote workload creation, snapshot, restore, and deletion via a compute gateway. Introduces compute-related environment vars and conditional validation for snapshots. Extends workload flows to attempt compute-based restore/snapshot and to handle compute snapshot callbacks. Propagates timing context (dequeueResponseMs, pollingIntervalMs, warmStartCheckMs) through queue consumer/session/events/types into workload creation events. Adjusts CLI image output to treat local builds with load differently and adds a changeset marking a patch release.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main change: introduction of compute workload manager for task execution routing through a compute gateway.
Description check | ✅ Passed | The description covers the major changes with clear section headings and implementation details, though some required checklist items from the template are incomplete.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (6)
packages/cli-v3/src/deploy/buildImage.ts (1)

1128-1153: Consider tightening the load check to load === true for contract clarity.

load is typed as boolean | undefined, so the truthy check isLocalBuild && load && !push silently falls through to type=image when load is undefined. In current call sites this is fine (callers always pass a resolved boolean), but the implicit contract is easy to violate if a future caller passes isLocalBuild: true without resolving load first, unexpectedly getting type=image output.

♻️ Proposed clarification
-  if (isLocalBuild && load && !push) {
+  if (isLocalBuild && load === true && !push) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cli-v3/src/deploy/buildImage.ts` around lines 1128 - 1153, The
conditional that chooses "type=docker" relies on a truthy check (isLocalBuild &&
load && !push) which allows load to be undefined; change the guard to explicitly
test load === true so the branch only triggers when load is explicitly true.
Locate the build function parameters (isLocalBuild, load, push, imageTag) and
update the if condition (and any related comment if helpful) so it reads
isLocalBuild && load === true && !push, keeping the imageTag handling and
returned outputOptions logic unchanged.
apps/supervisor/src/env.ts (1)

81-85: Consider validating COMPUTE_GATEWAY_AUTH_TOKEN when COMPUTE_GATEWAY_URL is set.

If the compute gateway always requires authentication, add a superRefine check similar to the snapshots/metadata validation. If unauthenticated gateways are a valid use case (e.g., mTLS or network-level auth), this is fine as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/env.ts` around lines 81 - 85, Add a zod superRefine to
the environment schema to enforce that when COMPUTE_GATEWAY_URL is defined,
COMPUTE_GATEWAY_AUTH_TOKEN must also be provided: update the env schema around
the COMPUTE_GATEWAY_URL and COMPUTE_GATEWAY_AUTH_TOKEN entries to call
superRefine (similar to the snapshots/metadata validation) and add an error on
the token field when URL is set but token is missing; keep behavior unchanged if
URL is undefined to support unauthenticated gateways.
packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (1)

18-18: Consider extracting the timing type to avoid inline duplication.

The { dequeueResponseMs: number; pollingIntervalMs: number } shape is repeated at Line 18 and Line 26 (and similarly in consumerPool.ts and session.ts). A shared named type would reduce repetition across the dequeue chain.

Example: shared type in types.ts
// In types.ts
export type DequeueTiming = {
  dequeueResponseMs: number;
  pollingIntervalMs: number;
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts` at line 18,
Extract the inline timing shape into a shared exported type (e.g., export type
DequeueTiming = { dequeueResponseMs: number; pollingIntervalMs: number; }) in a
central types file, then replace the inline occurrences with that type in the
function signature for onDequeue (currently onDequeue: (messages:
WorkerApiDequeueResponseBody, timing?: { dequeueResponseMs: number;
pollingIntervalMs: number }) => Promise<void>), and update the other files
referencing the same shape (consumerPool.ts and session.ts) to import and use
DequeueTiming to avoid duplication across the dequeue chain.
apps/supervisor/src/index.ts (1)

229-231: Runner ID derivation looks reasonable but is fragile on format assumptions.

message.run.friendlyId.replace("run_", "") uses a plain string replace (only first occurrence), and checkpoint.id.slice(-8) assumes IDs are always ≥8 characters. Both assumptions are likely safe given the platform's ID format, but consider adding a brief comment documenting the expected format, or using a more defensive approach.
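For reference, the derivation under discussion might look like the sketch below, which follows the runner-{runIdShort}-{checkpointSuffix} convention from the PR description. `deriveRunnerId` is a hypothetical helper, and the example IDs are invented. One point worth noting: in JavaScript, `slice(-8)` on a string shorter than eight characters returns the whole string rather than throwing, so only the prefix handling changes behaviour here.

```typescript
// Hypothetical helper illustrating the runnerId derivation.
function deriveRunnerId(runFriendlyId: string, checkpointId: string): string {
  // Strip "run_" only when it is a real prefix; replace("run_", "") would also
  // remove the first occurrence found anywhere in the string.
  const runIdShort = runFriendlyId.startsWith("run_")
    ? runFriendlyId.slice("run_".length)
    : runFriendlyId;

  // slice(-8) clamps at the start of the string, so short IDs are kept whole.
  const checkpointSuffix = checkpointId.slice(-8);

  return `runner-${runIdShort}-${checkpointSuffix}`;
}
```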

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/index.ts` around lines 229 - 231, The runner ID
construction is brittle: replace("run_", "") and slice(-8) assume formats;
update code around runIdShort, checkpointSuffix, and runnerId to be defensive by
(1) deriving runIdShort using a check like
message.run.friendlyId.startsWith("run_") ? message.run.friendlyId.substring(4)
: message.run.friendlyId (or use a regex to extract the suffix) and (2)
computing checkpointSuffix by using checkpoint.id.length >= 8 ?
checkpoint.id.slice(-8) : checkpoint.id (or another safe fallback) so you never
slice past bounds; also add a short comment above the logic documenting the
expected friendlyId and checkpoint.id formats.
apps/supervisor/src/workloadManager/compute.ts (1)

72-78: create() duplicates the authHeaders getter logic — use this.authHeaders instead.

The inline header construction at Lines 72–78 is identical to the private get authHeaders getter defined at Lines 168–176. All other methods (snapshot, deleteInstance, restore) correctly use this.authHeaders.

♻️ Proposed fix
-    const headers: Record<string, string> = {
-      "Content-Type": "application/json",
-    };
-
-    if (this.opts.gatewayAuthToken) {
-      headers["Authorization"] = `Bearer ${this.opts.gatewayAuthToken}`;
-    }
-
     // Strip image digest — resolve by tag, not digest
     const imageRef = opts.image.split("@")[0]!;
 
     const url = `${this.opts.gatewayUrl}/api/instances`;
 
     ...
 
       const [fetchError, response] = await tryCatch(
         fetch(url, {
           method: "POST",
-          headers,
+          headers: this.authHeaders,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadManager/compute.ts` around lines 72 - 78, The
create() method duplicates header construction that already exists in the
private getter authHeaders; replace the inline headers block in create() with a
single reference to this.authHeaders so it reuses the centralized logic (similar
to snapshot, deleteInstance, restore). Locate create() in compute.ts, remove the
manual headers Record<string,string> and the conditional Authorization
assignment, and use this.authHeaders wherever headers are needed to avoid
duplication.
apps/supervisor/src/workloadServer/index.ts (1)

454-499: 200 response is sent only after all network calls complete — gateway retries on slow ops could cause duplicate submissions.

submitSuspendCompletion and deleteInstance are both awaited before reply.empty(200) at Line 499. If either call is slow, the gateway's outbound callback will stall waiting for the acknowledgement; if it has its own timeout/retry policy it will re-deliver the callback, potentially triggering duplicate submitSuspendCompletion calls and a second deleteInstance attempt on the same (possibly already-deleted) instance.

Consider responding 200 immediately after validating the body and processing the callback asynchronously, relying on idempotency guarantees in submitSuspendCompletion, or ensuring submitSuspendCompletion is idempotent for the same (runId, snapshotId) pair.
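One way to picture the suggestion is the sketch below, which acknowledges first and then processes each (runId, snapshotId) pair at most once. Everything here is hypothetical (`SnapshotCallback`, `CallbackDeps`, `createCallbackHandler`, the in-memory `Set`); the real handler, its reply API and the worker client are not shown in this thread.

```typescript
// Hypothetical sketch: ack the callback, then process it once.
type SnapshotCallback = { runId: string; snapshotId: string; instanceId: string };

type CallbackDeps = {
  submitSuspendCompletion: (runId: string, snapshotId: string) => Promise<void>;
  deleteInstance: (instanceId: string) => Promise<void>;
  logError: (message: string, error: unknown) => void;
};

function createCallbackHandler(deps: CallbackDeps) {
  // In-memory dedupe only; it does not survive restarts or span replicas.
  const seen = new Set<string>();

  return function handle(
    body: SnapshotCallback,
    reply: (status: number) => void
  ): Promise<void> {
    reply(200); // acknowledge before any slow network call

    const key = `${body.runId}:${body.snapshotId}`;
    if (seen.has(key)) return Promise.resolve(); // duplicate delivery
    seen.add(key);

    // Returned so tests can await it; the HTTP response does not wait on it.
    return (async () => {
      try {
        await deps.submitSuspendCompletion(body.runId, body.snapshotId);
        await deps.deleteInstance(body.instanceId);
      } catch (error) {
        deps.logError("snapshot callback processing failed", error);
      }
    })();
  };
}
```

The trade-off is that once 200 has been sent the gateway will not redeliver, so a failure in the background work has to be surfaced some other way (here it is only logged).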

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 454 - 499, The
handler currently awaits submitSuspendCompletion and deleteInstance before
calling reply.empty(200), which can block the gateway and cause duplicate
callbacks; change flow to validate the incoming body, call reply.empty(200)
immediately, then process the suspend completion asynchronously (e.g. spawn a
background task/Promise) that calls
this.workerClient.submitSuspendCompletion(runId, snapshotFriendlyId, ...) and
this.computeManager.deleteInstance(body.instance_id) without blocking the
response; ensure the background task wraps calls in try/catch, logs errors,
makes submitSuspendCompletion idempotent for the same (runId, snapshotId) pair,
and makes deleteInstance tolerant of already-deleted instances (or check
existence before deleting) so retries are safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/supervisor/src/index.ts`:
- Around line 226-251: The handler currently logs failures from
computeManager.restore (didRestore === false or caught exceptions in the
try/catch around computeManager.restore) and returns without remediation; update
it to notify the platform before returning by calling the worker session HTTP
API (use this.workerSession.httpClient) to either mark the run failed or request
rescheduling, sending run id (message.run.id/friendlyId), runnerId,
checkpoint.id/location and the error details (or a failure reason when
didRestore is false); invoke this notification in both the didRestore === false
branch and the catch block and swallow/handle any HTTP errors locally so the
handler never silently abandons the run.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 433-501: The endpoint registered at
"/api/v1/compute/snapshot-complete" (handler for ComputeSnapshotCallbackBody)
lacks authentication allowing arbitrary suspend submissions and instance
deletions; fix by validating an immutable callback secret/signature before
processing: when generating callback URLs embed a per-run or global callback
token, then in this handler verify a signed header or Bearer token (e.g.,
X-Callback-Signature or Authorization) and reject requests with reply.empty(401)
if verification fails; additionally, once authenticated, cross-check the
provided body.instance_id against the expected instance for the runId (look up
the canonical instance for runId before calling
this.computeManager?.deleteInstance) and only call computeManager.deleteInstance
after both signature and instance match succeed; keep logging but avoid acting
on unauthenticated or mismatched requests and return appropriate HTTP codes.
- Around line 270-272: The callback URL construction in workloadServer/index.ts
uses a fallback to "localhost" which breaks external gateways; update validation
and usage so TRIGGER_WORKLOAD_API_DOMAIN is required when
COMPUTE_SNAPSHOTS_ENABLED is true (similar to TRIGGER_METADATA_URL), and remove
the silent default to "localhost" in the callbackUrl assembly. Specifically,
add/adjust env.ts validation to mark TRIGGER_WORKLOAD_API_DOMAIN as mandatory
when COMPUTE_SNAPSHOTS_ENABLED is enabled (or throw early), and change the code
that builds callbackUrl (the const callbackUrl) to fail loudly if
TRIGGER_WORKLOAD_API_DOMAIN is missing rather than using "localhost".
- Around line 266-288: Return 202 is sent before awaiting
computeManager.snapshot, and if snapshotResult is false there is no recovery
path; after the existing snapshotResult check in the block that calls
computeManager.snapshot({ runnerId, callbackUrl, metadata: { runId:
params.runFriendlyId, snapshotFriendlyId: params.snapshotFriendlyId }}), call
workerClient.submitSuspendCompletion (or equivalent API) to notify the platform
of the suspend failure including runnerId,
params.runFriendlyId/snapshotFriendlyId and error context, log the failure via
this.logger.error, and ensure the submitSuspendCompletion call handles/awaits
errors so the run can be re-queued or failed cleanly when snapshotResult is
false.
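For the callback authentication item above, a signed-callback check could be sketched as follows. The signing scheme (hex HMAC-SHA256 over the raw body), the shared secret and the helper names `signCallback` and `verifyCallback` are assumptions for illustration; the thread does not say which mechanism the gateway supports. `createHmac` and `timingSafeEqual` are standard `node:crypto` functions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical: the sender signs the raw request body with a shared secret.
function signCallback(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Hypothetical: the receiver recomputes the signature and compares in constant time.
function verifyCallback(
  secret: string,
  rawBody: string,
  signature: string | undefined
): boolean {
  if (!signature) return false;

  const expected = Buffer.from(signCallback(secret, rawBody), "hex");
  const provided = Buffer.from(signature, "hex");

  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === provided.length && timingSafeEqual(expected, provided);
}
```

A handler would call `verifyCallback` before acting on the body and reply 401 when it returns false; the instance cross-check described above would still be needed after that.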

In `@packages/core/src/v3/runEngineWorker/supervisor/session.ts`:
- Around line 83-93: This change touched the public package `@trigger.dev/core`
(method onDequeue in supervisor/session.ts), so add a changeset entry for
`@trigger.dev/core`: run pnpm run changeset:add, select the `@trigger.dev/core`
package, describe the change and bump the appropriate version, then commit the
generated changeset file alongside your PR so the release tooling will include
this package.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a70546 and 5089bba.

📒 Files selected for processing (11)
  • .changeset/fix-local-build-load.md
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
{packages,integrations}/**/*

📄 CodeRabbit inference engine (CLAUDE.md)

Add a changeset when modifying any public package in packages/* or integrations/* using pnpm run changeset:add

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
🧠 Learnings (9)
📚 Learning: 2026-01-15T11:50:06.067Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-15T11:50:06.067Z
Learning: Applies to {packages,integrations}/**/* : Add a changeset when modifying any public package in `packages/*` or `integrations/*` using `pnpm run changeset:add`

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Use build extensions in trigger.config.ts (additionalFiles, additionalPackages, aptGet, prismaExtension, etc.) to customize the build

Applied to files:

  • .changeset/fix-local-build-load.md
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure build process in trigger.config.ts using `build` object with external packages, extensions, and JSX settings

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Run `npx trigger.dev@latest dev` to start the Trigger.dev development server

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-14T19:24:39.536Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 2685
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.settings/route.tsx:1234-1257
Timestamp: 2025-11-14T19:24:39.536Z
Learning: In the trigger.dev project, version validation for the `useNativeBuildServer` setting cannot be performed at the settings form level because the SDK version is only known at build/deployment time, not when saving project settings.

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-06-04T16:02:22.957Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2150
File: apps/supervisor/src/workloadManager/docker.ts:115-115
Timestamp: 2025-06-04T16:02:22.957Z
Learning: In the Trigger.dev codebase, the supervisor component uses DOCKER_ENFORCE_MACHINE_PRESETS while the docker provider component uses ENFORCE_MACHINE_PRESETS. These are separate components with separate environment variable configurations for the same logical concept of enforcing machine presets.

Applied to files:

  • apps/supervisor/src/env.ts
🧬 Code graph analysis (4)
packages/core/src/v3/runEngineWorker/supervisor/session.ts (1)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
  • WorkerApiDequeueResponseBody (73-73)
  • WorkerApiDequeueResponseBody (74-74)
apps/supervisor/src/index.ts (3)
apps/supervisor/src/workloadManager/compute.ts (1)
  • ComputeWorkloadManager (17-311)
packages/core/src/v3/serverOnly/checkpointClient.ts (1)
  • CheckpointClient (16-122)
apps/supervisor/src/env.ts (1)
  • env (164-164)
packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (2)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
  • WorkerApiDequeueResponseBody (73-73)
  • WorkerApiDequeueResponseBody (74-74)
packages/core/src/v3/runEngineWorker/supervisor/types.ts (1)
  • PreDequeueFn (12-15)
apps/supervisor/src/workloadManager/compute.ts (2)
apps/supervisor/src/workloadManager/types.ts (3)
  • WorkloadManagerOptions (3-14)
  • WorkloadManager (16-18)
  • WorkloadManagerCreateOptions (20-42)
apps/supervisor/src/util.ts (1)
  • getRunnerId (17-25)
🔇 Additional comments (13)
packages/cli-v3/src/deploy/buildImage.ts (2)

201-209: LGTM.

isLocalBuild: false correctly gates the new type=docker path out of remote builds. Since push: true is hardcoded and depot uses --save, the load option is intentionally a no-op here.


535-543: LGTM.

load and push are already resolved to boolean via shouldLoad/shouldPush, so the truthy check in getOutputOptions is reliable here.

.changeset/fix-local-build-load.md (1)

1-5: LGTM!

Changeset correctly documents a patch-level fix for the --load flag issue.

apps/supervisor/src/env.ts (1)

154-162: Good conditional validation.

The superRefine correctly enforces TRIGGER_METADATA_URL as a dependency when COMPUTE_SNAPSHOTS_ENABLED is true. The error is scoped to the right path for clear diagnostics.
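
As a rough illustration of the dependency check described above, here is a dependency-free sketch. It does not use Zod; the `EnvInput` and `Issue` types and the `checkSnapshotEnv` function are hypothetical names, and only the two environment variable names come from the PR.

```typescript
// Hypothetical, simplified view of the validated env object.
type EnvInput = {
  COMPUTE_SNAPSHOTS_ENABLED: boolean;
  TRIGGER_METADATA_URL?: string;
};

// Hypothetical issue shape, loosely modelled on a validation issue with a path.
type Issue = { path: string[]; message: string };

// Report an issue scoped to the missing variable when snapshots are enabled.
function checkSnapshotEnv(data: EnvInput): Issue[] {
  const issues: Issue[] = [];
  if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.TRIGGER_METADATA_URL) {
    issues.push({
      path: ["TRIGGER_METADATA_URL"],
      message: "TRIGGER_METADATA_URL is required when COMPUTE_SNAPSHOTS_ENABLED is true",
    });
  }
  return issues;
}

const missingUrl = checkSnapshotEnv({ COMPUTE_SNAPSHOTS_ENABLED: true });
const snapshotsOff = checkSnapshotEnv({ COMPUTE_SNAPSHOTS_ENABLED: false });
```

Scoping the issue to the missing variable's path is what lets the startup error point at the setting that needs to change.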

packages/core/src/v3/runEngineWorker/supervisor/events.ts (1)

9-10: LGTM!

Optional timing fields are cleanly added to the event type, maintaining backward compatibility.

apps/supervisor/src/workloadManager/types.ts (1)

27-30: LGTM! Clean addition of timing context fields for observability.

Nit: The file uses interface throughout, but coding guidelines prefer type. Since this is pre-existing, not worth changing in this PR.

packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts (1)

354-359: LGTM!

Timing parameter is correctly threaded through the consumer pool's interceptor to the original handler.

apps/supervisor/src/index.ts (2)

282-284: Good use of performance.now() for high-resolution timing.

Clean pattern for measuring warm-start duration.
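
For readers unfamiliar with the pattern, a minimal sketch of duration measurement with `performance.now()` follows. The `timed` helper is hypothetical and is not taken from the PR; it only assumes a runtime where `performance` is available as a global.

```typescript
// Hypothetical helper: run a synchronous function and report elapsed milliseconds.
function timed<T>(fn: () => T): { value: T; durationMs: number } {
  const start = performance.now(); // monotonic, sub-millisecond resolution
  const value = fn();
  const durationMs = performance.now() - start;
  return { value, durationMs };
}

const warmStart = timed(() => 21 * 2);
```

Because `performance.now()` is monotonic, the measured duration is not affected by wall-clock adjustments, unlike a difference of two `Date.now()` readings.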


83-98: Compute mode initialization is clean.

The triple assignment pattern (create → store as computeManager → assign as workloadManager) is clear. The isComputeMode flag derived from the URL's presence is a good single source of truth.

packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (2)

116-127: LGTM! Timing instrumentation is correctly scoped.

dequeueResponseMs measures only the HTTP call, and pollingIntervalMs captures the interval that scheduled this dequeue cycle. The lastScheduledIntervalMs tracking across scheduleNextDequeue → next dequeue() call is sound.


33-33: lastScheduledIntervalMs tracking is correct.

Initialized to idleIntervalMs (first poll starts from idle state), then updated in scheduleNextDequeue before the callback fires. The value read in the next dequeue() cycle accurately reflects the delay used.

Also applies to: 42-42, 148-148
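
The tracking described above can be sketched roughly as follows. This is a simplified stand-in, not the code in queueConsumer.ts: the `IntervalTracker` class and its method bodies are hypothetical, and only the names `lastScheduledIntervalMs`, `idleIntervalMs`, `pollingIntervalMs`, `scheduleNextDequeue`, and `dequeue` come from the review.

```typescript
// Hypothetical stand-in for the consumer's interval bookkeeping.
class IntervalTracker {
  private lastScheduledIntervalMs: number;

  constructor(idleIntervalMs: number) {
    // The first poll is treated as starting from the idle state.
    this.lastScheduledIntervalMs = idleIntervalMs;
  }

  // Record the delay at scheduling time, before the timer callback would fire.
  scheduleNextDequeue(delayMs: number): void {
    this.lastScheduledIntervalMs = delayMs;
  }

  // Read back the delay that led to the current dequeue cycle.
  dequeue(): { pollingIntervalMs: number } {
    return { pollingIntervalMs: this.lastScheduledIntervalMs };
  }
}

const tracker = new IntervalTracker(1000);
const firstCycle = tracker.dequeue();
tracker.scheduleNextDequeue(250);
const secondCycle = tracker.dequeue();
```

Here the first cycle reports the idle interval and the second reports the delay recorded by the preceding `scheduleNextDequeue` call.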

apps/supervisor/src/workloadManager/compute.ts (2)

120-120: Memory field name and unit differ between create and restore — verify gateway API contracts.

  • create (Line 120): memory_gb: opts.machine.memory — sends machine memory as-is, using field name memory_gb
  • restore (Line 273): memory_mb: opts.machine.memory * 1024 — multiplies by 1024, using field name memory_mb

If opts.machine.memory is in GiB (e.g., 2 = 2 GiB), the create endpoint receives 2 while the restore endpoint receives 2048. Confirm this asymmetry is intentional per the gateway's API schemas for /api/instances vs /api/snapshots/{id}/restore.

Also applies to: 273-273
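
To make the asymmetry concrete, here is a small sketch of the two payload fields. It assumes `machine.memory` is expressed in GiB, which is not confirmed by the PR; the `MachinePreset` type and both builder functions are hypothetical, and only the field names `memory_gb` and `memory_mb` come from the lines quoted above.

```typescript
// Hypothetical machine preset; memory is ASSUMED to be in GiB.
type MachinePreset = { memory: number };

// Create path: value is passed through unchanged under memory_gb.
function buildCreateMemory(machine: MachinePreset): { memory_gb: number } {
  return { memory_gb: machine.memory };
}

// Restore path: value is multiplied by 1024 and sent under memory_mb.
function buildRestoreMemory(machine: MachinePreset): { memory_mb: number } {
  return { memory_mb: machine.memory * 1024 };
}

const preset: MachinePreset = { memory: 2 };
const createMemory = buildCreateMemory(preset);
const restoreMemory = buildRestoreMemory(preset);
```

Under the GiB assumption both payloads describe the same amount of memory, so the difference is only in the unit each endpoint expects.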


9-9: No action required. The import is correct.

import { tryCatch } from "@trigger.dev/core" is a valid import. The root index.ts explicitly re-exports tryCatch from utils.ts, making it an intentional root-level export. This does not violate the subpath-only guideline.

…nabled

Remove the silent `localhost` fallback for the snapshot callback URL,
which would be unreachable from external compute gateways. Add env
validation and a runtime guard matching the existing metadata URL pattern.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
apps/supervisor/src/env.ts (1)

154-169: Good cross-field validation — consider also requiring COMPUTE_GATEWAY_URL.

The superRefine correctly gates TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN behind COMPUTE_SNAPSHOTS_ENABLED. However, COMPUTE_GATEWAY_URL is not validated here. If someone sets COMPUTE_SNAPSHOTS_ENABLED=true without COMPUTE_GATEWAY_URL, the env validation passes (requiring the metadata/domain vars) but no ComputeWorkloadManager is instantiated — so snapshots never actually work, with no warning at startup.

Adding a third check would make the fail-fast behavior consistent:

Suggested addition
   .superRefine((data, ctx) => {
+    if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.COMPUTE_GATEWAY_URL) {
+      ctx.addIssue({
+        code: z.ZodIssueCode.custom,
+        message: "COMPUTE_GATEWAY_URL is required when COMPUTE_SNAPSHOTS_ENABLED is true",
+        path: ["COMPUTE_GATEWAY_URL"],
+      });
+    }
     if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.TRIGGER_METADATA_URL) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/env.ts` around lines 154 - 169, The superRefine validator
currently requires TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN when
COMPUTE_SNAPSHOTS_ENABLED is true but misses COMPUTE_GATEWAY_URL; add a third
conditional inside superRefine checking if data.COMPUTE_SNAPSHOTS_ENABLED &&
!data.COMPUTE_GATEWAY_URL and call ctx.addIssue with code:
z.ZodIssueCode.custom, a message like "COMPUTE_GATEWAY_URL is required when
COMPUTE_SNAPSHOTS_ENABLED is true", and path: ["COMPUTE_GATEWAY_URL"] so
validation fails fast and surfaces the missing gateway variable.
apps/supervisor/src/workloadServer/index.ts (1)

439-507: Callback handler logic is sound; one defensive improvement to consider.

The completed/failed branching, metadata validation, and suspend-completion submission all look correct. One thing to note: if submitSuspendCompletion succeeds but deleteInstance fails (Line 478), the failure is silently swallowed — deleteInstance returns a boolean and logs internally, but the callback handler doesn't log the deletion outcome. This could make debugging orphaned instances harder.

Optional: log deletion failure
-            await this.computeManager?.deleteInstance(body.instance_id);
+            const deleted = await this.computeManager?.deleteInstance(body.instance_id);
+            if (deleted === false) {
+              this.logger.warn("Failed to delete compute instance after suspend completion", {
+                runId,
+                instanceId: body.instance_id,
+              });
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 439 - 507, The
handler currently calls await
this.computeManager?.deleteInstance(body.instance_id) without checking the
boolean result; update the completed branch after submitSuspendCompletion to
capture the deleteOutcome = await
this.computeManager?.deleteInstance(body.instance_id) and then log a
warning/error via this.logger.warn or this.logger.error if deleteOutcome is
falsy (include runId, instanceId, snapshotFriendlyId and any contextual info),
otherwise log successful deletion—this ensures failures in
computeManager.deleteInstance are visible; reference submitSuspendCompletion,
deleteInstance, computeManager, and this.logger when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 266-273: The error response in the branch where
this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED and
env.TRIGGER_WORKLOAD_API_DOMAIN is missing returns { error: "..." } which
doesn't match WorkloadSuspendRunResponseBody; change the reply.json call in that
branch (the block referencing this.computeManager,
env.TRIGGER_WORKLOAD_API_DOMAIN and reply.json) to return the same shape as the
other handlers: { ok: false, error: "TRIGGER_WORKLOAD_API_DOMAIN is not set,
cannot create snapshot callback URL" } (or equivalent message) so the response
conforms to WorkloadSuspendRunResponseBody and the runner can parse it.


ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5089bba and 7ed9221.

📒 Files selected for processing (2)
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Cloudflare Workers
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
🧠 Learnings (4)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:44.997Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
📚 Learning: 2026-02-23T12:56:44.997Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:44.997Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

Applied to files:

  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
🧬 Code graph analysis (2)
apps/supervisor/src/workloadServer/index.ts (2)
apps/supervisor/src/workloadManager/compute.ts (1)
  • ComputeWorkloadManager (17-311)
packages/core/src/v3/runEngineWorker/workload/schemas.ts (2)
  • WorkloadSuspendRunResponseBody (22-30)
  • WorkloadSuspendRunResponseBody (31-31)
apps/supervisor/src/env.ts (2)
packages/core/src/v3/utils/crypto.ts (1)
  • randomUUID (1-5)
apps/supervisor/src/envUtil.ts (2)
  • BoolEnv (15-17)
  • AdditionalEnvVars (19-47)
🔇 Additional comments (4)
apps/supervisor/src/env.ts (1)

81-85: LGTM — new compute settings are well-structured.

Clean schema additions with sensible defaults (30_000 ms timeout, snapshots off by default).

apps/supervisor/src/workloadServer/index.ts (3)

27-27: LGTM — type-only import for ComputeWorkloadManager.

Correctly uses import type to avoid runtime dependency on the compute module.


57-63: LGTM — ComputeSnapshotCallbackBody schema is well-defined.

Uses z.object with appropriate optional fields and a bounded string enum for status.


275-294: Fire-and-forget pattern looks correct for the compute snapshot path.

The 202 is returned before the async snapshot() call, and the failure is logged. Recovery for failed snapshots is acknowledged as a future enhancement per prior discussion.
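
A minimal sketch of the fire-and-forget shape described above, under stated assumptions: `takeSnapshot`, `snapshotAndLog`, `handleSuspend`, the `Reply` type, and the `{ ok: true }` body are all hypothetical stand-ins rather than the handler's real code. The `pending` promise is returned only so the example can be observed; a real handler would not need to expose it.

```typescript
type Reply = { status: number; body: unknown };

// Collected log messages, standing in for a logger.
const loggedErrors: string[] = [];

// Hypothetical stand-in for the compute manager's snapshot call.
async function takeSnapshot(shouldFail: boolean): Promise<boolean> {
  if (shouldFail) throw new Error("gateway unavailable");
  return true;
}

// Run the snapshot and log any failure instead of letting it reject unhandled.
async function snapshotAndLog(shouldFail: boolean): Promise<void> {
  try {
    await takeSnapshot(shouldFail);
  } catch (error) {
    loggedErrors.push(error instanceof Error ? error.message : String(error));
  }
}

function handleSuspend(shouldFail: boolean): { reply: Reply; pending: Promise<void> } {
  // Build the 202 response first; the caller is not kept waiting.
  const reply: Reply = { status: 202, body: { ok: true } };
  // Start the snapshot without awaiting it.
  const pending = snapshotAndLog(shouldFail);
  return { reply, pending };
}

const suspendResult = handleSuspend(true);
```

The point of the sketch is that the failure is caught inside the detached task, so the response is already decided by the time the snapshot outcome is known.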

Comment on lines +266 to +273
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
  if (!env.TRIGGER_WORKLOAD_API_DOMAIN) {
    this.logger.error(
      "TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL"
    );
    reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
    return;
  }

⚠️ Potential issue | 🟡 Minor

Response at Line 271 doesn't conform to WorkloadSuspendRunResponseBody.

The other error responses in this handler (Lines 255–262, 297–304) correctly return { ok: false, error: "..." } satisfies WorkloadSuspendRunResponseBody. This branch returns { error: "..." } without the ok discriminator, so the runner may fail to parse the response.

Proposed fix
-              reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
+              reply.json(
+                {
+                  ok: false,
+                  error: "Snapshot callbacks not configured",
+                } satisfies WorkloadSuspendRunResponseBody,
+                false,
+                500
+              );
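
To illustrate why the missing discriminator matters, here is a hedged sketch of a consumer that narrows on `ok`. The `SuspendResponse` type and `parseSuspendResponse` function are hypothetical; the success variant is an assumption, and the actual schema in packages/core may differ.

```typescript
// ASSUMED shape: a union discriminated on `ok`.
type SuspendResponse = { ok: true } | { ok: false; error: string };

// Hypothetical parser: returns undefined when the body does not match either variant.
function parseSuspendResponse(body: unknown): SuspendResponse | undefined {
  if (typeof body !== "object" || body === null) return undefined;
  const candidate = body as { ok?: unknown; error?: unknown };
  if (candidate.ok === true) return { ok: true };
  if (candidate.ok === false && typeof candidate.error === "string") {
    return { ok: false, error: candidate.error };
  }
  return undefined;
}

const withoutOk = parseSuspendResponse({ error: "Snapshot callbacks not configured" });
const withOk = parseSuspendResponse({ ok: false, error: "Snapshot callbacks not configured" });
```

In this sketch the body without `ok` fails to parse, while the body with `ok: false` is recognised as an error response.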
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
   if (!env.TRIGGER_WORKLOAD_API_DOMAIN) {
     this.logger.error(
       "TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL"
     );
-    reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
+    reply.json(
+      {
+        ok: false,
+        error: "Snapshot callbacks not configured",
+      } satisfies WorkloadSuspendRunResponseBody,
+      false,
+      500
+    );
     return;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 266 - 273, The
error response in the branch where this.computeManager &&
env.COMPUTE_SNAPSHOTS_ENABLED and env.TRIGGER_WORKLOAD_API_DOMAIN is missing
returns { error: "..." } which doesn't match WorkloadSuspendRunResponseBody;
change the reply.json call in that branch (the block referencing
this.computeManager, env.TRIGGER_WORKLOAD_API_DOMAIN and reply.json) to return
the same shape as the other handlers: { ok: false, error:
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL" }
(or equivalent message) so the response conforms to
WorkloadSuspendRunResponseBody and the runner can parse it.
