✨ feat: add --no-sandbox to run without the gVisor runtime#14
Conversation
agentd unconditionally required `nix` and `runsc` (gVisor) at startup and treated a failed tmux start as fatal, so it could not come up anywhere those are unavailable — e.g. a minimal capstan microVM where the VM boundary is the sandbox and there is no nix/gVisor or sudo/agent user. `--no-sandbox` skips the runsc/nix initialization; sandboxed agents then fail to launch (guarded), while the API and native/tmux agents are unaffected. tmux startup becomes best-effort and non-fatal for the same reason: agentd still serves its control-plane API, and native agents simply can't launch until a tmux server and the agent user are available. Both behaviors are opt-in — default startup is unchanged. Part of PCC-777.
|
| Filename | Overview |
|---|---|
| agentd/agentd.go | Adds noSandbox field and SetNoSandbox() method; gates gVisor/nix init on the flag; makes tmux startup non-fatal unconditionally (not limited to --no-sandbox). Sandboxed agent launch is correctly guarded by a nil-runner check in createManager. |
| agentd/agentd_test.go | Adds a minimal no-panic spec for SetNoSandbox, consistent with the existing test style in the file. |
| main.go | Adds --no-sandbox flag wired to daemon.SetNoSandbox(true); placement and conditional are consistent with other flags. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[agentd Run] --> B{noSandbox?}
B -- yes --> C[Log: sandbox disabled\nskip runsc + nix checks]
B -- no --> D[LookPath nix]
D --> E[sandbox.NewRunner]
E --> F[runner.Cleanup]
C --> G[tmux.Start]
F --> G
G -- success --> H[API server start]
G -- failure --> I[Log warning\nnative agents unavailable]
I --> H
H --> J[reconcileLoop]
J --> K{agent type?}
K -- sandboxed + runner==nil --> L[Error: runner not initialized]
K -- sandboxed + runner ok --> M[sandbox.NewManager]
K -- native --> N[native.NewManager\nwith tmux ref]
M --> O[mgr.Start]
N --> O
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[agentd Run] --> B{noSandbox?}
B -- yes --> C[Log: sandbox disabled\nskip runsc + nix checks]
B -- no --> D[LookPath nix]
D --> E[sandbox.NewRunner]
E --> F[runner.Cleanup]
C --> G[tmux.Start]
F --> G
G -- success --> H[API server start]
G -- failure --> I[Log warning\nnative agents unavailable]
I --> H
H --> J[reconcileLoop]
J --> K{agent type?}
K -- sandboxed + runner==nil --> L[Error: runner not initialized]
K -- sandboxed + runner ok --> M[sandbox.NewManager]
K -- native --> N[native.NewManager\nwith tmux ref]
M --> O[mgr.Start]
N --> O
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
agentd/agentd.go:273-280
**Unconditional tmux non-fatal change affects all deployments**
The tmux startup failure is now silently downgraded for every agentd invocation, not only when `--no-sandbox` is active. In a standard stereOS deployment where `type = "native"` agents are configured, a broken agent user or missing tmux binary no longer causes a visible startup failure — agentd starts, the reconcile loop runs, and every native agent logs a per-cycle `"error starting agent"` error indefinitely, while the root cause (tmux never started) is only printed once at boot. Previously this was a hard startup failure via `return fmt.Errorf(...)`.
Consider gating the non-fatal path on `d.noSandbox`, so standard sandboxed deployments retain the original fail-fast behavior and only the capstan control-plane path is lenient.
```suggestion
if err := d.tmux.Start(); err != nil {
if d.noSandbox {
// Non-fatal: agentd still serves its API and status. Native (tmux)
// agents need a working tmux server + the agent user, so they cannot
// launch until that's available; sandboxed agents are unaffected. This
// lets agentd run in a minimal environment (e.g. a capstan VM with no
// sudo/agent user) for control-plane bring-up.
log.Printf("agentd: warning: tmux server unavailable (%v); native agents cannot launch", err)
} else {
return fmt.Errorf("starting tmux server: %w", err)
}
}
```
Reviews (1): Last reviewed commit: "✨ feat: add --no-sandbox to run without ..." | Re-trigger Greptile
| if err := d.tmux.Start(); err != nil { | ||
| return fmt.Errorf("starting tmux server: %w", err) | ||
| // Non-fatal: agentd still serves its API and status. Native (tmux) | ||
| // agents need a working tmux server + the agent user, so they cannot | ||
| // launch until that's available; sandboxed agents are unaffected. This | ||
| // lets agentd run in a minimal environment (e.g. a capstan VM with no | ||
| // sudo/agent user) for control-plane bring-up. | ||
| log.Printf("agentd: warning: tmux server unavailable (%v); native agents cannot launch", err) | ||
| } |
There was a problem hiding this comment.
Unconditional tmux non-fatal change affects all deployments
The tmux startup failure is now silently downgraded for every agentd invocation, not only when --no-sandbox is active. In a standard stereOS deployment where type = "native" agents are configured, a broken agent user or missing tmux binary no longer causes a visible startup failure — agentd starts, the reconcile loop runs, and every native agent logs a per-cycle "error starting agent" error indefinitely, while the root cause (tmux never started) is only printed once at boot. Previously this was a hard startup failure via return fmt.Errorf(...).
Consider gating the non-fatal path on d.noSandbox, so standard sandboxed deployments retain the original fail-fast behavior and only the capstan control-plane path is lenient.
| if err := d.tmux.Start(); err != nil { | |
| return fmt.Errorf("starting tmux server: %w", err) | |
| // Non-fatal: agentd still serves its API and status. Native (tmux) | |
| // agents need a working tmux server + the agent user, so they cannot | |
| // launch until that's available; sandboxed agents are unaffected. This | |
| // lets agentd run in a minimal environment (e.g. a capstan VM with no | |
| // sudo/agent user) for control-plane bring-up. | |
| log.Printf("agentd: warning: tmux server unavailable (%v); native agents cannot launch", err) | |
| } | |
| if err := d.tmux.Start(); err != nil { | |
| if d.noSandbox { | |
| // Non-fatal: agentd still serves its API and status. Native (tmux) | |
| // agents need a working tmux server + the agent user, so they cannot | |
| // launch until that's available; sandboxed agents are unaffected. This | |
| // lets agentd run in a minimal environment (e.g. a capstan VM with no | |
| // sudo/agent user) for control-plane bring-up. | |
| log.Printf("agentd: warning: tmux server unavailable (%v); native agents cannot launch", err) | |
| } else { | |
| return fmt.Errorf("starting tmux server: %w", err) | |
| } | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentd/agentd.go
Line: 273-280
Comment:
**Unconditional tmux non-fatal change affects all deployments**
The tmux startup failure is now silently downgraded for every agentd invocation, not only when `--no-sandbox` is active. In a standard stereOS deployment where `type = "native"` agents are configured, a broken agent user or missing tmux binary no longer causes a visible startup failure — agentd starts, the reconcile loop runs, and every native agent logs a per-cycle `"error starting agent"` error indefinitely, while the root cause (tmux never started) is only printed once at boot. Previously this was a hard startup failure via `return fmt.Errorf(...)`.
Consider gating the non-fatal path on `d.noSandbox`, so standard sandboxed deployments retain the original fail-fast behavior and only the capstan control-plane path is lenient.
```suggestion
if err := d.tmux.Start(); err != nil {
if d.noSandbox {
// Non-fatal: agentd still serves its API and status. Native (tmux)
// agents need a working tmux server + the agent user, so they cannot
// launch until that's available; sandboxed agents are unaffected. This
// lets agentd run in a minimal environment (e.g. a capstan VM with no
// sudo/agent user) for control-plane bring-up.
log.Printf("agentd: warning: tmux server unavailable (%v); native agents cannot launch", err)
} else {
return fmt.Errorf("starting tmux server: %w", err)
}
}
```
How can I resolve this? If you propose a fix, please make it concise.The daggerverse ghcontrib module now requires dagger v0.20.8, so the PR checks (title conformance, Linear magic word) error out under the pinned v0.20.6 with "module requires dagger v0.20.8, but you have v0.20.6".
Summary
agentd unconditionally required
nixandrunsc(gVisor) at startup and treateda failed tmux start as fatal, so it could not come up where those are
unavailable — e.g. a minimal capstan microVM where the VM boundary is the
sandbox (no nix/gVisor, no sudo/agent user).
--no-sandboxskips the runsc/nix initialization. Sandboxed agents then failto launch (guarded); the API and native/tmux agents are unaffected.
control-plane API, and native agents simply can't launch until a tmux server
and the agent user are available.
Both are opt-in — default startup is unchanged.
Context: lets agentd come up for control-plane bring-up inside a capstan microVM
(a minimal Rust PID 1 for microVM sandboxes).
Test plan
gofmt/make lint(go vet ./...) cleanmake test— all suites pass; added a spec forSetNoSandbox--no-sandbox(nonix/gVisor present)
Part of PCC-777.