feat(scripts): add upgrade-sandbox.sh and non-interactive inference hydration #902
HagegeR wants to merge 1 commit into NVIDIA:main
Conversation
📝 Walkthrough
Populate the non-interactive onboard provider/model from the sandbox registry, tighten NVIDIA credential resolution, stop persisting inference sandbox metadata, distinguish core vs. optional workspace artifacts in backup/restore (skipping missing optional items), and add a new `scripts/upgrade-sandbox.sh` helper.
Sequence Diagram(s)
sequenceDiagram
participant User
participant Upgrade as "upgrade-sandbox.sh"
participant Backup as "backup-workspace.sh"
participant Openshell
participant Nemoclaw as "nemoclaw CLI"
participant Registry
User->>Upgrade: run upgrade-sandbox.sh [--full-data] [--run-onboard] <sandbox>
Upgrade->>Backup: backup <sandbox>
Backup->>Openshell: download workspace files
Openshell-->>Backup: files + timestamp
Backup-->>Upgrade: backup complete (TS)
alt --full-data
Upgrade->>Openshell: download /sandbox/.openclaw-data/
Openshell-->>Upgrade: data snapshot stored
end
alt --run-onboard
Upgrade->>Nemoclaw: nemoclaw onboard (non-interactive)
Nemoclaw->>Registry: hydrateNonInteractiveInferenceFromRegistry()
Registry-->>Nemoclaw: provider & model (if available)
Nemoclaw->>Openshell: rebuild sandbox image
Openshell-->>Nemoclaw: rebuild complete
Nemoclaw-->>Upgrade: onboard complete
alt restore
Upgrade->>Backup: restore <sandbox> <TS>
Backup->>Openshell: upload backed-up files
Openshell-->>Backup: restore complete
Backup-->>Upgrade: restore done
end
else
Upgrade-->>User: print next-step restore command
end
Upgrade-->>User: upgrade complete
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
bin/lib/onboard.js (1)
Lines 1686-1705: ⚠️ Potential issue | 🟠 Major: Reuse saved credentials for all non-interactive remote providers.
This branch now falls back to `getCredential("NVIDIA_API_KEY")`, but the non-interactive path for `openai`, `anthropic`, `gemini`, and the compatible endpoints still aborts unless the key is already exported in the environment. Users who previously onboarded interactively will usually have those keys only in `~/.nemoclaw/credentials.json`, so `scripts/upgrade-sandbox.sh --run-onboard` still fails for those providers.
💡 Suggested fix

```diff
 } else {
   if (isNonInteractive()) {
-    if (!process.env[credentialEnv]) {
+    const key = getCredential(credentialEnv);
+    if (!key) {
       console.error(`  ${credentialEnv} is required for ${remoteConfig.label} in non-interactive mode.`);
       process.exit(1);
     }
+    process.env[credentialEnv] = key;
   } else {
     await ensureNamedCredential(credentialEnv, remoteConfig.label + " API key", remoteConfig.helpUrl);
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1686 - 1705, The non-interactive branch for other remote providers currently aborts if process.env[credentialEnv] is missing instead of reusing a saved key from getCredential("..."); update the non-interactive check in the else branch (the block using isNonInteractive(), credentialEnv, and remoteConfig.label) to first call getCredential(credentialEnv) and if that returns a key set process.env[credentialEnv] = key and continue; only call process.exit(1) if neither the environment nor getCredential provides a key; keep the interactive path using ensureNamedCredential unchanged.
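The fallback pattern this comment asks for (exported env var first, then the saved credential store, and only then a hard failure) can be sketched in shell. Here `load_saved_credential`, the flat `KEY=value` file format, and the sample value are all hypothetical stand-ins for the JavaScript `getCredential()` helper and `~/.nemoclaw/credentials.json`:

```shell
#!/bin/sh
# Hypothetical stand-in for the real credential store; the actual
# project keeps credentials in a JSON file read by getCredential().
CRED_FILE="$(mktemp)"
printf 'NVIDIA_API_KEY=saved-key\n' > "$CRED_FILE"

load_saved_credential() {
  grep "^$1=" "$CRED_FILE" | head -n1 | cut -d= -f2-
}

# Prefer the exported env var; fall back to the saved store;
# fail only when neither source has the key.
resolve_credential() {
  name="$1"
  eval "val=\${$name:-}"
  if [ -z "$val" ]; then
    val="$(load_saved_credential "$name")"
  fi
  [ -n "$val" ] || return 1
  printf '%s\n' "$val"
}

unset NVIDIA_API_KEY
KEY="$(resolve_credential NVIDIA_API_KEY)"
echo "resolved: $KEY"
```

The same resolution order is what the suggested JS fix implements: the environment wins, the persisted credential fills the gap, and `process.exit(1)` fires only when both are empty.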
🧹 Nitpick comments (2)
docs/workspace/backup-restore.md (2)
Line 101: Split this into one sentence per source line.
Line 101 packs two sentences onto one line.
As per coding guidelines, "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/workspace/backup-restore.md` at line 101, Split the combined sentence into two separate source lines so each sentence is on its own line: put "Before you change the sandbox image (for example after editing `Dockerfile`) or run `nemoclaw onboard`, back up." on its own line and place "Optional files such as `MEMORY.md` or `memory/` may not exist yet; the backup script skips them with a warning." on the following line to satisfy the one-sentence-per-line guideline.
Line 112: Drop the routine bolding and break the sentences apart.
**automatically** is not a UI label or warning, and this line currently contains multiple sentences. LLM pattern detected. As per coding guidelines, "Bold is reserved for UI labels, parameter names, and genuine warnings." and "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/workspace/backup-restore.md` at line 112, Remove the bold formatting around "automatically" and split the existing single-line paragraph into separate sentences each on its own line: describe that `--run-onboard` runs `nemoclaw onboard` from the repo root (or `NEMOCLAW_REPO_ROOT`), state that after a successful onboard the script runs `backup-workspace.sh restore` for the backup timestamp created at the start (remove bolding around "automatically"), and add a separate sentence that `--no-restore` skips that restore so you get a fresh workspace after the rebuild; ensure each sentence is on its own line.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 2246-2247: The registry entry for new sandboxes only stores {
model, provider } so hydrateNonInteractiveInferenceFromRegistry() cannot restore
NEMOCLAW_ENDPOINT_URL required by setupNim() for
compatible-endpoint/compatible-anthropic-endpoint; update the logic after
createSandbox in onboard.js (where createSandbox is called and
registry.updateSandbox is invoked) to also persist the sandbox's endpoint URL
(e.g., a field like nemoclawEndpoint or endpointUrl) and then ensure
hydrateNonInteractiveInferenceFromRegistry() sets
process.env.NEMOCLAW_ENDPOINT_URL from that stored field so setupNim() can run
non-interactively with compatible endpoints.
In `@scripts/backup-workspace.sh`:
- Around line 74-95: The loop currently treats any non-zero exit from
sandbox_download as a benign skip; change logic so core items (FILES_CORE loop
using sandbox_download ... 0) treat any non-zero as fatal: log the
sandbox_download stderr (use warn or processLogger equivalent) and exit non-zero
immediately (do not increment count) so partial backups fail; for optional items
(FILES_OPTIONAL and DIRS_OPTIONAL using sandbox_download ... 1) call
sandbox_download and inspect its exit code/output: if it explicitly indicates
"not found" keep the current quiet skip message and do not increment count, but
if it fails for any other reason surface a clear error (echo/warn with stderr),
mark the backup as failed (set a failure flag or exit non-zero at end) and do
not increment count; update references to sandbox_download, FILES_CORE,
FILES_OPTIONAL, DIRS_OPTIONAL, count, warn, and the optional message branches
accordingly so only true "not found" remains quiet and all other errors are
propagated.
In `@scripts/upgrade-sandbox.sh`:
- Around line 94-100: The current TS lookup is a TOCTOU race (using ls on
BACKUP_BASE) — instead have the backup command itself emit the created
timestamp/path and consume that output instead of rescanning; modify the code
that runs backup-workspace.sh backup to capture its stdout (or have
backup-workspace.sh write the created backup path to a known file) and assign
that value to TS, then validate [ -n "$TS" ] || fail and set
DEST="${BACKUP_BASE}/${TS}" as before; reference the TS variable and the
backup-workspace.sh invocation so you replace the ls-based discovery with the
backup command’s explicit output.
---
Outside diff comments:
In `@bin/lib/onboard.js`:
- Around line 1686-1705: The non-interactive branch for other remote providers
currently aborts if process.env[credentialEnv] is missing instead of reusing a
saved key from getCredential("..."); update the non-interactive check in the
else branch (the block using isNonInteractive(), credentialEnv, and
remoteConfig.label) to first call getCredential(credentialEnv) and if that
returns a key set process.env[credentialEnv] = key and continue; only call
process.exit(1) if neither the environment nor getCredential provides a key;
keep the interactive path using ensureNamedCredential unchanged.
---
Nitpick comments:
In `@docs/workspace/backup-restore.md`:
- Line 101: Split the combined sentence into two separate source lines so each
sentence is on its own line: put "Before you change the sandbox image (for
example after editing `Dockerfile`) or run `nemoclaw onboard`, back up." on its
own line and place "Optional files such as `MEMORY.md` or `memory/` may not
exist yet; the backup script skips them with a warning." on the following line
to satisfy the one-sentence-per-line guideline.
- Line 112: Remove the bold formatting around "automatically" and split the
existing single-line paragraph into separate sentences each on its own line:
describe that `--run-onboard` runs `nemoclaw onboard` from the repo root (or
`NEMOCLAW_REPO_ROOT`), state that after a successful onboard the script runs
`backup-workspace.sh restore` for the backup timestamp created at the start
(remove bolding around "automatically"), and add a separate sentence that
`--no-restore` skips that restore so you get a fresh workspace after the
rebuild; ensure each sentence is on its own line.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4025feaf-c8d0-4a4b-9040-562d2f2e53a3
📒 Files selected for processing (4)
- bin/lib/onboard.js
- docs/workspace/backup-restore.md
- scripts/backup-workspace.sh
- scripts/upgrade-sandbox.sh
bin/lib/onboard.js
Outdated
```js
const sandboxName = await createSandbox(gpu, model, provider, preferredInferenceApi);
registry.updateSandbox(sandboxName, { model, provider });
```
Persist compatible endpoint URLs with the sandbox metadata.
hydrateNonInteractiveInferenceFromRegistry() can only reconstruct what gets stored here. Right now this writes { model, provider }, but setupNim() still requires NEMOCLAW_ENDPOINT_URL for compatible-endpoint and compatible-anthropic-endpoint, so non-interactive upgrades of those sandboxes still exit even though provider/model hydration succeeds.
💡 Suggested fix

```diff
-registry.updateSandbox(sandboxName, { model, provider });
+registry.updateSandbox(sandboxName, {
+  model,
+  provider,
+  endpointUrl:
+    provider === "compatible-endpoint" || provider === "compatible-anthropic-endpoint"
+      ? endpointUrl
+      : undefined,
+});
```

You’ll also need to hydrate `process.env.NEMOCLAW_ENDPOINT_URL` from that stored field.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 2246 - 2247, The registry entry for new
sandboxes only stores { model, provider } so
hydrateNonInteractiveInferenceFromRegistry() cannot restore
NEMOCLAW_ENDPOINT_URL required by setupNim() for
compatible-endpoint/compatible-anthropic-endpoint; update the logic after
createSandbox in onboard.js (where createSandbox is called and
registry.updateSandbox is invoked) to also persist the sandbox's endpoint URL
(e.g., a field like nemoclawEndpoint or endpointUrl) and then ensure
hydrateNonInteractiveInferenceFromRegistry() sets
process.env.NEMOCLAW_ENDPOINT_URL from that stored field so setupNim() can run
non-interactively with compatible endpoints.
```sh
for f in "${FILES_CORE[@]}"; do
  if sandbox_download "$sandbox" "${WORKSPACE_PATH}/${f}" "${dest}/" 0; then
    count=$((count + 1))
  else
    warn "Skipped ${f} (not found or download failed)"
  fi
done

for f in "${FILES_OPTIONAL[@]}"; do
  if sandbox_download "$sandbox" "${WORKSPACE_PATH}/${f}" "${dest}/" 1; then
    count=$((count + 1))
  else
    echo -e "${DIM}[backup]${NC} Optional ${f} not in sandbox — skipped (normal until created)."
  fi
done

for d in "${DIRS_OPTIONAL[@]}"; do
  if sandbox_download "$sandbox" "${WORKSPACE_PATH}/${d}/" "${dest}/${d}/" 1; then
    count=$((count + 1))
  else
    echo -e "${DIM}[backup]${NC} Optional ${d}/ not in sandbox — skipped (normal until created)."
  fi
done
```
Don't report a partial backup as success.
Every non-zero openshell sandbox download is currently treated as a skip. That means a transient gateway/download failure can leave you with a partial backup that still looks successful, which is risky before destroy or upgrade-sandbox.sh. Core artifacts should fail the run, and optional artifacts should only be downgraded when the remote path is actually absent.
💡 Minimum hardening for the core files

```diff
 for f in "${FILES_CORE[@]}"; do
   if sandbox_download "$sandbox" "${WORKSPACE_PATH}/${f}" "${dest}/" 0; then
     count=$((count + 1))
   else
-    warn "Skipped ${f} (not found or download failed)"
+    fail "Required workspace file ${f} was not backed up."
   fi
 done
```

For the optional paths, keep the quiet UX only for the specific "not found yet" case and surface other download errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/backup-workspace.sh` around lines 74 - 95, The loop currently treats
any non-zero exit from sandbox_download as a benign skip; change logic so core
items (FILES_CORE loop using sandbox_download ... 0) treat any non-zero as
fatal: log the sandbox_download stderr (use warn or processLogger equivalent)
and exit non-zero immediately (do not increment count) so partial backups fail;
for optional items (FILES_OPTIONAL and DIRS_OPTIONAL using sandbox_download ...
1) call sandbox_download and inspect its exit code/output: if it explicitly
indicates "not found" keep the current quiet skip message and do not increment
count, but if it fails for any other reason surface a clear error (echo/warn
with stderr), mark the backup as failed (set a failure flag or exit non-zero at
end) and do not increment count; update references to sandbox_download,
FILES_CORE, FILES_OPTIONAL, DIRS_OPTIONAL, count, warn, and the optional message
branches accordingly so only true "not found" remains quiet and all other errors
are propagated.
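The split the prompt describes (quiet skip only for a true "not found", everything else surfaced and the run marked failed) can be sketched like this. `download_item` and its exit code 44 are hypothetical stand-ins for `sandbox_download` and whatever distinct "not found" signal it can report; `TOOLS.md` is an invented file name:

```shell
#!/bin/sh
# Stand-in for sandbox_download: 44 means "remote path not found",
# any other non-zero status is a real download error.
download_item() {
  case "$1" in
    MEMORY.md) return 44 ;;  # optional file that does not exist yet
    TOOLS.md)  return 1  ;;  # simulated gateway failure
    *)         return 0  ;;
  esac
}

count=0
backup_failed=0
for f in AGENTS.md MEMORY.md TOOLS.md; do
  if download_item "$f"; then
    count=$((count + 1))
  else
    rc=$?   # exit status of download_item, captured in the else branch
    if [ "$rc" -eq 44 ]; then
      echo "[backup] optional $f not in sandbox; skipped."
    else
      echo "[backup] ERROR downloading $f (exit $rc)" >&2
      backup_failed=1
    fi
  fi
done
[ "$backup_failed" -eq 0 ] || echo "backup incomplete" >&2
```

Only the explicit not-found status stays quiet; the simulated gateway failure sets the failure flag, so a partial backup can no longer report success.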
```sh
TS=""
if [ -d "$BACKUP_BASE" ]; then
  # shellcheck disable=SC2012 # backup dirs are YYYYMMDD-HHMMSS — no special chars
  TS="$(ls -1t "$BACKUP_BASE" 2>/dev/null | head -n1 || true)"
fi
[ -n "$TS" ] || fail "No backup directory found under ${BACKUP_BASE}/"
DEST="${BACKUP_BASE}/${TS}"
```
Avoid the TOCTOU lookup for the backup timestamp.
This rescans ~/.nemoclaw/backups after the backup command finishes. If another backup lands between Line 92 and Line 97, TS can point at the wrong directory and the post-onboard restore uploads a different sandbox's workspace into this one. Please capture the timestamp/path from backup-workspace.sh backup itself instead of asking "what is latest now?".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/upgrade-sandbox.sh` around lines 94 - 100, The current TS lookup is a
TOCTOU race (using ls on BACKUP_BASE) — instead have the backup command itself
emit the created timestamp/path and consume that output instead of rescanning;
modify the code that runs backup-workspace.sh backup to capture its stdout (or
have backup-workspace.sh write the created backup path to a known file) and
assign that value to TS, then validate [ -n "$TS" ] || fail and set
DEST="${BACKUP_BASE}/${TS}" as before; reference the TS variable and the
backup-workspace.sh invocation so you replace the ls-based discovery with the
backup command’s explicit output.
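The emit-and-capture pattern this prompt describes can be sketched as follows. `do_backup` and the temporary `BACKUP_BASE` are stand-ins for `backup-workspace.sh backup` and `~/.nemoclaw/backups`:

```shell
#!/bin/sh
# Race-free sketch: the backup step prints the timestamp it created,
# and the caller captures that output instead of running `ls -1t`
# over the backup directory afterwards.
BACKUP_BASE="$(mktemp -d)"

do_backup() {
  ts="$(date +%Y%m%d-%H%M%S)"
  mkdir -p "${BACKUP_BASE}/${ts}"
  printf '%s\n' "$ts"   # the only line on stdout: the created timestamp
}

TS="$(do_backup)"       # deterministic: no race with concurrent backups
[ -n "$TS" ] || { echo "backup failed" >&2; exit 1; }
DEST="${BACKUP_BASE}/${TS}"
[ -d "$DEST" ] && echo "restoring from $DEST"
```

Because `TS` comes from the producer itself, a second backup landing in `BACKUP_BASE` between the two steps can no longer redirect the later restore.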
Force-pushed 11f856d to 79dd891
Actionable comments posted: 1
♻️ Duplicate comments (2)
bin/lib/onboard.js (1)
Lines 1229-1239: ⚠️ Potential issue | 🟠 Major: Endpoint URL not hydrated for compatible-endpoint providers.
Even after fixing the model/provider persistence issue, `hydrateNonInteractiveInferenceFromRegistry()` doesn't restore `NEMOCLAW_ENDPOINT_URL`. Non-interactive upgrades of sandboxes using `compatible-endpoint` or `compatible-anthropic-endpoint` will still fail at lines 1667-1681, where the endpoint URL is required.
The registry entry would need to persist `endpointUrl` alongside model/provider, and hydration would need to set `process.env.NEMOCLAW_ENDPOINT_URL` from that stored field.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1229 - 1239, hydrateNonInteractiveInferenceFromRegistry currently infers and sets NEMOCLAW_PROVIDER from the registry entry (using inferRemoteProviderKeyFromStoredName and entry.provider) but does not restore endpoint information for compatible-endpoint providers; update the registry handling to read the stored endpoint field (e.g., entry.endpointUrl) and, when a compatible-endpoint or compatible-anthropic-endpoint provider is inferred, set process.env.NEMOCLAW_ENDPOINT_URL to that value and log it similarly to the provider note; also ensure the registry persistence logic saves endpointUrl alongside model/provider so hydrateNonInteractiveInferenceFromRegistry can read it.
scripts/upgrade-sandbox.sh (1)
Lines 94-101: ⚠️ Potential issue | 🟡 Minor: TOCTOU race when discovering the backup timestamp.
This rescans `~/.nemoclaw/backups` after `backup-workspace.sh backup` finishes. If another backup completes between line 92 and line 97, `TS` could point to the wrong directory.
Consider having `backup-workspace.sh backup` emit the created timestamp to stdout, then capture it:

```diff
-bash "$BACKUP_WORKSPACE" backup "$SANDBOX"
-
-TS=""
-if [ -d "$BACKUP_BASE" ]; then
-  # shellcheck disable=SC2012 # backup dirs are YYYYMMDD-HHMMSS — no special chars
-  TS="$(ls -1t "$BACKUP_BASE" 2>/dev/null | head -n1 || true)"
-fi
+TS="$(bash "$BACKUP_WORKSPACE" backup "$SANDBOX" | tail -n1)"
+# Assuming backup-workspace.sh outputs the timestamp on the last line
```

Alternatively, source `backup-workspace.sh` and call `latest_backup_timestamp()` to avoid duplicating the logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/upgrade-sandbox.sh` around lines 94 - 101, The code currently discovers the latest backup by re-listing BACKUP_BASE into TS which creates a TOCTOU race; instead have the producer script emit the exact timestamp and consume that instead (e.g. run backup-workspace.sh backup and capture its timestamp stdout into TS) or source backup-workspace.sh and call its latest_backup_timestamp() function to get a deterministic timestamp; update the block that sets TS and DEST to use the captured/returned timestamp (referencing TS, BACKUP_BASE, DEST and the backup-workspace.sh backup invocation or latest_backup_timestamp()) and remove the ls-based discovery.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1217-1245: hydrateNonInteractiveInferenceFromRegistry fails
because registerSandbox never persisted model/provider and
setupInference/createSandbox never call registry.updateSandbox, so
entry.model/provider remain null; fix by persisting the inference config after
sandbox creation or inside setupInference: call
registry.updateSandbox(sandboxName, { model: model /*or modelName var*/,
provider: providerKey /*or provider var*/ }) (and include nimContainer if
applicable) so hydrateNonInteractiveInferenceFromRegistry can read entry.model
and entry.provider; update the call sites (createSandbox/ setupInference) to
pass the correct sandboxName and values.
---
Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1229-1239: hydrateNonInteractiveInferenceFromRegistry currently
infers and sets NEMOCLAW_PROVIDER from the registry entry (using
inferRemoteProviderKeyFromStoredName and entry.provider) but does not restore
endpoint information for compatible-endpoint providers; update the registry
handling to read the stored endpoint field (e.g., entry.endpointUrl) and, when a
compatible-endpoint or compatible-anthropic-endpoint provider is inferred, set
process.env.NEMOCLAW_ENDPOINT_URL to that value and log it similarly to the
provider note; also ensure the registry persistence logic saves endpointUrl
alongside model/provider so hydrateNonInteractiveInferenceFromRegistry can read
it.
In `@scripts/upgrade-sandbox.sh`:
- Around line 94-101: The code currently discovers the latest backup by
re-listing BACKUP_BASE into TS which creates a TOCTOU race; instead have the
producer script emit the exact timestamp and consume that instead (e.g. run
backup-workspace.sh backup and capture its timestamp stdout into TS) or source
backup-workspace.sh and call its latest_backup_timestamp() function to get a
deterministic timestamp; update the block that sets TS and DEST to use the
captured/returned timestamp (referencing TS, BACKUP_BASE, DEST and the
backup-workspace.sh backup invocation or latest_backup_timestamp()) and remove
the ls-based discovery.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 74526e07-c1c6-48fb-ac7e-ebd521ecbb1f
📒 Files selected for processing (4)
- bin/lib/onboard.js
- docs/workspace/backup-restore.md
- scripts/backup-workspace.sh
- scripts/upgrade-sandbox.sh
✅ Files skipped from review due to trivial changes (1)
- docs/workspace/backup-restore.md
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/backup-workspace.sh
```js
function hydrateNonInteractiveInferenceFromRegistry() {
  if (!isNonInteractive()) return;
  const name = (process.env.NEMOCLAW_SANDBOX_NAME || "").trim();
  if (!name) return;
  let entry = registry.getSandbox(name);
  if (!entry || !entry.model) {
    entry = registry.getSandbox(GATEWAY_NAME);
  }
  if (!entry || !entry.model) return;

  const providerKey = (process.env.NEMOCLAW_PROVIDER || "").trim();
  const modelEnv = (process.env.NEMOCLAW_MODEL || "").trim();
  if (!providerKey && entry.provider) {
    let inferred = inferRemoteProviderKeyFromStoredName(entry.provider);
    if (entry.provider === "vllm-local" && entry.nimContainer) {
      inferred = "nim-local";
    } else if (entry.provider === "vllm-local" && !entry.nimContainer) {
      inferred = "vllm";
    }
    if (inferred) {
      process.env.NEMOCLAW_PROVIDER = inferred;
      note(`  [non-interactive] NEMOCLAW_PROVIDER (from registry) → ${inferred}`);
    }
  }
  if (!modelEnv && entry.model) {
    process.env.NEMOCLAW_MODEL = entry.model;
    note(`  [non-interactive] NEMOCLAW_MODEL (from registry) → ${entry.model}`);
  }
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that model/provider are not being written to registry anywhere
echo "=== Searching for registry.updateSandbox calls ==="
rg -n "registry\.updateSandbox" --type=js
echo ""
echo "=== Searching for registerSandbox calls to see what's passed ==="
rg -n -A5 "registry\.registerSandbox" --type=js
echo ""
echo "=== Check if model/provider appear in any updateSandbox call ==="
rg -n "updateSandbox.*model|updateSandbox.*provider" --type=js

Repository: NVIDIA/NemoClaw
Length of output: 5357
🏁 Script executed:
#!/bin/bash
# 1. Check the exact line 1225 to see what the early return condition is
echo "=== Lines around 1225 in onboard.js to see the exit condition ==="
sed -n '1220,1230p' bin/lib/onboard.js
echo ""
echo "=== Check the full hydrateNonInteractiveInferenceFromRegistry function ==="
sed -n '1217,1245p' bin/lib/onboard.js
echo ""
echo "=== Search for setupInference function ==="
rg -n -A20 "function setupInference|const setupInference" --type=js bin/lib/onboard.js
echo ""
echo "=== Check GATEWAY_NAME and see if it has model/provider ==="
rg -n "GATEWAY_NAME" --type=js bin/lib/onboard.js | head -20
echo ""
echo "=== Check where GATEWAY_NAME is registered/defined ==="
rg -n "registerSandbox.*GATEWAY_NAME|GATEWAY_NAME.*registerSandbox" --type=js

Repository: NVIDIA/NemoClaw
Length of output: 4398
🏁 Script executed:
#!/bin/bash
# Check if GATEWAY_NAME is ever registered with model/provider
echo "=== Search for registerSandbox with GATEWAY_NAME ==="
rg -n "registerSandbox.*nemoclaw|registerSandbox.*GATEWAY" --type=js
echo ""
echo "=== Full setupInference function to see if it calls updateSandbox ==="
sed -n '1975,2050p' bin/lib/onboard.js
echo ""
echo "=== Search for any updateSandbox calls after setupInference ==="
rg -n -B5 -A5 "updateSandbox.*model|updateSandbox.*provider" --type=js bin/lib/onboard.js

Repository: NVIDIA/NemoClaw
Length of output: 3655
🏁 Script executed:
#!/bin/bash
# Check if hydrateNonInteractiveInferenceFromRegistry is actually called
echo "=== Search for calls to hydrateNonInteractiveInferenceFromRegistry ==="
rg -n "hydrateNonInteractiveInferenceFromRegistry" --type=js
echo ""
echo "=== Check the context where it's called ==="
rg -n -B10 -A3 "hydrateNonInteractiveInferenceFromRegistry()" --type=js bin/lib/onboard.js
echo ""
echo "=== Check if model/provider are ever stored in the registry at all ==="
rg -n "\.model\s*=|\.provider\s*=" --type=js bin/lib/registry.js | head -20
echo ""
echo "=== Check registry.js to see the getSandbox implementation ==="
sed -n '1,150p' bin/lib/registry.js

Repository: NVIDIA/NemoClaw
Length of output: 4050
🏁 Script executed:
#!/bin/bash
# Final verification: check if GATEWAY_NAME entry exists and has model/provider
echo "=== Check if there's any special initialization for GATEWAY_NAME in registry ==="
rg -n "nemoclaw|GATEWAY_NAME" --type=js bin/lib/registry.js
echo ""
echo "=== Check the setupNim flow after hydrateNonInteractiveInferenceFromRegistry ==="
sed -n '1570,1620p' bin/lib/onboard.js
echo ""
echo "=== Verify: does setupNim ever call updateSandbox after setupInference? ==="
sed -n '1570,1750p' bin/lib/onboard.js | grep -n "updateSandbox\|setupInference"

Repository: NVIDIA/NemoClaw
Length of output: 2591
🏁 Script executed:
#!/bin/bash
# Check the complete setupNim and related flow to confirm model/provider are never persisted
echo "=== Search for ALL updateSandbox calls in onboard.js ==="
rg -n "updateSandbox" bin/lib/onboard.js
echo ""
echo "=== Search for registerSandbox calls to confirm parameters ==="
rg -n -A4 "registry\.registerSandbox" bin/lib/onboard.js
echo ""
echo "=== Check if model/provider are set on entry object anywhere ==="
rg -n "entry\.model\s*=|entry\.provider\s*=" bin/lib/onboard.js

Repository: NVIDIA/NemoClaw
Length of output: 564
Model and provider are never persisted to the registry, so hydration will always fail.
This function tries to read `entry.model` and `entry.provider` from the registry, but these values are never written:

- `registerSandbox()` (lines 1559-1562) only passes `{ name, gpuEnabled }`, leaving `model` and `provider` as `null`.
- The only `updateSandbox()` call in onboard.js (line 2248) updates `{ nimContainer }` only; `setupInference()` never persists model/provider to the registry.

The function will always exit early at line 1225 because `entry.model` will be null. The fallback to `GATEWAY_NAME` won't help, since that entry is never registered with inference configuration either.
To make hydration work, add a `registry.updateSandbox()` call in `setupInference` or immediately after `createSandbox`:

```js
// After sandbox creation or in setupInference:
registry.updateSandbox(sandboxName, { model, provider });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 1217 - 1245,
hydrateNonInteractiveInferenceFromRegistry fails because registerSandbox never
persisted model/provider and setupInference/createSandbox never call
registry.updateSandbox, so entry.model/provider remain null; fix by persisting
the inference config after sandbox creation or inside setupInference: call
registry.updateSandbox(sandboxName, { model: model /*or modelName var*/,
provider: providerKey /*or provider var*/ }) (and include nimContainer if
applicable) so hydrateNonInteractiveInferenceFromRegistry can read entry.model
and entry.provider; update the call sites (createSandbox/ setupInference) to
pass the correct sandboxName and values.
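The persist-then-hydrate flow the prompt describes can be sketched in shell. The flat registry file, the helper names, and the sample values are all hypothetical; the real registry is JSON managed by `bin/lib/registry.js`:

```shell
#!/bin/sh
# Hypothetical registry record written at onboard time, read at upgrade time.
REG="$(mktemp)"

persist_inference() {
  # Stand-in for registry.updateSandbox(sandboxName, { model, provider })
  printf 'model=%s\nprovider=%s\n' "$1" "$2" > "$REG"
}

hydrate_inference() {
  # Stand-in for hydrateNonInteractiveInferenceFromRegistry():
  # only fill env vars that are currently unset.
  [ -n "${NEMOCLAW_MODEL:-}" ] || NEMOCLAW_MODEL="$(grep '^model=' "$REG" | cut -d= -f2-)"
  [ -n "${NEMOCLAW_PROVIDER:-}" ] || NEMOCLAW_PROVIDER="$(grep '^provider=' "$REG" | cut -d= -f2-)"
}

persist_inference "example-model" "nim-local"   # at onboard time
unset NEMOCLAW_MODEL NEMOCLAW_PROVIDER
hydrate_inference                               # at upgrade time
echo "$NEMOCLAW_PROVIDER / $NEMOCLAW_MODEL"
```

The point of the sketch is the ordering: hydration can only ever recover what the creation path persisted, which is why the review insists on the `registry.updateSandbox()` call.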
Force-pushed 79dd891 to f6c060d
Actionable comments posted: 3
♻️ Duplicate comments (4)
scripts/backup-workspace.sh (1)
Lines 45-55: ⚠️ Potential issue | 🟠 Major: Don't downgrade real backup failures to "skipped".
Because `sandbox_download()` only returns success/failure, lines 75-95 still treat every non-zero `openshell sandbox download` as a benign skip. That means a transient failure on a core file can still yield a "successful" backup, while optional paths mislabel real gateway/download errors as "not in sandbox." `FILES_CORE` should fail the run, and the quiet path should only apply when the remote path is definitively missing.
Also applies to: lines 74-95
scripts/upgrade-sandbox.sh (1)
Lines 91-100: ⚠️ Potential issue | 🟠 Major: Capture the created backup timestamp instead of rescanning `~/.nemoclaw/backups`.
Line 92 completes the backup, then lines 95-99 ask "what is latest now?". If another backup lands in between, `TS` can point at the wrong directory and the later restore uploads a different workspace into this sandbox. Have `backup-workspace.sh backup` emit the timestamp/path it just created and consume that directly here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/upgrade-sandbox.sh` around lines 91 - 100, The current flow runs bash "$BACKUP_WORKSPACE" backup "$SANDBOX" then rescans BACKUP_BASE to set TS/DEST which can pick up a race; modify the script so the backup command (BACKUP_WORKSPACE) emits the exact timestamp or path it created (e.g., print the created directory on stdout), capture that output into TS (replace the ls/rescan logic) and validate it before setting DEST, and keep using TS/DEST downstream; update variables and error handling around the call to BACKUP_WORKSPACE, TS, and DEST to rely on the captured value instead of rescanning.
bin/lib/onboard.js (2)
Lines 1239-1247: ⚠️ Potential issue | 🟠 Major: Persist `model` and `provider` before relying on registry hydration.
Line 1244 assumes the sandbox entry already carries inference metadata, but sandboxes created by the current flow are still registered with only `{ name, gpuEnabled }`, and `bin/lib/registry.js:41-56` initializes `model`/`provider` to `null`. For those entries, this helper falls through, and non-interactive upgrades revert to the default build/model instead of the previous selection.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1239 - 1247, hydrateNonInteractiveInferenceFromRegistry fails when registry entries are created without inference metadata because it assumes entry.model/provider are already set; update the sandbox registration flow to persist chosen inference metadata (model and provider) into the registry entry at creation time (the code path that calls registry.getSandbox / registry.add or equivalent), and ensure hydrateNonInteractiveInferenceFromRegistry reads and uses entry.model and entry.provider when present (or sets them on the registry entry) instead of falling back to defaults; reference the hydrateNonInteractiveInferenceFromRegistry helper and the registry.getSandbox/registry.add (or the sandbox registration function) to locate the changes.
**1251-1265:** ⚠️ Potential issue | 🟠 Major

**Hydrate compatible endpoint URLs too.**

This restores `NEMOCLAW_PROVIDER` and `NEMOCLAW_MODEL`, but `setupNim()` still requires `NEMOCLAW_ENDPOINT_URL` for `compatible-endpoint` and `compatible-anthropic-endpoint`. Non-interactive upgrades of those sandboxes will still fail here unless the endpoint URL is persisted and rehydrated alongside the provider metadata.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1251 - 1265, The current non-interactive rehydration sets NEMOCLAW_PROVIDER and NEMOCLAW_MODEL from entry but misses restoring endpoint URLs, causing setupNim() to fail for providers like "compatible-endpoint" and "compatible-anthropic-endpoint"; update the hydration logic (near inferRemoteProviderKeyFromStoredName and the block that sets NEMOCLAW_MODEL) to also check entry for persisted endpoint fields (e.g., entry.endpointUrl, entry.endpoint_url, or whichever key your registry uses) and, when present, set process.env.NEMOCLAW_ENDPOINT_URL (and any provider-specific variants) and emit a note() similar to the other rehydrated vars so setupNim() can run non-interactively.
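A minimal sketch of the endpoint rehydration the prompt asks for. The entry field names checked here (`endpointUrl`, `endpoint_url`) are guesses at possible registry schema keys, as the prompt itself notes ("whichever key your registry uses"):

```javascript
// Hypothetical sketch: restore the endpoint URL alongside provider/model so
// setupNim() can run non-interactively for compatible-endpoint providers.
function hydrateEndpointFromEntry(entry, env) {
  // Accept either naming convention; both keys are assumptions.
  const url = entry.endpointUrl ?? entry.endpoint_url ?? null;
  if (!url) return false; // nothing persisted; caller keeps its current behavior
  env.NEMOCLAW_ENDPOINT_URL = url;
  return true;
}
```

In the real code path this would sit next to the block that sets `NEMOCLAW_MODEL`, with a `note()` emitted when the value is rehydrated, mirroring the other restored variables.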
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1761-1767: The non-interactive onboarding path only falls back to
~/.nemoclaw/credentials.json for NVIDIA via getCredential("NVIDIA_API_KEY");
mirror that behavior for other remote providers by attempting to read
getCredential(credentialEnv) before exiting when !process.env[credentialEnv]; if
a key is returned, set process.env[credentialEnv] = key so the later checks
succeed. Update the code paths that currently do if
(!process.env[credentialEnv]) { ... process.exit(1) } to first call
getCredential(credentialEnv) and assign it into process.env when present,
referencing the getCredential function and the credentialEnv variable used in
the surrounding logic.
In `@scripts/backup-workspace.sh`:
- Around line 132-139: The loop handling FILES_CORE currently logs a warning on
an openshell sandbox upload failure and continues; change this to fail fast by
exiting with a non-zero status so the restore aborts on any core file upload
error. Specifically, inside the for loop that iterates FILES_CORE and calls
openshell sandbox upload "$sandbox" "${src}/${f}" "${WORKSPACE_PATH}/", replace
the warn "Failed to restore ${f}" behavior with an immediate exit (e.g., call
exit 1 or a dedicated fatal/error function) so count isn't incremented and
upgrade-sandbox.sh cannot complete with partial core restores.
In `@scripts/upgrade-sandbox.sh`:
- Around line 103-109: When FULL_DATA is requested, the script must not proceed
if the openshell sandbox download of /sandbox/.openclaw-data/ into
"${DEST}/openclaw-data/" fails; change the logic around the openshell sandbox
download (the block using FULL_DATA, mkdir -p "${DEST}/openclaw-data", and the
openshell command) so you only create the destination directory after a
successful download or save to a temporary location and move it on success, and
on download failure exit non‑zero (do not just warn) to stop subsequent steps
like the "nemoclaw onboard" sequence; also ensure the later restore hint
printing (around the existing restore-hint lines) is suppressed when the
download failed so you don't show misleading restore instructions for an empty
or absent snapshot.
---
Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1239-1247: hydrateNonInteractiveInferenceFromRegistry fails when
registry entries are created without inference metadata because it assumes
entry.model/provider are already set; update the sandbox registration flow to
persist chosen inference metadata (model and provider) into the registry entry
at creation time (the code path that calls registry.getSandbox / registry.add or
equivalent), and ensure hydrateNonInteractiveInferenceFromRegistry reads and
uses entry.model and entry.provider when present (or sets them on the registry
entry) instead of falling back to defaults; reference the
hydrateNonInteractiveInferenceFromRegistry helper and the
registry.getSandbox/registry.add (or the sandbox registration function) to
locate the changes.
- Around line 1251-1265: The current non-interactive rehydration sets
NEMOCLAW_PROVIDER and NEMOCLAW_MODEL from entry but misses restoring endpoint
URLs, causing setupNim() to fail for providers like "compatible-endpoint" and
"compatible-anthropic-endpoint"; update the hydration logic (near
inferRemoteProviderKeyFromStoredName and the block that sets NEMOCLAW_MODEL) to
also check entry for persisted endpoint fields (e.g., entry.endpointUrl,
entry.endpoint_url, or whichever key your registry uses) and, when present, set
process.env.NEMOCLAW_ENDPOINT_URL (and any provider-specific variants) and emit
a note() similar to the other rehydrated vars so setupNim() can run
non-interactively.
In `@scripts/upgrade-sandbox.sh`:
- Around line 91-100: The current flow runs bash "$BACKUP_WORKSPACE" backup
"$SANDBOX" then rescans BACKUP_BASE to set TS/DEST which can pick up a race;
modify the script so the backup command (BACKUP_WORKSPACE) emits the exact
timestamp or path it created (e.g., print the created directory on stdout),
capture that output into TS (replace the ls/rescan logic) and validate it before
setting DEST, and keep using TS/DEST downstream; update variables and error
handling around the call to BACKUP_WORKSPACE, TS, and DEST to rely on the
captured value instead of rescanning.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c11f4466-0a2f-45ae-bc93-2cbe616e4ce6
📒 Files selected for processing (4)
- bin/lib/onboard.js
- docs/workspace/backup-restore.md
- scripts/backup-workspace.sh
- scripts/upgrade-sandbox.sh
✅ Files skipped from review due to trivial changes (1)
- docs/workspace/backup-restore.md
```js
const key = getCredential("NVIDIA_API_KEY");
if (!key) {
  console.error("  NVIDIA_API_KEY is required for NVIDIA Endpoints in non-interactive mode.");
  console.error("  Set it in the environment or ~/.nemoclaw/credentials.json, or run interactive onboard.");
  process.exit(1);
}
process.env.NVIDIA_API_KEY = key;
```
**Mirror the credential fallback for the other remote providers.**

This fixes `NVIDIA_API_KEY`, but the adjacent non-build path still exits on `!process.env[credentialEnv]`. Saved `OPENAI_API_KEY`/`ANTHROPIC_API_KEY`/`GEMINI_API_KEY`/compatible-endpoint credentials in `~/.nemoclaw/credentials.json` are still ignored in non-interactive mode, so those upgrades keep failing even when the credential already exists.
Minimal follow-up

```diff
-  if (isNonInteractive()) {
-    if (!process.env[credentialEnv]) {
+  if (isNonInteractive()) {
+    const key = getCredential(credentialEnv);
+    if (!key) {
       console.error(`  ${credentialEnv} is required for ${remoteConfig.label} in non-interactive mode.`);
       process.exit(1);
     }
+    process.env[credentialEnv] = key;
   } else {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 1761 - 1767, The non-interactive onboarding
path only falls back to ~/.nemoclaw/credentials.json for NVIDIA via
getCredential("NVIDIA_API_KEY"); mirror that behavior for other remote providers
by attempting to read getCredential(credentialEnv) before exiting when
!process.env[credentialEnv]; if a key is returned, set
process.env[credentialEnv] = key so the later checks succeed. Update the code
paths that currently do if (!process.env[credentialEnv]) { ... process.exit(1) }
to first call getCredential(credentialEnv) and assign it into process.env when
present, referencing the getCredential function and the credentialEnv variable
used in the surrounding logic.
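The mirrored fallback can be sketched like this. Note this is a self-contained illustration: `store` stands in for whatever `getCredential()` reads from `~/.nemoclaw/credentials.json`, and the real function in `onboard.js` may behave differently:

```javascript
// Hypothetical sketch: consult the credential store before giving up in
// non-interactive mode, mirroring the NVIDIA_API_KEY path for every provider.
function resolveCredential(env, credentialEnv, store) {
  if (env[credentialEnv]) return env[credentialEnv]; // env var wins
  const key = store[credentialEnv];                  // getCredential(credentialEnv) in onboard.js
  if (key) {
    env[credentialEnv] = key; // later checks in the flow see the resolved value
    return key;
  }
  return null; // caller prints the error and exits non-zero
}
```

The important detail from the review is the assignment back into `process.env`: downstream code checks the environment variable again, so returning the key without exporting it would not be enough.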
```sh
for f in "${FILES_CORE[@]}"; do
  if [ -f "${src}/${f}" ]; then
    if openshell sandbox upload "$sandbox" "${src}/${f}" "${WORKSPACE_PATH}/"; then
      count=$((count + 1))
    else
      warn "Failed to restore ${f}"
    fi
  fi
```
**Fail fast on core restore errors.**

Line 137 only warns and continues, so `upgrade-sandbox.sh` can finish with an incomplete persona if one core upload flakes. For `FILES_CORE`, this should abort the restore instead of counting a partial result as success.
Minimal hardening

```diff
 for f in "${FILES_CORE[@]}"; do
   if [ -f "${src}/${f}" ]; then
     if openshell sandbox upload "$sandbox" "${src}/${f}" "${WORKSPACE_PATH}/"; then
       count=$((count + 1))
     else
-      warn "Failed to restore ${f}"
+      fail "Failed to restore required workspace file ${f}"
     fi
   fi
 done
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
for f in "${FILES_CORE[@]}"; do
  if [ -f "${src}/${f}" ]; then
    if openshell sandbox upload "$sandbox" "${src}/${f}" "${WORKSPACE_PATH}/"; then
      count=$((count + 1))
    else
      fail "Failed to restore required workspace file ${f}"
    fi
  fi
done
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/backup-workspace.sh` around lines 132 - 139, The loop handling
FILES_CORE currently logs a warning on an openshell sandbox upload failure and
continues; change this to fail fast by exiting with a non-zero status so the
restore aborts on any core file upload error. Specifically, inside the for loop
that iterates FILES_CORE and calls openshell sandbox upload "$sandbox"
"${src}/${f}" "${WORKSPACE_PATH}/", replace the warn "Failed to restore ${f}"
behavior with an immediate exit (e.g., call exit 1 or a dedicated fatal/error
function) so count isn't incremented and upgrade-sandbox.sh cannot complete with
partial core restores.
```sh
if [ "$FULL_DATA" -eq 1 ]; then
  info "Step 2/3: Full OpenClaw snapshot (/sandbox/.openclaw-data/)..."
  mkdir -p "${DEST}/openclaw-data"
  if openshell sandbox download "$SANDBOX" /sandbox/.openclaw-data/ "${DEST}/openclaw-data/"; then
    info "Saved openclaw-data under ${DEST}/openclaw-data/"
  else
    warn "openclaw-data download failed or was empty; workspace files in ${DEST}/ are still valid."
```
**Don't continue a `--full-data` upgrade after the snapshot fails.**

If the user explicitly asked for `--full-data`, Lines 106-109 should not just warn and keep going into `nemoclaw onboard`. A transient failure here can destroy agents/sessions/plugins during the recreate step, and Lines 151-152 can still print a restore hint because the empty destination directory was already created.

Also applies to: 151-152
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/upgrade-sandbox.sh` around lines 103 - 109, When FULL_DATA is
requested, the script must not proceed if the openshell sandbox download of
/sandbox/.openclaw-data/ into "${DEST}/openclaw-data/" fails; change the logic
around the openshell sandbox download (the block using FULL_DATA, mkdir -p
"${DEST}/openclaw-data", and the openshell command) so you only create the
destination directory after a successful download or save to a temporary
location and move it on success, and on download failure exit non‑zero (do not
just warn) to stop subsequent steps like the "nemoclaw onboard" sequence; also
ensure the later restore hint printing (around the existing restore-hint lines)
is suppressed when the download failed so you don't show misleading restore
instructions for an empty or absent snapshot.
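The save-to-temp-then-move approach the prompt suggests can be sketched as below. `fake_download` is a placeholder for `openshell sandbox download`, and the layout is illustrative only:

```shell
#!/usr/bin/env bash
set -euo pipefail

DEST="$(mktemp -d)"

# Placeholder for `openshell sandbox download`; writes a dummy snapshot file.
fake_download() { mkdir -p "$1" && echo "snapshot" > "$1/state.json"; }

# Download into a temp dir and only move it into place on success, so a failed
# snapshot never leaves an empty ${DEST}/openclaw-data/ behind (and the later
# restore hint has nothing misleading to point at).
TMP="$(mktemp -d)"
if fake_download "$TMP/openclaw-data"; then
  mv "$TMP/openclaw-data" "${DEST}/openclaw-data"
  echo "Saved openclaw-data under ${DEST}/openclaw-data/"
else
  rm -rf "$TMP"
  echo "openclaw-data download failed; aborting --full-data upgrade" >&2
  exit 1 # stop before the destructive `nemoclaw onboard` recreate step
fi
```

Because the destination directory only ever appears after a successful move, "directory exists" becomes a reliable signal for whether to print the restore hint.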
Summary

Add a sandbox upgrade workflow that backs up workspace files, optionally snapshots full OpenClaw state, and re-runs `nemoclaw onboard`, restoring the workspace automatically afterward. Also enables non-interactive onboard to reuse provider/model settings from the sandbox registry, so upgrades skip inference prompts.

Changes
- `scripts/upgrade-sandbox.sh` (new): orchestrates backup → optional `--full-data` snapshot → optional `--run-onboard` → automatic restore
- `scripts/backup-workspace.sh`: split file lists into core/optional (quieter output for missing optional files), add `sandbox_download()` helper, add portable `latest_backup_timestamp()`
- `bin/lib/onboard.js`: add `inferRemoteProviderKeyFromStoredName()` and `hydrateNonInteractiveInferenceFromRegistry()` to fill provider/model env vars from the registry during non-interactive onboard; use `getCredential()` for the `NVIDIA_API_KEY` fallback
- `docs/workspace/backup-restore.md`: document the upgrade workflow

Type of Change
Testing

- `npx prek run --all-files` passes
- `./scripts/upgrade-sandbox.sh my-assistant` backs up and prints instructions
- `./scripts/upgrade-sandbox.sh --run-onboard --yes my-assistant` runs the full cycle

Checklist
General
git log --show-signature).Code Changes
`npx prek run --all-files` auto-fixes formatting.

Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Documentation