fix(onboard): run inference curl probes without shell expansion#890
fix(onboard): run inference curl probes without shell expansion#890OffbeatOps wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
- Use spawnSync("curl", argv) for OpenAI-like probes, Anthropic probes, and /models fetches so credentials are not reinterpreted by bash.
- Add --http1.1 on those requests to reduce HTTP/2-related failures on some networks.
- Report curl exit codes as curl failures (not HTTP status) and include trimmed stderr.
- Trim secrets loaded from the environment and credentials.json and strip CR characters.
📝 WalkthroughWalkthroughAdded a Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/credentials.js`:
- Around line 28-38: getCredential currently returns a normalized secret but
does not propagate that cleaned value back into process.env, which lets
onboarding's setupInference -> upsertProvider still read the original env var
(resolvedCredentialEnv) with CR/LF; update getCredential to normalize and then
assign the cleaned value back into process.env[key] before returning (use
normalizeSecret and loadCredentials as needed), or alternatively change the
provider-setup path (setupInference/upsertProvider) to call
getCredential(resolvedCredentialEnv) instead of reading process.env directly so
the cleaned value is consumed when exporting credentials.
In `@bin/lib/onboard.js`:
- Around line 365-369: probeDisplayCode currently only checks result.status and
returns null for spawnSync launch failures; change it to first check
result.error and result.signal and raise/return a clear failure instead of null:
if result.error exists, throw (or return an error object) with a message like
"curl failed to start: <result.error.code|message>", and if result.signal
exists, throw/return "curl terminated by signal <result.signal>". Apply the same
change to the other identical probe block referenced (the logic around the
second probe at the other occurrence) so callers receive actionable error
information rather than HTTP null.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fb779e01-5d57-413c-a5a6-c1868fe72e44
📒 Files selected for processing (2)
bin/lib/credentials.jsbin/lib/onboard.js
| function normalizeSecret(value) { | ||
| if (value == null) return null; | ||
| return String(value).replace(/\r/g, "").trim(); | ||
| } | ||
|
|
||
| function getCredential(key) { | ||
| if (process.env[key]) return process.env[key]; | ||
| if (process.env[key]) return normalizeSecret(process.env[key]); | ||
| const creds = loadCredentials(); | ||
| return creds[key] || null; | ||
| const raw = creds[key]; | ||
| if (raw == null) return null; | ||
| return normalizeSecret(raw); |
There was a problem hiding this comment.
Propagate the cleaned secret into the environment.
getCredential() only returns the normalized value. In non-interactive onboarding, setupInference() still forwards process.env[resolvedCredentialEnv] unchanged to upsertProvider() at bin/lib/onboard.js Line 1957 through Line 1960, so a key with trailing CR/LF can pass the probe here and then still be written to OpenShell with the bad bytes. Please normalize before exporting, or have the provider-setup path consume getCredential() instead of the raw env var.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/credentials.js` around lines 28 - 38, getCredential currently returns
a normalized secret but does not propagate that cleaned value back into
process.env, which lets onboarding's setupInference -> upsertProvider still read
the original env var (resolvedCredentialEnv) with CR/LF; update getCredential to
normalize and then assign the cleaned value back into process.env[key] before
returning (use normalizeSecret and loadCredentials as needed), or alternatively
change the provider-setup path (setupInference/upsertProvider) to call
getCredential(resolvedCredentialEnv) instead of reading process.env directly so
the cleaned value is consumed when exporting credentials.
| function probeDisplayCode(result) { | ||
| const httpCode = Number(String(result.stdout || "").trim()); | ||
| if (result.status !== 0) return result.status; | ||
| return Number.isFinite(httpCode) ? httpCode : 0; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In Node.js child_process.spawnSync, what values do result.status, result.signal, and result.error have when the executable cannot be found or when the child process is terminated by a signal?
💡 Result:
When the executable cannot be found (ENOENT error): - result.status: null - result.signal: null - result.error: Error object with code 'ENOENT' (e.g., Error: spawnSync not_found ENOENT) When the child process is terminated by a signal: - result.status: null - result.signal: the signal name (e.g., 'SIGTERM', 'SIGKILL') - result.error: undefined
Citations:
- 1: doc: child_process.SpawnSync output, stdout, stderr can be null nodejs/node#33458
- 2: https://docs.deno.com/api/node/child_process/~/SpawnSyncReturns
- 3: https://nodejs.org/docs/latest/api/child_process.html
- 4: https://stackoverflow.com/questions/37125619/spawnsyncnpm-install-gives-error-spawnsync-npm-enoent
🏁 Script executed:
cat -n bin/lib/onboard.js | sed -n '360,375p'Repository: NVIDIA/NemoClaw
Length of output: 1870
🏁 Script executed:
cat -n bin/lib/onboard.js | sed -n '525,545p'Repository: NVIDIA/NemoClaw
Length of output: 899
🏁 Script executed:
# Search for callers of probeDisplayCode to understand the impact
rg "probeDisplayCode" bin/lib/onboard.js -B 3 -A 3Repository: NVIDIA/NemoClaw
Length of output: 2136
Handle spawnSync() launch failures before formatting them as HTTP errors.
If curl cannot be started or the child is terminated by a signal, spawnSync() sets result.status to null and reports the failure via result.error (with code 'ENOENT' for missing executable) or result.signal (for signal termination). The probeDisplayCode() helper currently returns null in these cases, so callers end up showing HTTP null with no response body instead of an actionable curl failed to start or curl terminated message.
Check result.error and result.signal in addition to result.status to provide meaningful error messages to users.
Also applies to: 532-536
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 365 - 369, probeDisplayCode currently only
checks result.status and returns null for spawnSync launch failures; change it
to first check result.error and result.signal and raise/return a clear failure
instead of null: if result.error exists, throw (or return an error object) with
a message like "curl failed to start: <result.error.code|message>", and if
result.signal exists, throw/return "curl terminated by signal <result.signal>".
Apply the same change to the other identical probe block referenced (the logic
around the second probe at the other occurrence) so callers receive actionable
error information rather than HTTP null.
## Summary Smooth out inference configuration during `install.sh` / `nemoclaw onboard`, especially when provider authorization, credential formatting, endpoint probing, or final inference application fail. This PR makes the hosted-provider onboarding path recoverable instead of brittle: - normalize and safely handle credential input - classify validation failures more accurately - let users re-enter credentials in place - make final `openshell inference set` failures recoverable - normalize over-specified custom base URLs - add lower-level `back` / `exit` navigation so users can move up a level without restarting the whole install - clarify recovery prompts with explicit commands (`retry`, `back`, `exit`) ## What Changed - refactored provider probe execution to use direct `curl` argv invocation instead of `bash -c` - normalized credential values before use/persistence - added structured auth / transport / model / endpoint failure classification - added in-place credential re-entry for hosted providers: - NVIDIA Endpoints - OpenAI - Anthropic - Google Gemini - custom OpenAI-compatible endpoints - custom Anthropic-compatible endpoints - wrapped final provider/apply failures in interactive recovery instead of hard abort - added command-style recovery prompts: - `retry` - `back` - `exit` - allowed `back` from lower-level inference prompts (model entry, base URL entry, recovery prompts) - normalized custom endpoint inputs to the minimum usable base URL - removed stale `NVIDIA Endpoints (recommended)` wording - secret prompts now show masked `*` feedback while typing/pasting ## Validation ```bash npx vitest run test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js npx vitest run test/cli.test.js npx eslint bin/lib/credentials.js bin/lib/onboard.js test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js npx tsc -p jsconfig.json --noEmit ``` ## Issue Mapping Fully addressed in this PR: - Fixes #1099 - Fixes #1101 - Fixes #1130 Substantially addressed / partially addressed: - #987 - improves NVIDIA validation behavior and failure classification so false/misleading connectivity failures are much less likely, but this PR is framed as onboarding recovery hardening rather than a WSL-specific networking fix - #301 - improves graceful handling when validation/apply fails, especially for transport/upstream problems, but does not add provider auto-fallback or a broader cloud-outage fallback strategy - #446 - improves recovery specifically for the inference-configuration step, but does not fully solve general resumability across all onboarding steps Related implementation direction: - #890 - this PR aligns with the intent of safer/non-shell probe execution and clearer validation reporting - #380 - not implemented here; no automatic provider fallback was added in this branch ## Notes - This PR intentionally does not weaken validation or reopen old onboarding shortcuts. - Unrelated local `tmp/` noise was left out of the branch. Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Interactive onboarding navigation (`back`/`exit`/`quit`) with credential re-prompting and retry flows. * Improved probe/validation flow with clearer recovery options and more robust sandbox build progress messages. * Secret input masks with reliable backspace behavior. * **Bug Fixes** * Credential sanitization (trim/line-ending normalization) and API key validation now normalize and retry instead of exiting. * Better classification and messaging for authorization/validation failures; retries where appropriate. * **Tests** * Expanded tests for credential prompts, masking, retry flows, validation classification, and onboarding navigation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
|
✨ Thanks for submitting this PR with a detailed summary, it proposes a fix to improve the onboarding experience and prevent issues with curl probes. |
…ript modules (#1240) ## Summary - Extract ~210 lines of pure, side-effect-free functions from the 3,800-line `onboard.js` into **5 typed TypeScript modules** under `src/lib/`: - `gateway-state.ts` — gateway/sandbox state classification from openshell output - `validation.ts` — failure classification, API key validation, model ID checks - `url-utils.ts` — URL normalization, text compaction, env formatting - `build-context.ts` — Docker build context filtering, recovery hints - `dashboard.ts` — dashboard URL resolution and construction - Add **56 co-located unit tests** (`src/lib/*.test.ts`) for the extracted modules - Set up CLI TypeScript compilation: `tsconfig.src.json` compiles `src/` → `dist/` as CJS - `onboard.js` imports from compiled `dist/lib/` output — transparent to callers - Pre-commit hook updated to build TS and include `dist/lib/` in coverage These functions are **not touched by any #924 blocker PR** (#781, #782, #819, #672, #634, #890), so this extraction is safe to land immediately. ## Test plan - [x] 598 CLI tests pass (542 existing + 56 new) - [x] `tsc -p tsconfig.src.json` compiles cleanly - [x] `tsc -p tsconfig.cli.json` type-checks cleanly - [x] `tsc -p jsconfig.json` type-checks cleanly - [x] Coverage ratchet passes with `dist/lib/` included Closes #1237. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved sandbox-creation recovery hints and targeted remediation commands. * Smarter dashboard URL resolution and control-UI URL construction. * **Bug Fixes** * More accurate gateway and sandbox state detection. * Enhanced classification of validation/apply failures and safer model/key validation. * Better provider URL normalization and loopback handling. * **Tests** * Added comprehensive tests covering new utilities. * **Chores** * CLI now builds before CLI tests; CI/commit hooks updated to run the CLI build. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary Smooth out inference configuration during `install.sh` / `nemoclaw onboard`, especially when provider authorization, credential formatting, endpoint probing, or final inference application fail. This PR makes the hosted-provider onboarding path recoverable instead of brittle: - normalize and safely handle credential input - classify validation failures more accurately - let users re-enter credentials in place - make final `openshell inference set` failures recoverable - normalize over-specified custom base URLs - add lower-level `back` / `exit` navigation so users can move up a level without restarting the whole install - clarify recovery prompts with explicit commands (`retry`, `back`, `exit`) ## What Changed - refactored provider probe execution to use direct `curl` argv invocation instead of `bash -c` - normalized credential values before use/persistence - added structured auth / transport / model / endpoint failure classification - added in-place credential re-entry for hosted providers: - NVIDIA Endpoints - OpenAI - Anthropic - Google Gemini - custom OpenAI-compatible endpoints - custom Anthropic-compatible endpoints - wrapped final provider/apply failures in interactive recovery instead of hard abort - added command-style recovery prompts: - `retry` - `back` - `exit` - allowed `back` from lower-level inference prompts (model entry, base URL entry, recovery prompts) - normalized custom endpoint inputs to the minimum usable base URL - removed stale `NVIDIA Endpoints (recommended)` wording - secret prompts now show masked `*` feedback while typing/pasting ## Validation ```bash npx vitest run test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js npx vitest run test/cli.test.js npx eslint bin/lib/credentials.js bin/lib/onboard.js test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js npx tsc -p jsconfig.json --noEmit ``` ## Issue Mapping Fully addressed in this PR: - Fixes #1099 - Fixes #1101 - Fixes #1130 Substantially addressed / partially addressed: - #987 - improves NVIDIA validation behavior and failure classification so false/misleading connectivity failures are much less likely, but this PR is framed as onboarding recovery hardening rather than a WSL-specific networking fix - #301 - improves graceful handling when validation/apply fails, especially for transport/upstream problems, but does not add provider auto-fallback or a broader cloud-outage fallback strategy - #446 - improves recovery specifically for the inference-configuration step, but does not fully solve general resumability across all onboarding steps Related implementation direction: - #890 - this PR aligns with the intent of safer/non-shell probe execution and clearer validation reporting - #380 - not implemented here; no automatic provider fallback was added in this branch ## Notes - This PR intentionally does not weaken validation or reopen old onboarding shortcuts. - Unrelated local `tmp/` noise was left out of the branch. Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Interactive onboarding navigation (`back`/`exit`/`quit`) with credential re-prompting and retry flows. * Improved probe/validation flow with clearer recovery options and more robust sandbox build progress messages. * Secret input masks with reliable backspace behavior. * **Bug Fixes** * Credential sanitization (trim/line-ending normalization) and API key validation now normalize and retry instead of exiting. * Better classification and messaging for authorization/validation failures; retries where appropriate. * **Tests** * Expanded tests for credential prompts, masking, retry flows, validation classification, and onboarding navigation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…ript modules (#1240) ## Summary - Extract ~210 lines of pure, side-effect-free functions from the 3,800-line `onboard.js` into **5 typed TypeScript modules** under `src/lib/`: - `gateway-state.ts` — gateway/sandbox state classification from openshell output - `validation.ts` — failure classification, API key validation, model ID checks - `url-utils.ts` — URL normalization, text compaction, env formatting - `build-context.ts` — Docker build context filtering, recovery hints - `dashboard.ts` — dashboard URL resolution and construction - Add **56 co-located unit tests** (`src/lib/*.test.ts`) for the extracted modules - Set up CLI TypeScript compilation: `tsconfig.src.json` compiles `src/` → `dist/` as CJS - `onboard.js` imports from compiled `dist/lib/` output — transparent to callers - Pre-commit hook updated to build TS and include `dist/lib/` in coverage These functions are **not touched by any #924 blocker PR** (#781, #782, #819, #672, #634, #890), so this extraction is safe to land immediately. ## Test plan - [x] 598 CLI tests pass (542 existing + 56 new) - [x] `tsc -p tsconfig.src.json` compiles cleanly - [x] `tsc -p tsconfig.cli.json` type-checks cleanly - [x] `tsc -p jsconfig.json` type-checks cleanly - [x] Coverage ratchet passes with `dist/lib/` included Closes #1237. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved sandbox-creation recovery hints and targeted remediation commands. * Smarter dashboard URL resolution and control-UI URL construction. * **Bug Fixes** * More accurate gateway and sandbox state detection. * Enhanced classification of validation/apply failures and safer model/key validation. * Better provider URL normalization and loopback handling. * **Tests** * Added comprehensive tests covering new utilities. * **Chores** * CLI now builds before CLI tests; CI/commit hooks updated to run the CLI build. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Onboarding inference validation builds
curlcommands underbash -cand passes the API token via$NEMOCLAW_PROBE_API_KEY. That is fragile: corrupted or pasted keys (concatenation, stray CR/LF, shell-sensitive characters) can makecurlfail before any HTTP response. Users then saw messages likeHTTP 43even though 43 was the curl exit code, not an HTTP status.Changes
/modelsfetches withspawnSync("curl", argv, …)so headers and bodies are passed as argv (no shell re-parsing of secrets).--http1.1on those requests to reduce occasional HTTP/2-related failures on some networks.curlexits non-zero, report curl failed (exit N) and include a short stderr snippet instead of labeling the code as HTTP.\rfor values read from the environment and~/.nemoclaw/credentials.json.Testing
node --checkon the modified files.Happy to adjust if you prefer keeping HTTP/2 for probes or want the credential normalization scoped to specific keys only.
Summary by CodeRabbit
Bug Fixes
Improvements