
fix(onboard): run inference curl probes without shell expansion#890

Open
OffbeatOps wants to merge 1 commit into NVIDIA:main from OffbeatOps:fix/onboard-inference-curl-probes

Conversation

@OffbeatOps

@OffbeatOps OffbeatOps commented Mar 25, 2026

Summary

Onboarding inference validation builds curl commands under `bash -c` and passes the API token via `$NEMOCLAW_PROBE_API_KEY`. That approach is fragile: corrupted or pasted keys (concatenated values, stray CR/LF, shell-sensitive characters) can make curl fail before any HTTP response is received. Users then saw messages like `HTTP 43` even though 43 was curl's exit code, not an HTTP status.

Changes

  • Run the same probes and /models fetches with `spawnSync("curl", argv, …)` so headers and bodies are passed as argv (no shell re-parsing of secrets).
  • Add `--http1.1` on those requests to reduce occasional HTTP/2-related failures on some networks.
  • When curl exits non-zero, report `curl failed (exit N)` and include a short stderr snippet instead of labeling the code as HTTP.
  • Normalize credentials: trim and strip `\r` for values read from the environment and `~/.nemoclaw/credentials.json`.

Testing

  • node --check on the modified files.
  • Manually exercised NVIDIA Endpoints onboarding after these changes (chat completions probe succeeds; Responses probe still 404 as expected for integrate.api.nvidia.com).

Happy to adjust if you prefer keeping HTTP/2 for probes or want the credential normalization scoped to specific keys only.

Summary by CodeRabbit

  • Bug Fixes

    • Improved credential normalization to consistently handle whitespace and carriage returns across all secret sources.
    • Enhanced error messages when probing network endpoints with additional diagnostic details.
  • Improvements

    • Better HTTP status code detection and error reporting during network configuration steps.

- Use spawnSync("curl", argv) for OpenAI-like probes, Anthropic probes, and /models fetches so credentials are not reinterpreted by bash.
- Add --http1.1 on those requests to reduce HTTP/2-related failures on some networks.
- Report curl exit codes as curl failures (not HTTP status) and include trimmed stderr.
- Trim secrets loaded from the environment and credentials.json and strip CR characters.
@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

Added a normalizeSecret helper function to sanitize credentials by removing carriage returns and whitespace. Refactored network probing functions in onboard.js to replace shell-based curl invocations with direct spawnSync("curl", ...) calls, eliminating environment variable injection in shell commands and improving error reporting with stderr output.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Credential Normalization**<br>`bin/lib/credentials.js` | Added `normalizeSecret()` helper to standardize credential handling by removing `\r` characters and trimming whitespace. Updated `getCredential()` to apply normalization consistently to both environment and file-based credentials, returning `null` for absent/empty values. |
| **Network Probe Refactoring**<br>`bin/lib/onboard.js` | Replaced shell-constructed bash commands with direct `spawnSync("curl", ...)` invocations across five probe/fetch functions. Added `getCurlSpawnArgs()` and `probeDisplayCode()` helpers. Enhanced `summarizeProbeError()` to include truncated stderr output for curl exit codes (1–99), improving diagnostic information. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Secrets now trimmed and all shiny and clean,
No carriage returns mucking up the scene,
With curl now spawned, no shell tricks in sight,
Our probes run direct—secure and tight! 🔐

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'fix(onboard): run inference curl probes without shell expansion' directly and accurately summarizes the main change: replacing bash-spawned curl commands with direct spawnSync curl calls to avoid shell expansion issues. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/credentials.js`:
- Around line 28-38: getCredential currently returns a normalized secret but
does not propagate that cleaned value back into process.env, which lets
onboarding's setupInference -> upsertProvider still read the original env var
(resolvedCredentialEnv) with CR/LF; update getCredential to normalize and then
assign the cleaned value back into process.env[key] before returning (use
normalizeSecret and loadCredentials as needed), or alternatively change the
provider-setup path (setupInference/upsertProvider) to call
getCredential(resolvedCredentialEnv) instead of reading process.env directly so
the cleaned value is consumed when exporting credentials.

In `@bin/lib/onboard.js`:
- Around line 365-369: probeDisplayCode currently only checks result.status and
returns null for spawnSync launch failures; change it to first check
result.error and result.signal and raise/return a clear failure instead of null:
if result.error exists, throw (or return an error object) with a message like
"curl failed to start: <result.error.code|message>", and if result.signal
exists, throw/return "curl terminated by signal <result.signal>". Apply the same
change to the other identical probe block referenced (the logic around the
second probe at the other occurrence) so callers receive actionable error
information rather than HTTP null.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fb779e01-5d57-413c-a5a6-c1868fe72e44

📥 Commits

Reviewing files that changed from the base of the PR and between b2164e7 and c5658ad.

📒 Files selected for processing (2)
  • bin/lib/credentials.js
  • bin/lib/onboard.js

Comment on lines +28 to +38
```diff
 function normalizeSecret(value) {
   if (value == null) return null;
   return String(value).replace(/\r/g, "").trim();
 }

 function getCredential(key) {
-  if (process.env[key]) return process.env[key];
+  if (process.env[key]) return normalizeSecret(process.env[key]);
   const creds = loadCredentials();
-  return creds[key] || null;
+  const raw = creds[key];
+  if (raw == null) return null;
+  return normalizeSecret(raw);
```

⚠️ Potential issue | 🟠 Major

Propagate the cleaned secret into the environment.

`getCredential()` only returns the normalized value. In non-interactive onboarding, `setupInference()` still forwards `process.env[resolvedCredentialEnv]` unchanged to `upsertProvider()` (bin/lib/onboard.js, lines 1957–1960), so a key with a trailing CR/LF can pass the probe here and then still be written to OpenShell with the bad bytes. Please normalize before exporting, or have the provider-setup path consume `getCredential()` instead of the raw env var.
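
One possible shape for the suggested fix, sketched below. The extra parameters are hypothetical injection points for illustration; the real `getCredential` takes only `key`:

```javascript
// Sketch: write the normalized secret back into the environment so later
// readers of process.env see the cleaned value. normalizeSecret mirrors
// the PR's helper; the env/loadCredentials parameters are hypothetical.
function normalizeSecret(value) {
  if (value == null) return null;
  return String(value).replace(/\r/g, "").trim();
}

function getCredential(key, env = process.env, loadCredentials = () => ({})) {
  let value = normalizeSecret(env[key]);
  if (!value) value = normalizeSecret(loadCredentials()[key]);
  if (!value) return null;          // absent or empty after normalization
  env[key] = value;                 // propagate so raw env readers agree
  return value;
}
```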


Comment on lines +365 to +369
```javascript
function probeDisplayCode(result) {
  const httpCode = Number(String(result.stdout || "").trim());
  if (result.status !== 0) return result.status;
  return Number.isFinite(httpCode) ? httpCode : 0;
}
```
The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Node.js child_process.spawnSync, what values do result.status, result.signal, and result.error have when the executable cannot be found or when the child process is terminated by a signal?

💡 Result:

When the executable cannot be found (ENOENT error):
  • result.status: null
  • result.signal: null
  • result.error: Error object with code 'ENOENT' (e.g., Error: spawnSync not_found ENOENT)

When the child process is terminated by a signal:
  • result.status: null
  • result.signal: the signal name (e.g., 'SIGTERM', 'SIGKILL')
  • result.error: undefined


🏁 Script executed:

```shell
cat -n bin/lib/onboard.js | sed -n '360,375p'
```

Repository: NVIDIA/NemoClaw


🏁 Script executed:

```shell
cat -n bin/lib/onboard.js | sed -n '525,545p'
```


🏁 Script executed:

# Search for callers of probeDisplayCode to understand the impact
rg "probeDisplayCode" bin/lib/onboard.js -B 3 -A 3

Repository: NVIDIA/NemoClaw

Length of output: 2136


Handle spawnSync() launch failures before formatting them as HTTP errors.

If curl cannot be started or the child is terminated by a signal, `spawnSync()` sets `result.status` to `null` and reports the failure via `result.error` (code `ENOENT` for a missing executable) or `result.signal` (for signal termination). The `probeDisplayCode()` helper currently returns `null` in these cases, so callers end up showing "HTTP null" with no response body instead of an actionable "curl failed to start" or "curl terminated" message.

Check result.error and result.signal in addition to result.status to provide meaningful error messages to users.

Also applies to: 532-536
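
A sketch of what the hardened helper could look like, checking `result.error` and `result.signal` before `result.status`. This is one possible shape, not the committed fix:

```javascript
// Sketch of a probeDisplayCode that distinguishes launch failures and
// signal terminations from real curl exit codes (one possible shape,
// not the merged implementation).
function probeDisplayCode(result) {
  if (result.error) {
    // spawnSync could not start curl at all (e.g. ENOENT); status is null here.
    throw new Error(`curl failed to start: ${result.error.code || result.error.message}`);
  }
  if (result.signal) {
    // Child was killed by a signal; status is also null in this case.
    throw new Error(`curl terminated by signal ${result.signal}`);
  }
  if (result.status !== 0) return result.status; // genuine curl exit code (1-99)
  const httpCode = Number(String(result.stdout || "").trim());
  return Number.isFinite(httpCode) ? httpCode : 0;
}
```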


kjw3 added a commit that referenced this pull request Mar 31, 2026
## Summary

Smooth out inference configuration during `install.sh` / `nemoclaw
onboard`, especially when provider authorization, credential formatting,
endpoint probing, or final inference application fail.

This PR makes the hosted-provider onboarding path recoverable instead of
brittle:
- normalize and safely handle credential input
- classify validation failures more accurately
- let users re-enter credentials in place
- make final `openshell inference set` failures recoverable
- normalize over-specified custom base URLs
- add lower-level `back` / `exit` navigation so users can move up a
level without restarting the whole install
- clarify recovery prompts with explicit commands (`retry`, `back`,
`exit`)

## What Changed

- refactored provider probe execution to use direct `curl` argv
invocation instead of `bash -c`
- normalized credential values before use/persistence
- added structured auth / transport / model / endpoint failure
classification
- added in-place credential re-entry for hosted providers:
  - NVIDIA Endpoints
  - OpenAI
  - Anthropic
  - Google Gemini
  - custom OpenAI-compatible endpoints
  - custom Anthropic-compatible endpoints
- wrapped final provider/apply failures in interactive recovery instead
of hard abort
- added command-style recovery prompts:
  - `retry`
  - `back`
  - `exit`
- allowed `back` from lower-level inference prompts (model entry, base
URL entry, recovery prompts)
- normalized custom endpoint inputs to the minimum usable base URL
- removed stale `NVIDIA Endpoints (recommended)` wording
- secret prompts now show masked `*` feedback while typing/pasting

## Validation

```bash
npx vitest run test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx vitest run test/cli.test.js
npx eslint bin/lib/credentials.js bin/lib/onboard.js test/credentials.test.js test/onboard-selection.test.js test/onboard.test.js
npx tsc -p jsconfig.json --noEmit
```

## Issue Mapping

Fully addressed in this PR:
- Fixes #1099
- Fixes #1101
- Fixes #1130

Substantially addressed / partially addressed:
- #987
- improves NVIDIA validation behavior and failure classification so
false/misleading connectivity failures are much less likely, but this PR
is framed as onboarding recovery hardening rather than a WSL-specific
networking fix
- #301
- improves graceful handling when validation/apply fails, especially for
transport/upstream problems, but does not add provider auto-fallback or
a broader cloud-outage fallback strategy
- #446
- improves recovery specifically for the inference-configuration step,
but does not fully solve general resumability across all onboarding
steps

Related implementation direction:
- #890
- this PR aligns with the intent of safer/non-shell probe execution and
clearer validation reporting
- #380
- not implemented here; no automatic provider fallback was added in this
branch

## Notes

- This PR intentionally does not weaken validation or reopen old
onboarding shortcuts.
- Unrelated local `tmp/` noise was left out of the branch.

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Interactive onboarding navigation (`back`/`exit`/`quit`) with
credential re-prompting and retry flows.
* Improved probe/validation flow with clearer recovery options and more
robust sandbox build progress messages.
  * Secret input masks with reliable backspace behavior.

* **Bug Fixes**
* Credential sanitization (trim/line-ending normalization) and API key
validation now normalize and retry instead of exiting.
* Better classification and messaging for authorization/validation
failures; retries where appropriate.

* **Tests**
* Expanded tests for credential prompts, masking, retry flows,
validation classification, and onboarding navigation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the fix label Mar 31, 2026
@wscurran
Contributor

✨ Thanks for submitting this PR with a detailed summary; it proposes a fix that improves the onboarding experience and prevents issues with curl probes.

@wscurran wscurran added the Getting Started Use this label to identify setup, installation, or onboarding issues. label Mar 31, 2026
cv added a commit that referenced this pull request Apr 1, 2026
…ript modules (#1240)

## Summary

- Extract ~210 lines of pure, side-effect-free functions from the
3,800-line `onboard.js` into **5 typed TypeScript modules** under
`src/lib/`:
- `gateway-state.ts` — gateway/sandbox state classification from
openshell output
- `validation.ts` — failure classification, API key validation, model ID
checks
  - `url-utils.ts` — URL normalization, text compaction, env formatting
  - `build-context.ts` — Docker build context filtering, recovery hints
  - `dashboard.ts` — dashboard URL resolution and construction
- Add **56 co-located unit tests** (`src/lib/*.test.ts`) for the
extracted modules
- Set up CLI TypeScript compilation: `tsconfig.src.json` compiles `src/`
→ `dist/` as CJS
- `onboard.js` imports from compiled `dist/lib/` output — transparent to
callers
- Pre-commit hook updated to build TS and include `dist/lib/` in
coverage

These functions are **not touched by any #924 blocker PR** (#781, #782,
#819, #672, #634, #890), so this extraction is safe to land immediately.

## Test plan

- [x] 598 CLI tests pass (542 existing + 56 new)
- [x] `tsc -p tsconfig.src.json` compiles cleanly
- [x] `tsc -p tsconfig.cli.json` type-checks cleanly
- [x] `tsc -p jsconfig.json` type-checks cleanly
- [x] Coverage ratchet passes with `dist/lib/` included

Closes #1237. Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved sandbox-creation recovery hints and targeted remediation
commands.
  * Smarter dashboard URL resolution and control-UI URL construction.

* **Bug Fixes**
  * More accurate gateway and sandbox state detection.
* Enhanced classification of validation/apply failures and safer
model/key validation.
  * Better provider URL normalization and loopback handling.

* **Tests**
  * Added comprehensive tests covering new utilities.

* **Chores**
* CLI now builds before CLI tests; CI/commit hooks updated to run the
CLI build.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026

Labels

fix Getting Started Use this label to identify setup, installation, or onboarding issues.
