fix: validate and adapt data designer + jobs e2e tests for K8s #470

Open
matthewgrossman wants to merge 4 commits into main from mgrossman/aircore-844-validate-and-adapt-data-designer-e2e-tests-for-minikubek8s

matthewgrossman (Contributor) commented Jun 25, 2026

Summary

  • Fix data designer jobs on K8s: Added NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH to the data designer plugin's job step environment. Without it, the K8s backend never creates the PVC mount and the bridge crashes with a KeyError. Every other plugin (evaluator, agents, anonymizer) already sets it.
  • E2E test K8s compatibility: Fixed test_job_passing_data_between_steps (same env var issue), made container-backend tests conditionally skipped instead of unconditionally skipped, and increased timeouts for K8s OTLP batching and artifact download.
  • New CI job kind-cpu-e2e: Runs the full test_jobs.py + test_data_designer.py suite against a Kind cluster on every PR.
  • Extracted setup-kind-cluster composite action: Deduplicates ~165 lines of Kind cluster setup shared between kind-cpu-smoke and the new kind-cpu-e2e job.
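
To illustrate the first bullet, here is a minimal sketch of why a missing variable surfaces as a KeyError. The variable name comes from the summary above; `build_step_env`, `read_storage_path`, and the path value are hypothetical stand-ins, not the plugin's or the bridge's actual code.

```python
from typing import Dict, Mapping, Optional

# Name taken from the PR summary. The path value is a placeholder assumption,
# not the platform's real DEFAULT_JOB_STORAGE_PATH.
PERSISTENT_JOB_STORAGE_PATH_ENVVAR = "NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH"
ASSUMED_STORAGE_PATH = "/tmp/job-storage"


def build_step_env(extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: build a step environment that always carries the storage path."""
    env = {PERSISTENT_JOB_STORAGE_PATH_ENVVAR: ASSUMED_STORAGE_PATH}
    env.update(extra or {})
    return env


def read_storage_path(env: Mapping[str, str]) -> str:
    """Hypothetical consumer: a plain mapping lookup raises KeyError when the variable is absent."""
    return env[PERSISTENT_JOB_STORAGE_PATH_ENVVAR]
```

Under these assumptions, `read_storage_path(build_step_env())` returns the placeholder path, while `read_storage_path({})` raises `KeyError`, which matches the failure mode described above.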

Test plan

  • All 14 tests pass against minikube locally (1 skipped: additional_volumes, 1 deselected: nemotron_personas)
  • All 8 tests pass in subprocess mode (no regressions)
  • Data designer unit tests pass (11/11)
  • kind-cpu-smoke CI job passes
  • kind-cpu-e2e CI job passes
  • python-e2e-test CI job passes (subprocess mode)

Closes AIRCORE-844

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added a reusable GitHub composite action to provision a local Kind Kubernetes cluster and deploy NeMo Platform for E2E/CPU runs.
    • Added an E2E installer entrypoint for minikube.
    • Wired persistent job storage path into job execution for Kubernetes runs.
  • Bug Fixes
    • Improved job-launcher execution/log handling for more reliable telemetry and subprocess output draining.
    • Hardened E2E tests (timeouts, log checks, skip logic, and pause/resume expectations).
  • CI Updates
    • Refactored Kind smoke/e2e workflows to use the shared action, updated report/artifact naming, and extended CI gating to include the E2E phase.

Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
@matthewgrossman matthewgrossman requested review from a team as code owners June 25, 2026 19:24
@github-actions github-actions Bot added the fix label Jun 25, 2026
coderabbitai Bot commented Jun 25, 2026

Walkthrough

The PR adds a reusable Kind setup action, rewires CI to use it, and updates E2E scripts and tests for persistent storage, skip handling, and longer timeouts. It also changes jobs-launcher OTEL exporting and subprocess wait ordering.

Changes

Kind CI setup

| Layer / File(s) | Summary |
| --- | --- |
| Action interface and tooling (`.github/actions/setup-kind-cluster/action.yaml`) | The composite action declares cluster, namespace, gateway, registry, tag, Helm values, pull-credential, and NGC inputs, exposes cluster_url, and installs disk cleanup, kind, kubectl, Helm, and uv. |
| Cluster bootstrap and readiness (`.github/actions/setup-kind-cluster/action.yaml`) | The action starts the Kind cluster with setup_local_kind_cpu.sh, sets the kubectl namespace, verifies Gateway API CRDs and gateway resources, and pre-pulls GHCR images. |
| Helm install and API wait (`.github/actions/setup-kind-cluster/action.yaml`) | The action runs install_helm_e2e.sh with registry/tag and image defaults, prints helm/kubectl diagnostics on failure, writes cluster_url to GITHUB_OUTPUT, and waits for /cluster-info to respond. |
| CI workflow wiring (`.github/workflows/ci.yaml`) | The Kind CPU jobs use setup-kind-cluster, set KIND_CLUSTER_NAME, run pytest against NMP_E2E_CLUSTER_URL, rename Kubernetes artifacts and JUnit output to e2e naming, and add kind-cpu-e2e to ci-status. |

E2E harness updates

| Layer / File(s) | Summary |
| --- | --- |
| Local E2E installer defaults (`e2e/k8s/scripts/install_nmp_e2e.sh`) | The new script sets default namespace, release name, Helm values, registry/tag, and image variables, then execs install_helm_e2e.sh. |
| Persistent job storage env (`e2e/test_jobs.py`, `plugins/nemo-data-designer/src/.../jobs/create.py`) | CreateJob.compile() adds PERSISTENT_JOB_STORAGE_PATH_ENVVAR=DEFAULT_JOB_STORAGE_PATH to the CPU step, and the between-steps e2e test passes the same path into both step environments. |
| Job test skips and assertions (`e2e/test_jobs.py`) | The job tests add subprocess-aware skipping from NMP_BASE_URL, extend expected-failure log polling, cancel after pause/resume verification, and update additional-volume and invalid-image skips. |
| Data designer timeout (`e2e/test_data_designer.py`) | The data-designer tests add a 600-second pytest timeout marker and make artifact download polling use a configurable timeout parameter. |
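
The "configurable timeout parameter" idea in the last row can be sketched as a small polling helper. This is an illustrative assumption, not the repository's actual download or log-polling code: `poll_until` and its parameter names are hypothetical, and the 240-second default only mirrors the value mentioned in a later commit message on this PR.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    check: Callable[[], Optional[T]],
    timeout_s: float = 240.0,
    interval_s: float = 1.0,
) -> T:
    """Call check() until it returns a non-None value or timeout_s elapses."""
    # time.monotonic() is immune to wall-clock adjustments during the wait.
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)
```

A caller on a slower backend could pass a larger `timeout_s` without touching the loop itself, which is the point of exposing the timeout as a parameter rather than hard-coding it.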

Jobs launcher OTEL and exec flow

| Layer / File(s) | Summary |
| --- | --- |
| OTEL logger provider (`services/core/jobs/jobs-launcher/cmd/otel.go`, `services/core/jobs/jobs-launcher/cmd/run.go`) | The launcher switches the OTEL log processor from batch to simple exporting and removes the logger-provider flush path from runExec. |
| Exec wait and output drain (`services/core/jobs/jobs-launcher/cmd/run.go`) | runExec drops the logger-provider parameter, waits for stdout/stderr readers before cmd.Wait(), and removes the prior post-wait flush and drain ordering. |
| Launcher test updates (`services/core/jobs/jobs-launcher/cmd/run_test.go`) | The runExec call sites in the launcher tests are updated to match the new signature with the trailing nil removed. |

Sequence Diagram(s)

sequenceDiagram
  participant ciYaml as .github/workflows/ci.yaml
  participant setupKindCluster as .github/actions/setup-kind-cluster/action.yaml
  participant setupLocalKindCpu as e2e/k8s/scripts/setup_local_kind_cpu.sh
  participant installHelmE2e as e2e/k8s/scripts/install_helm_e2e.sh
  participant waitForApi as e2e/k8s/scripts/wait_for_api.sh
  ciYaml->>setupKindCluster: invoke reusable action
  setupKindCluster->>setupLocalKindCpu: start Kind cluster
  setupKindCluster->>installHelmE2e: install NeMo Platform
  setupKindCluster->>waitForApi: wait for /cluster-info
  waitForApi-->>setupKindCluster: API healthy
  setupKindCluster-->>ciYaml: cluster_url output

Suggested labels

ci

Suggested reviewers

  • mckornfield
  • crookedstorm
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 73.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title is specific and matches the main K8s e2e test compatibility changes, even if it omits the CI and launcher updates. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions



coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@e2e/test_jobs.py`:
- Around line 380-381: The subprocess skip marker in test_jobs.py is using
NMP_BASE_URL as a proxy for backend mode, but that setting is only for
external-vs-local platform selection. Update the _is_subprocess_mode /
_skip_subprocess logic to check the real backend or cluster configuration used
by the test setup in e2e/conftest.py, so container-backed Kind/Docker runs are
not incorrectly skipped. Use the existing test configuration symbols around the
subprocess/container backend selection to locate the right signal.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3279eb63-e499-4810-912c-982644780ea6

📥 Commits

Reviewing files that changed from the base of the PR and between dc86134 and 69c80c7.

📒 Files selected for processing (6)
  • .github/actions/setup-kind-cluster/action.yaml
  • .github/workflows/ci.yaml
  • e2e/k8s/scripts/install_nmp_e2e.sh
  • e2e/test_data_designer.py
  • e2e/test_jobs.py
  • plugins/nemo-data-designer/src/nemo_data_designer_plugin/jobs/create.py

Comment thread e2e/test_jobs.py
Comment on lines +380 to +381
_is_subprocess_mode = not os.environ.get("NMP_BASE_URL")
_skip_subprocess = pytest.mark.skipif(_is_subprocess_mode, reason="Requires container backend (set NMP_BASE_URL)")

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't infer backend mode from NMP_BASE_URL.

NMP_BASE_URL is the external-vs-local platform switch in e2e/conftest.py, not a subprocess-vs-container signal. Local Kind/Docker runs can still be container-backed with this unset, so this marker will skip these tests there. Gate on the actual backend/cluster config instead.
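
One way to read this suggestion is to derive the skip condition from an explicit backend setting rather than from NMP_BASE_URL. The sketch below is only an assumption about what such a check could look like: `NMP_E2E_JOB_BACKEND`, the backend names, and `is_subprocess_backend` are all hypothetical, since the real signal lives in e2e/conftest.py, which is not shown in this thread.

```python
from typing import Mapping

# Hypothetical variable name and values, used purely for illustration.
BACKEND_ENVVAR = "NMP_E2E_JOB_BACKEND"
CONTAINER_BACKENDS = frozenset({"docker", "kubernetes"})


def is_subprocess_backend(env: Mapping[str, str]) -> bool:
    """Return True unless the environment explicitly names a container backend.

    NMP_BASE_URL is deliberately not consulted here: per the review comment,
    it selects external vs. local platform, not the job backend.
    """
    backend = env.get(BACKEND_ENVVAR, "subprocess").strip().lower()
    return backend not in CONTAINER_BACKENDS
```

With a helper like this, the marker could become `pytest.mark.skipif(is_subprocess_backend(os.environ), reason="Requires container backend")`, so a local Kind or Docker run with NMP_BASE_URL unset would no longer be skipped. Treating an unset variable as subprocess mode is itself a design assumption of this sketch.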


github-actions Bot (Contributor) commented Jun 25, 2026

| Suite | Lines Covered | Line Rate | Branch Rate |
| --- | --- | --- | --- |
| Unit Tests | 21322/27924 | 76.4% | 61.4% |
| Integration Tests | 12350/26693 | 46.3% | 19.7% |

matthewgrossman and others added 3 commits June 25, 2026 12:40
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…iteration

All wait_for_job_logs calls now use a consistent 240s timeout to handle
K8s OTLP log batching latency (previously ranged 30-120s, causing flakes).

Temporarily pins kind-cpu-e2e to a known image tag to skip the ~10min
image build while iterating on test changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…pods

Two bugs caused ~10-20% of short-lived K8s job pods to lose their logs:

1. **Pipe read race**: cmd.Wait() was called before the stdout/stderr
   reader goroutines finished. Go's exec.Cmd.Wait() closes pipes on
   return, so the readers would get "file already closed" and miss the
   output entirely. Fixed by calling wg.Wait() before cmd.Wait().

2. **Async batch export**: The BatchProcessor queued log records and
   exported them asynchronously. ForceFlush triggered the export but
   returned before the HTTP request completed, and os.Exit killed the
   in-flight request. Switched to SimpleProcessor which exports each
   record synchronously — appropriate for the launcher's short-lived
   single-job use case.

Verified: 0/50 log misses on minikube (previously 4-7/30).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@services/core/jobs/jobs-launcher/cmd/run.go`:
- Around line 254-262: The current wait order in the command launcher can hang
when a child keeps stdout/stderr open, so adjust the flow in run.go around
cmd.Wait and wg.Wait. Wait for the process to exit first using cmd.Wait to
capture the main process exit code, then drain the output readers, and add a
bounded fallback so the launcher cannot block forever if EOF never arrives. Keep
the change localized to the launcher logic that coordinates cmd.Wait, wg.Wait,
and the log reader goroutines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e26e3ff-d1a1-4f92-b63c-58a8d9333ec7

📥 Commits

Reviewing files that changed from the base of the PR and between d9d8b37 and 8c13776.

📒 Files selected for processing (3)
  • services/core/jobs/jobs-launcher/cmd/otel.go
  • services/core/jobs/jobs-launcher/cmd/run.go
  • services/core/jobs/jobs-launcher/cmd/run_test.go

Comment on lines +254 to +262
// Wait for all output to be read before calling cmd.Wait().
// cmd.Wait() closes stdout/stderr pipes, so readers must finish first.
// With the synchronous log processor, each log record is fully exported
// (HTTP request completed) before slog.Log returns, so once the readers
// finish all logs have already been delivered to the server.
wg.Wait()

// Force flush the OTEL pipeline to ensure all batched logs are exported
// This is especially important for short-lived jobs that fail quickly, where the
// batch processor may still have pending logs that haven't been exported yet
if loggerProvider != nil {
	flushCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if flushErr := loggerProvider.ForceFlush(flushCtx); flushErr != nil {
		logger.Printf("Warning: failed to flush OTEL logs: %v\n", flushErr)
	}
}
// Now that all output has been read and exported, wait for the process to finish.
err = cmd.Wait()

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

sed -n '1,340p' services/core/jobs/jobs-launcher/cmd/run.go

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 8168


Avoid wg.Wait() before cmd.Wait(). If the command exits while a child keeps stdout/stderr open, EOF never arrives and the launcher hangs instead of returning the main process exit code. Wait for the process first, then drain output with a bounded fallback.
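
The ordering proposed here (wait for the process, then drain with a bounded fallback) can be sketched in Python as an analogue. This is not the launcher's Go code: `run_and_drain` and `drain_timeout_s` are hypothetical names, and Python's `Popen.wait()` does not close the pipes, so the sketch shows only the ordering and the bounded join, not the Go pipe semantics described in the commit message above.

```python
import subprocess
import sys
import threading
from typing import List, Sequence, Tuple


def run_and_drain(argv: Sequence[str], drain_timeout_s: float = 5.0) -> Tuple[int, List[str], bool]:
    """Run argv, collect its output lines, and never block forever on the reader."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines: List[str] = []

    def _read() -> None:
        # Ends at EOF, which arrives only once every holder of the pipe closes it.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))

    # Daemon thread: a stuck reader cannot keep the interpreter alive.
    reader = threading.Thread(target=_read, daemon=True)
    reader.start()

    exit_code = proc.wait()        # 1. capture the main process exit code first
    reader.join(drain_timeout_s)   # 2. drain output, but only for a bounded time
    drained = not reader.is_alive()
    return exit_code, list(lines), drained


if __name__ == "__main__":
    print(run_and_drain([sys.executable, "-c", "print('hello')"]))
```

If a grandchild process keeps the pipe open, `drained` comes back `False` after `drain_timeout_s` instead of the call hanging, and the exit code of the main process is still returned.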

