fix: validate and adapt data designer + jobs e2e tests for K8s by matthewgrossman · Pull Request #470 · NVIDIA-NeMo/nemo-platform

matthewgrossman · 2026-06-25T19:24:29Z

Summary

- Fix data designer jobs on K8s: Added NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH to the data designer plugin's job step environment. Without this, the K8s backend never creates the PVC mount, causing the bridge to crash with KeyError. Every other plugin (evaluator, agents, anonymizer) already did this.
- E2E test K8s compatibility: Fixed test_job_passing_data_between_steps (same env var issue), made container-backend tests conditional instead of unconditionally skipped, increased timeouts for K8s OTLP batching and artifact download.
- New CI job kind-cpu-e2e: Runs the full test_jobs.py + test_data_designer.py suite against a Kind cluster on every PR.
- Extracted setup-kind-cluster composite action: Deduplicates ~165 lines of Kind cluster setup shared between kind-cpu-smoke and the new kind-cpu-e2e job.
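
A minimal sketch of the first fix, for readers who have not seen the plugin code. It assumes that `PERSISTENT_JOB_STORAGE_PATH_ENVVAR` holds the name `NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH`; the default path value and both helper functions are hypothetical and are not taken from the repository.

```python
# Assumption: the constant name comes from the walkthrough, the variable
# name from the PR summary. The path value is a placeholder.
PERSISTENT_JOB_STORAGE_PATH_ENVVAR = "NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH"
DEFAULT_JOB_STORAGE_PATH = "/tmp/nemo-job-storage"  # hypothetical value


def build_step_environment(extra=None):
    """Return a step environment that always carries the storage path.

    Hypothetical helper: it stands in for whatever the plugin does when it
    compiles a job step.
    """
    env = {PERSISTENT_JOB_STORAGE_PATH_ENVVAR: DEFAULT_JOB_STORAGE_PATH}
    env.update(extra or {})
    return env


def resolve_storage_path(environ):
    """Read the storage path the way a strict consumer would.

    Direct indexing raises KeyError when the variable is missing, which is
    the failure mode the PR summary describes.
    """
    return environ[PERSISTENT_JOB_STORAGE_PATH_ENVVAR]
```

With the variable present in the step environment, `resolve_storage_path(build_step_environment())` returns the default path; with an empty mapping it raises `KeyError`.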

Test plan

- All 14 tests pass against minikube locally (1 skipped: additional_volumes, 1 deselected: nemotron_personas)
- All 8 tests pass in subprocess mode (no regressions)
- Data designer unit tests pass (11/11)
- kind-cpu-smoke CI job passes
- kind-cpu-e2e CI job passes
- python-e2e-test CI job passes (subprocess mode)

Closes AIRCORE-844

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added a reusable GitHub composite action to provision a local Kind Kubernetes cluster and deploy NeMo Platform for E2E/CPU runs.
- Added an E2E installer entrypoint for minikube.
- Wired persistent job storage path into job execution for Kubernetes runs.
Bug Fixes
- Improved job-launcher execution/log handling for more reliable telemetry and subprocess output draining.
- Hardened E2E tests (timeouts, log checks, skip logic, and pause/resume expectations).
CI Updates
- Refactored Kind smoke/e2e workflows to use the shared action, updated report/artifact naming, and extended CI gating to include the E2E phase.

Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>

coderabbitai · 2026-06-25T19:32:25Z

Walkthrough

The PR adds a reusable Kind setup action, rewires CI to use it, and updates E2E scripts and tests for persistent storage, skip handling, and longer timeouts. It also changes jobs-launcher OTEL exporting and subprocess wait ordering.

Changes

Kind CI setup

Layer / File(s)	Summary
Action interface and tooling `.github/actions/setup-kind-cluster/action.yaml`	The composite action declares cluster, namespace, gateway, registry, tag, Helm values, pull-credential, and NGC inputs, exposes `cluster_url`, and installs disk cleanup, kind, kubectl, Helm, and uv.
Cluster bootstrap and readiness `.github/actions/setup-kind-cluster/action.yaml`	The action starts the Kind cluster with `setup_local_kind_cpu.sh`, sets the kubectl namespace, verifies Gateway API CRDs and gateway resources, and pre-pulls GHCR images.
Helm install and API wait `.github/actions/setup-kind-cluster/action.yaml`	The action runs `install_helm_e2e.sh` with registry/tag and image defaults, prints helm/kubectl diagnostics on failure, writes `cluster_url` to `GITHUB_OUTPUT`, and waits for `/cluster-info` to respond.
CI workflow wiring `.github/workflows/ci.yaml`	The Kind CPU jobs use `setup-kind-cluster`, set `KIND_CLUSTER_NAME`, run pytest against `NMP_E2E_CLUSTER_URL`, rename Kubernetes artifacts and JUnit output to e2e naming, and add `kind-cpu-e2e` to `ci-status`.

E2E harness updates

Layer / File(s)	Summary
Local E2E installer defaults `e2e/k8s/scripts/install_nmp_e2e.sh`	The new script sets default namespace, release name, Helm values, registry/tag, and image variables, then `exec`s `install_helm_e2e.sh`.
Persistent job storage env `e2e/test_jobs.py`, `plugins/nemo-data-designer/src/.../jobs/create.py`	`CreateJob.compile()` adds `PERSISTENT_JOB_STORAGE_PATH_ENVVAR=DEFAULT_JOB_STORAGE_PATH` to the CPU step, and the between-steps e2e test passes the same path into both step environments.
Job test skips and assertions `e2e/test_jobs.py`	The job tests add subprocess-aware skipping from `NMP_BASE_URL`, extend expected-failure log polling, cancel after pause/resume verification, and update additional-volume and invalid-image skips.
Data designer timeout `e2e/test_data_designer.py`	The data-designer tests add a 600-second pytest timeout marker and make artifact download polling use a configurable timeout parameter.

Jobs launcher OTEL and exec flow

Layer / File(s)	Summary
OTEL logger provider `services/core/jobs/jobs-launcher/cmd/otel.go`, `services/core/jobs/jobs-launcher/cmd/run.go`	The launcher switches the OTEL log processor from batch to simple exporting and removes the logger-provider flush path from `runExec`.
Exec wait and output drain `services/core/jobs/jobs-launcher/cmd/run.go`	`runExec` drops the logger-provider parameter, waits for stdout/stderr readers before `cmd.Wait()`, and removes the prior post-wait flush and drain ordering.
Launcher test updates `services/core/jobs/jobs-launcher/cmd/run_test.go`	The `runExec` call sites in the launcher tests are updated to match the new signature with the trailing `nil` removed.

Sequence Diagram(s)

sequenceDiagram
  participant ciYaml as .github/workflows/ci.yaml
  participant setupKindCluster as .github/actions/setup-kind-cluster/action.yaml
  participant setupLocalKindCpu as e2e/k8s/scripts/setup_local_kind_cpu.sh
  participant installHelmE2e as e2e/k8s/scripts/install_helm_e2e.sh
  participant waitForApi as e2e/k8s/scripts/wait_for_api.sh
  ciYaml->>setupKindCluster: invoke reusable action
  setupKindCluster->>setupLocalKindCpu: start Kind cluster
  setupKindCluster->>installHelmE2e: install NeMo Platform
  setupKindCluster->>waitForApi: wait for /cluster-info
  waitForApi-->>setupKindCluster: API healthy
  setupKindCluster-->>ciYaml: cluster_url output

Possibly related PRs

NVIDIA-NeMo/nemo-platform#179: Directly changes e2e/test_jobs.py with overlapping job execution, storage, and log-checking behavior.
NVIDIA-NeMo/nemo-platform#345: Refactors the same Kind-based CI flow in ci.yaml with kind setup, image pre-pulls, Helm install, and readiness checks.

Suggested labels

ci

Suggested reviewers

mckornfield
crookedstorm

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 73.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is specific and matches the main K8s e2e test compatibility changes, even if it omits the CI and launcher updates.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@e2e/test_jobs.py`:
- Around line 380-381: The subprocess skip marker in test_jobs.py is using
NMP_BASE_URL as a proxy for backend mode, but that setting is only for
external-vs-local platform selection. Update the _is_subprocess_mode /
_skip_subprocess logic to check the real backend or cluster configuration used
by the test setup in e2e/conftest.py, so container-backed Kind/Docker runs are
not incorrectly skipped. Use the existing test configuration symbols around the
subprocess/container backend selection to locate the right signal.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3279eb63-e499-4810-912c-982644780ea6

📥 Commits

Reviewing files that changed from the base of the PR and between dc86134 and 69c80c7.

📒 Files selected for processing (6)

.github/actions/setup-kind-cluster/action.yaml
.github/workflows/ci.yaml
e2e/k8s/scripts/install_nmp_e2e.sh
e2e/test_data_designer.py
e2e/test_jobs.py
plugins/nemo-data-designer/src/nemo_data_designer_plugin/jobs/create.py

coderabbitai · 2026-06-25T19:32:29Z

+_is_subprocess_mode = not os.environ.get("NMP_BASE_URL")
+_skip_subprocess = pytest.mark.skipif(_is_subprocess_mode, reason="Requires container backend (set NMP_BASE_URL)")


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't infer backend mode from NMP_BASE_URL.

NMP_BASE_URL is the external-vs-local platform switch in e2e/conftest.py, not a subprocess-vs-container signal. Local Kind/Docker runs can still be container-backed with this unset, so this marker will skip these tests there. Gate on the actual backend/cluster config instead.
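
One way the suggested gating could look, as a sketch only. The variable `NMP_E2E_JOB_BACKEND` and its values are hypothetical; the real signal would have to come from whatever `e2e/conftest.py` uses to select the backend, which this thread does not show. The fallback keeps the PR's current heuristic when no explicit backend is given.

```python
# Hypothetical variable name and values; not taken from the repository.
BACKEND_ENVVAR = "NMP_E2E_JOB_BACKEND"


def is_subprocess_mode(environ):
    """Decide whether the tests run against the subprocess backend.

    An explicit backend setting wins. Only when it is absent does the
    function fall back to the NMP_BASE_URL heuristic from the PR.
    """
    backend = environ.get(BACKEND_ENVVAR, "").strip().lower()
    if backend:
        return backend == "subprocess"
    return not environ.get("NMP_BASE_URL")


# In the test module the marker could then be built as:
#   _skip_subprocess = pytest.mark.skipif(
#       is_subprocess_mode(os.environ),
#       reason="Requires container backend",
#   )
```

With this shape, a local Kind or Docker run can opt in to the container-backed tests by setting the explicit variable, even when `NMP_BASE_URL` is unset.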

github-actions · 2026-06-25T19:35:43Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	21322/27924	76.4%	61.4%
Integration Tests	12350/26693	46.3%	19.7%

Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>

…iteration

All wait_for_job_logs calls now use a consistent 240s timeout to handle K8s OTLP log batching latency (previously ranged 30-120s, causing flakes).

Temporarily pins kind-cpu-e2e to a known image tag to skip the ~10min image build while iterating on test changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
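
The single shared timeout described in this commit can be pictured with a small polling helper. This is a sketch under stated assumptions: `poll_until` and its parameters are hypothetical and do not reproduce the signature of `wait_for_job_logs`; only the 240-second value comes from the commit message.

```python
import time

LOG_WAIT_TIMEOUT_S = 240.0  # value from the commit message


def poll_until(fetch, predicate, timeout_s=LOG_WAIT_TIMEOUT_S, interval_s=2.0):
    """Call fetch() until predicate(result) is true or the deadline passes.

    Hypothetical helper: every caller shares one default timeout instead of
    choosing its own, which is the consistency the commit is after.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        value = fetch()
        if predicate(value):
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)
```

A caller would pass a function that fetches the current job logs and a predicate that checks for the expected line, and would override `timeout_s` only for a documented reason.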

…pods

Two bugs caused ~10-20% of short-lived K8s job pods to lose their logs:

1. **Pipe read race**: cmd.Wait() was called before the stdout/stderr reader goroutines finished. Go's exec.Cmd.Wait() closes pipes on return, so the readers would get "file already closed" and miss the output entirely. Fixed by calling wg.Wait() before cmd.Wait().
2. **Async batch export**: The BatchProcessor queued log records and exported them asynchronously. ForceFlush triggered the export but returned before the HTTP request completed, and os.Exit killed the in-flight request. Switched to SimpleProcessor, which exports each record synchronously; this is appropriate for the launcher's short-lived single-job use case.

Verified: 0/50 log misses on minikube (previously 4-7/30).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@services/core/jobs/jobs-launcher/cmd/run.go`:
- Around line 254-262: The current wait order in the command launcher can hang
when a child keeps stdout/stderr open, so adjust the flow in run.go around
cmd.Wait and wg.Wait. Wait for the process to exit first using cmd.Wait to
capture the main process exit code, then drain the output readers, and add a
bounded fallback so the launcher cannot block forever if EOF never arrives. Keep
the change localized to the launcher logic that coordinates cmd.Wait, wg.Wait,
and the log reader goroutines.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e26e3ff-d1a1-4f92-b63c-58a8d9333ec7

📥 Commits

Reviewing files that changed from the base of the PR and between d9d8b37 and 8c13776.

📒 Files selected for processing (3)

services/core/jobs/jobs-launcher/cmd/otel.go
services/core/jobs/jobs-launcher/cmd/run.go
services/core/jobs/jobs-launcher/cmd/run_test.go

coderabbitai · 2026-06-26T04:56:41Z

+	// Wait for all output to be read before calling cmd.Wait().
+	// cmd.Wait() closes stdout/stderr pipes, so readers must finish first.
+	// With the synchronous log processor, each log record is fully exported
+	// (HTTP request completed) before slog.Log returns, so once the readers
+	// finish all logs have already been delivered to the server.
 	wg.Wait()

-	// Force flush the OTEL pipeline to ensure all batched logs are exported
-	// This is especially important for short-lived jobs that fail quickly, where the
-	// batch processor may still have pending logs that haven't been exported yet
-	if loggerProvider != nil {
-		flushCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
-		defer cancel()
-		if flushErr := loggerProvider.ForceFlush(flushCtx); flushErr != nil {
-			logger.Printf("Warning: failed to flush OTEL logs: %v\n", flushErr)
-		}
-	}
+	// Now that all output has been read and exported, wait for the process to finish.
+	err = cmd.Wait()


🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

sed -n '1,340p' services/core/jobs/jobs-launcher/cmd/run.go

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 8168

Avoid wg.Wait() before cmd.Wait(). If the command exits while a child keeps stdout/stderr open, EOF never arrives and the launcher hangs instead of returning the main process exit code. Wait for the process first, then drain output with a bounded fallback.
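
The ordering proposed here (wait for the process, then drain output with a bounded fallback) can be illustrated with a Python analogue. This is not the launcher's Go code, and `run_and_drain` is a hypothetical name. Note one difference: Python's `Popen.wait()` does not close the pipes, so this sketch shows only the bounded drain and does not reproduce the Go pipe-close race described in the commit message.

```python
import subprocess
import sys
import threading


def run_and_drain(argv, drain_timeout_s=5.0):
    """Run argv, collect its output, and never block forever on the drain.

    Returns (exit_code, lines, drained). drained is False when the reader
    was still blocked after the timeout, for example because a grandchild
    kept the pipe open.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []

    def reader():
        # Iterating the pipe ends only at EOF, i.e. when every holder of
        # the write end has closed it.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    exit_code = proc.wait()               # main process exit code first
    thread.join(timeout=drain_timeout_s)  # bounded wait for the reader
    drained = not thread.is_alive()
    if drained:
        proc.stdout.close()
    return exit_code, list(lines), drained


if __name__ == "__main__":
    print(run_and_drain([sys.executable, "-c", "print('hello')"]))
```

The trade-off is that a timed-out drain may drop trailing output, so the timeout should be generous relative to expected log volume.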

fixes

69c80c7

Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>

matthewgrossman requested review from a team as code owners June 25, 2026 19:24

github-actions Bot added the fix label Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

matthewgrossman and others added 3 commits June 25, 2026 12:40

use temp images

6ef146d

Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>

coderabbitai Bot reviewed Jun 26, 2026

matthewgrossman wants to merge 4 commits into main from mgrossman/aircore-844-validate-and-adapt-data-designer-e2e-tests-for-minikubek8s
