fix: validate and adapt data designer + jobs e2e tests for K8s#470
Conversation
📝 WalkthroughWalkthroughThe PR adds a reusable Kind setup action, rewires CI to use it, and updates E2E scripts and tests for persistent storage, skip handling, and longer timeouts. It also changes jobs-launcher OTEL exporting and subprocess wait ordering. ChangesKind CI setup
E2E harness updates
Jobs launcher OTEL and exec flow
Sequence Diagram(s)sequenceDiagram
participant ciYaml as .github/workflows/ci.yaml
participant setupKindCluster as .github/actions/setup-kind-cluster/action.yaml
participant setupLocalKindCpu as e2e/k8s/scripts/setup_local_kind_cpu.sh
participant installHelmE2e as e2e/k8s/scripts/install_helm_e2e.sh
participant waitForApi as e2e/k8s/scripts/wait_for_api.sh
ciYaml->>setupKindCluster: invoke reusable action
setupKindCluster->>setupLocalKindCpu: start Kind cluster
setupKindCluster->>installHelmE2e: install NeMo Platform
setupKindCluster->>waitForApi: wait for /cluster-info
waitForApi-->>setupKindCluster: API healthy
setupKindCluster-->>ciYaml: cluster_url output
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@e2e/test_jobs.py`:
- Around line 380-381: The subprocess skip marker in test_jobs.py is using
NMP_BASE_URL as a proxy for backend mode, but that setting is only for
external-vs-local platform selection. Update the _is_subprocess_mode /
_skip_subprocess logic to check the real backend or cluster configuration used
by the test setup in e2e/conftest.py, so container-backed Kind/Docker runs are
not incorrectly skipped. Use the existing test configuration symbols around the
subprocess/container backend selection to locate the right signal.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3279eb63-e499-4810-912c-982644780ea6
📒 Files selected for processing (6)
.github/actions/setup-kind-cluster/action.yaml.github/workflows/ci.yamle2e/k8s/scripts/install_nmp_e2e.she2e/test_data_designer.pye2e/test_jobs.pyplugins/nemo-data-designer/src/nemo_data_designer_plugin/jobs/create.py
| _is_subprocess_mode = not os.environ.get("NMP_BASE_URL") | ||
| _skip_subprocess = pytest.mark.skipif(_is_subprocess_mode, reason="Requires container backend (set NMP_BASE_URL)") |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Don't infer backend mode from NMP_BASE_URL.
NMP_BASE_URL is the external-vs-local platform switch in e2e/conftest.py, not a subprocess-vs-container signal. Local Kind/Docker runs can still be container-backed with this unset, so this marker will skip these tests there. Gate on the actual backend/cluster config instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@e2e/test_jobs.py` around lines 380 - 381, The subprocess skip marker in
test_jobs.py is using NMP_BASE_URL as a proxy for backend mode, but that setting
is only for external-vs-local platform selection. Update the _is_subprocess_mode
/ _skip_subprocess logic to check the real backend or cluster configuration used
by the test setup in e2e/conftest.py, so container-backed Kind/Docker runs are
not incorrectly skipped. Use the existing test configuration symbols around the
subprocess/container backend selection to locate the right signal.
|
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…iteration All wait_for_job_logs calls now use a consistent 240s timeout to handle K8s OTLP log batching latency (previously ranged 30-120s, causing flakes). Temporarily pins kind-cpu-e2e to a known image tag to skip the ~10min image build while iterating on test changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…pods Two bugs caused ~10-20% of short-lived K8s job pods to lose their logs: 1. **Pipe read race**: cmd.Wait() was called before the stdout/stderr reader goroutines finished. Go's exec.Cmd.Wait() closes pipes on return, so the readers would get "file already closed" and miss the output entirely. Fixed by calling wg.Wait() before cmd.Wait(). 2. **Async batch export**: The BatchProcessor queued log records and exported them asynchronously. ForceFlush triggered the export but returned before the HTTP request completed, and os.Exit killed the in-flight request. Switched to SimpleProcessor which exports each record synchronously — appropriate for the launcher's short-lived single-job use case. Verified: 0/50 log misses on minikube (previously 4-7/30). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@services/core/jobs/jobs-launcher/cmd/run.go`:
- Around line 254-262: The current wait order in the command launcher can hang
when a child keeps stdout/stderr open, so adjust the flow in run.go around
cmd.Wait and wg.Wait. Wait for the process to exit first using cmd.Wait to
capture the main process exit code, then drain the output readers, and add a
bounded fallback so the launcher cannot block forever if EOF never arrives. Keep
the change localized to the launcher logic that coordinates cmd.Wait, wg.Wait,
and the log reader goroutines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8e26e3ff-d1a1-4f92-b63c-58a8d9333ec7
📒 Files selected for processing (3)
services/core/jobs/jobs-launcher/cmd/otel.goservices/core/jobs/jobs-launcher/cmd/run.goservices/core/jobs/jobs-launcher/cmd/run_test.go
| // Wait for all output to be read before calling cmd.Wait(). | ||
| // cmd.Wait() closes stdout/stderr pipes, so readers must finish first. | ||
| // With the synchronous log processor, each log record is fully exported | ||
| // (HTTP request completed) before slog.Log returns, so once the readers | ||
| // finish all logs have already been delivered to the server. | ||
| wg.Wait() | ||
|
|
||
| // Force flush the OTEL pipeline to ensure all batched logs are exported | ||
| // This is especially important for short-lived jobs that fail quickly, where the | ||
| // batch processor may still have pending logs that haven't been exported yet | ||
| if loggerProvider != nil { | ||
| flushCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) | ||
| defer cancel() | ||
| if flushErr := loggerProvider.ForceFlush(flushCtx); flushErr != nil { | ||
| logger.Printf("Warning: failed to flush OTEL logs: %v\n", flushErr) | ||
| } | ||
| } | ||
| // Now that all output has been read and exported, wait for the process to finish. | ||
| err = cmd.Wait() |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
sed -n '1,340p' services/core/jobs/jobs-launcher/cmd/run.goRepository: NVIDIA-NeMo/nemo-platform
Length of output: 8168
🏁 Script executed:
sed -n '1,340p' services/core/jobs/jobs-launcher/cmd/run.goRepository: NVIDIA-NeMo/nemo-platform
Length of output: 8168
Avoid wg.Wait() before cmd.Wait(). If the command exits while a child keeps stdout/stderr open, EOF never arrives and the launcher hangs instead of returning the main process exit code. Wait for the process first, then drain output with a bounded fallback.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/core/jobs/jobs-launcher/cmd/run.go` around lines 254 - 262, The
current wait order in the command launcher can hang when a child keeps
stdout/stderr open, so adjust the flow in run.go around cmd.Wait and wg.Wait.
Wait for the process to exit first using cmd.Wait to capture the main process
exit code, then drain the output readers, and add a bounded fallback so the
launcher cannot block forever if EOF never arrives. Keep the change localized to
the launcher logic that coordinates cmd.Wait, wg.Wait, and the log reader
goroutines.
Summary
NEMO_JOB_PERSISTENT_JOB_STORAGE_PATHto the data designer plugin's job step environment. Without this, the K8s backend never creates the PVC mount, causing the bridge to crash withKeyError. Every other plugin (evaluator, agents, anonymizer) already did this.test_job_passing_data_between_steps(same env var issue), made container-backend tests conditional instead of unconditionally skipped, increased timeouts for K8s OTLP batching and artifact download.kind-cpu-e2e: Runs the fulltest_jobs.py+test_data_designer.pysuite against a Kind cluster on every PR.setup-kind-clustercomposite action: Deduplicates ~165 lines of Kind cluster setup shared betweenkind-cpu-smokeand the newkind-cpu-e2ejob.Test plan
kind-cpu-smokeCI job passeskind-cpu-e2eCI job passespython-e2e-testCI job passes (subprocess mode)Closes AIRCORE-844
🤖 Generated with Claude Code
Summary by CodeRabbit