Prevent job termination on Slurm node lookup failures #79

faganihajizada · 2025-11-14T14:55:32Z

Summary

This PR fixes a critical bug where IsNodeDrain() and IsNodeDrained() functions use fail-open error handling, causing immediate pod deletion and job termination when SLURM node lookups fail.

Problem:

When the operator cannot query a Slurm node (e.g., node name mismatch, REST API unavailable, or node not yet registered), the current implementation returns true (indicating the node is drained), which causes the operator to immediately delete pods and kill running jobs..

Querying a non-existent Slurm node returns HTTP 200:

{ "errors": [ { "description": "Failure to query node bad-node-name", "error_number": 0, "error": "No error", "source": "_dump_nodes" } ], "warnings": [], "meta": { "plugin": { "type": "openapi\/slurmctld", "name": "Slurm OpenAPI slurmctld", "data_parser": "data_parser\/v0.0.43", "accounting_storage": "" }, "client": { "source": "localhost:6820(fd:11)", "user": "root", "group": "root" }, "command": [], "slurm": { "version": { "major": "25", "micro": "3", "minor": "05" }, "release": "25.05.3", "cluster": "slurm" } } } HTTP_STATUS:200

Solution:

Changed to fail-closed behavior - return false and the error when node status cannot be reliably determined. This prevents premature pod deletion and allows the operator to retry until it can properly verify the node is drained.

This fix makes the failure mode safe: if we cannot verify drain status, we wait and retry rather than destroying potentially busy nodes.

Breaking Changes

N/A

Testing

Deployed operator to verify fail-closed behavior:

{"level":"error","ts":"2025-11-14T15:41:06+01:00",
 "msg":"encountered an error while reconciling request",
 "controller":"nodeset-controller",
 "controllerGroup":"slinky.slurm.net",
 "controllerKind":"NodeSet",
 "NodeSet":{"name":"slurm-worker-slinky","namespace":"slurm"},
 "error":"Not Found",
 "errorCauses":[{"error":"Not Found"}]}

Change IsNodeDrain() and IsNodeDrained() to fail-closed behavior. Previously, these functions returned true (drained) when Slurm node lookups failed, causing immediate pod deletion and job termination. Now returns false and propagates errors, allowing the operator to retry until it can properly verify drain status. This protects running jobs from premature termination during transient failures or node name mismatches.

faganihajizada marked this pull request as ready for review November 14, 2025 14:58

faganihajizada force-pushed the fix/node-drain-fail-closed branch from 4326ea1 to 319dd4c Compare November 14, 2025 15:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prevent job termination on Slurm node lookup failures #79

Prevent job termination on Slurm node lookup failures #79

faganihajizada commented Nov 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Prevent job termination on Slurm node lookup failures #79

Are you sure you want to change the base?

Prevent job termination on Slurm node lookup failures #79

Conversation

faganihajizada commented Nov 14, 2025

Summary

Breaking Changes

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant