Prevent job termination on Slurm node lookup failures #79
+67
−11
Summary
This PR fixes a critical bug where the IsNodeDrain() and IsNodeDrained() functions use fail-open error handling, causing immediate pod deletion and job termination when Slurm node lookups fail.
Problem:
When the operator cannot query a Slurm node (e.g., node name mismatch, REST API unavailable, or node not yet registered), the current implementation returns true (indicating the node is drained), which causes the operator to immediately delete pods and kill running jobs.
Querying a non-existent Slurm node returns HTTP 200.
Solution:
Changed to fail-closed behavior: return false and the error when the node status cannot be reliably determined. This prevents premature pod deletion and allows the operator to retry until it can properly verify that the node is drained.
This fix makes the failure mode safe: if we cannot verify drain status, we wait and retry rather than destroying potentially busy nodes.
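The fail-closed pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the operator's actual code: the nodeLookup interface, the stubLookup type, the errNodeNotFound value, and the isNodeDrained helper are all hypothetical names, and treating a node as drained when it carries both the DRAIN and IDLE states is a simplifying assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeLookup is a hypothetical stand-in for the operator's Slurm client.
type nodeLookup interface {
	// NodeStates returns the state flags of a Slurm node,
	// or an error when the node cannot be queried.
	NodeStates(name string) ([]string, error)
}

// errNodeNotFound is a hypothetical error for a node the lookup cannot find.
var errNodeNotFound = errors.New("Not Found")

// isNodeDrained reports whether the node is verifiably drained.
// It fails closed: any lookup error yields (false, err), so the caller
// retries instead of deleting pods on a node that may still run jobs.
func isNodeDrained(c nodeLookup, name string) (bool, error) {
	states, err := c.NodeStates(name)
	if err != nil {
		// The old fail-open behavior returned true here.
		return false, fmt.Errorf("get Slurm node %q: %w", name, err)
	}
	// Assumption: drained means flagged DRAIN and no longer running work (IDLE).
	hasDrain, hasIdle := false, false
	for _, s := range states {
		switch s {
		case "DRAIN":
			hasDrain = true
		case "IDLE":
			hasIdle = true
		}
	}
	return hasDrain && hasIdle, nil
}

// stubLookup is an in-memory nodeLookup used only for illustration.
type stubLookup struct {
	states map[string][]string
}

// NodeStates returns the stored states, or errNodeNotFound for unknown nodes.
func (s stubLookup) NodeStates(name string) ([]string, error) {
	st, ok := s.states[name]
	if !ok {
		return nil, errNodeNotFound
	}
	return st, nil
}
```

A caller such as a reconcile loop would return the error and requeue when isNodeDrained reports one, and would delete the pod only when the function returns true with a nil error.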
Breaking Changes
N/A
Testing
Deployed operator to verify fail-closed behavior:
{"level":"error","ts":"2025-11-14T15:41:06+01:00", "msg":"encountered an error while reconciling request", "controller":"nodeset-controller", "controllerGroup":"slinky.slurm.net", "controllerKind":"NodeSet", "NodeSet":{"name":"slurm-worker-slinky","namespace":"slurm"}, "error":"Not Found", "errorCauses":[{"error":"Not Found"}]}