Skip to content

Conversation

@faganihajizada
Copy link
Contributor

Summary

This PR fixes a critical bug where IsNodeDrain() and IsNodeDrained() functions use fail-open error handling, causing immediate pod deletion and job termination when SLURM node lookups fail.

Problem:

When the operator cannot query a Slurm node (e.g., node name mismatch, REST API unavailable, or node not yet registered), the current implementation returns true (indicating the node is drained), which causes the operator to immediately delete pods and kill running jobs..

Querying a non-existent Slurm node returns HTTP 200:

{ "errors": [ { "description": "Failure to query node bad-node-name", "error_number": 0, "error": "No error", "source": "_dump_nodes" } ], "warnings": [], "meta": { "plugin": { "type": "openapi\/slurmctld", "name": "Slurm OpenAPI slurmctld", "data_parser": "data_parser\/v0.0.43", "accounting_storage": "" }, "client": { "source": "localhost:6820(fd:11)", "user": "root", "group": "root" }, "command": [], "slurm": { "version": { "major": "25", "micro": "3", "minor": "05" }, "release": "25.05.3", "cluster": "slurm" } } } HTTP_STATUS:200

Solution:

Changed to fail-closed behavior - return false and the error when node status cannot be reliably determined. This prevents premature pod deletion and allows the operator to retry until it can properly verify the node is drained.

This fix makes the failure mode safe: if we cannot verify drain status, we wait and retry rather than destroying potentially busy nodes.

Breaking Changes

N/A

Testing

Deployed operator to verify fail-closed behavior:

{"level":"error","ts":"2025-11-14T15:41:06+01:00",
 "msg":"encountered an error while reconciling request",
 "controller":"nodeset-controller",
 "controllerGroup":"slinky.slurm.net",
 "controllerKind":"NodeSet",
 "NodeSet":{"name":"slurm-worker-slinky","namespace":"slurm"},
 "error":"Not Found",
 "errorCauses":[{"error":"Not Found"}]}

@faganihajizada faganihajizada marked this pull request as ready for review November 14, 2025 14:58
Change IsNodeDrain() and IsNodeDrained() to fail-closed behavior.
Previously, these functions returned true (drained) when Slurm node
lookups failed, causing immediate pod deletion and job termination.

Now returns false and propagates errors, allowing the operator to
retry until it can properly verify drain status. This protects
running jobs from premature termination during transient failures
or node name mismatches.
@faganihajizada faganihajizada force-pushed the fix/node-drain-fail-closed branch from 4326ea1 to 319dd4c Compare November 14, 2025 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant