Description
The Issue
AWS ParallelCluster 3.6.0 and later, when configured with GPU health checks, may experience delays and eventual "prolog hung" errors when using instance types like p4d and p5. These instances have a complex GPU topology, which results in lengthy diagnostic checks during the GPU health check process. The Prolog, which runs the check before job tasks are started, must complete on all allocated nodes. If one node's GPU health check takes too long, the entire job setup is delayed, causing the "prolog hung" error:
slurmstepd: error: Prolog hung on node xxx
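One way to give long GPU diagnostics more headroom is to raise Slurm's `PrologEpilogTimeout` (the time `slurmctld` waits for Prolog/Epilog scripts before treating them as hung). Since ParallelCluster generates `slurm.conf` itself, such overrides would typically go through the `CustomSlurmSettings` section of the cluster configuration (available in 3.6.0+). The sketch below is illustrative, not the project's prescribed fix; the 1800-second value is an assumed example:

```yaml
# Cluster config sketch (assumption: a 30-minute Prolog/Epilog timeout
# is enough for the GPU health check on p4d/p5 instances).
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      # Seconds slurmctld waits for Prolog/Epilog before declaring them hung
      - PrologEpilogTimeout: 1800
```

Whether raising the timeout is preferable to disabling the check depends on how long the diagnostics actually run on the affected instance types.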
Testing with g4dn and g6 instance types shows that this issue occurs rarely, as these instances have simpler GPU configurations that do not require such long diagnostic times.
Affected Versions
All ParallelCluster 3.6.0+ versions using the Slurm scheduler on instance types such as p4d and p5 with GPU health checks enabled are affected. The issue may also occur on other instance types if the GPU health check takes long enough to exceed the Prolog timeout.
Mitigation
A detailed explanation and the mitigation for this problem are documented in the wiki page: (3.6.0 ‐ latest) Prolog hangs due to long GPU health check times on certain instance types
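If the health check is not needed for a given queue, it can also be switched off in the cluster configuration. The sketch below assumes the queue-level `HealthChecks/Gpu/Enabled` setting from the ParallelCluster configuration reference; the queue and compute resource names are placeholders:

```yaml
# Cluster config sketch: disable the GPU health check for one queue.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue          # placeholder queue name
      HealthChecks:
        Gpu:
          Enabled: false       # skip the pre-job GPU diagnostic on this queue
      ComputeResources:
        - Name: p4d-nodes      # placeholder compute resource name
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 4
```

Disabling the check trades the "prolog hung" delay for losing the pre-job GPU validation, so consult the wiki page above before choosing this route.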