Skip to content

(3.6.0 - latest) Prolog hangs due to long GPU health check times on certain instance types #6777

@hehe7318

Description

@hehe7318

The Issue

AWS ParallelCluster 3.6.0 and later, when configured with GPU health checks, may experience delays and eventual "prolog hung" errors when using instance types like p4d and p5. These instances have a complex GPU topology, which results in lengthy diagnostic checks during the GPU health check process. The Prolog, which runs the check before job tasks are started, must complete on all allocated nodes. If one node's GPU health check takes too long, the entire job setup is delayed, causing the "prolog hung" error:

slurmstepd: error: Prolog hung on node xxx

Testing with g4dn and g6 instance types shows that this issue occurs rarely, as these instances have simpler GPU configurations that do not require such long diagnostic times.

Affected Versions

All ParallelCluster 3.6.0+ versions using Slurm scheduler on instance types such as p4d and p5 with GPU health checks enabled are affected. This issue may also occur with other instance types if the GPU health check takes too long.

Mitigation

You can find a detailed explanation and the mitigation of the problem. (3.6.0 ‐ latest) Prolog hangs due to long GPU health check times on certain instance types

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions