Description
The Issue
AWS ParallelCluster 3.6.0 and later, when configured with GPU health checks, may experience delays and eventual "prolog hung" errors when using instance types like p4d and p5. These instances have a complex GPU topology, which results in lengthy diagnostic checks during the GPU health check process. The Prolog, which runs the check before job tasks are started, must complete on all allocated nodes. If one node's GPU health check takes too long, the entire job setup is delayed, causing the "prolog hung" error:
slurmstepd: error: Prolog hung on node xxx
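One way to give long GPU diagnostics more headroom is to raise Slurm's `PrologEpilogTimeout` (the time `slurmctld` waits for Prolog/Epilog scripts before treating them as hung). Since ParallelCluster generates `slurm.conf` itself, such overrides would typically go through the `CustomSlurmSettings` section of the cluster configuration (available in 3.6.0+). The sketch below is illustrative, not the project's prescribed fix; the 1800-second value is an assumed example:

```yaml
# Cluster config sketch (assumption: a 30-minute Prolog/Epilog timeout
# is enough for the GPU health check on p4d/p5 instances).
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      # Seconds slurmctld waits for Prolog/Epilog before declaring them hung
      - PrologEpilogTimeout: 1800
```

Whether raising the timeout is preferable to disabling the check depends on how long the diagnostics actually run on the affected instance types.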
Testing with g4dn and g6 instance types shows that this issue occurs rarely, as these instances have simpler GPU configurations that do not require such long diagnostic times.
Affected Versions
All ParallelCluster 3.6.0+ versions using the Slurm scheduler on instance types such as p4d and p5 with GPU health checks enabled are affected. The issue may also occur on other instance types if the GPU health check takes long enough to exceed the Prolog timeout.
Mitigation
A detailed explanation and the mitigation for this problem are documented in the wiki page: (3.6.0 ‐ latest) Prolog hangs due to long GPU health check times on certain instance types
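If the health check is not needed for a given queue, it can also be switched off in the cluster configuration. The sketch below assumes the queue-level `HealthChecks/Gpu/Enabled` setting from the ParallelCluster configuration reference; the queue and compute resource names are placeholders:

```yaml
# Cluster config sketch: disable the GPU health check for one queue.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue          # placeholder queue name
      HealthChecks:
        Gpu:
          Enabled: false       # skip the pre-job GPU diagnostic on this queue
      ComputeResources:
        - Name: p4d-nodes      # placeholder compute resource name
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 4
```

Disabling the check trades the "prolog hung" delay for losing the pre-job GPU validation, so consult the wiki page above before choosing this route.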