Description
What you would like to be added?
We'd like to add a PodGrouper plugin for JobSet workloads so that PodGroups for gang scheduling are created automatically. The plugin creates one PodGroup for each Job replica in a JobSet, named with the pattern pg-<jobset-name>-<replicatedjob-name>-<job-index>-<jobset-uid>, and sets MinAvailable from spec.replicatedJobs[].template.spec.parallelism. The implementation follows the same pattern as the existing PodGrouper plugins and comes with unit tests.
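To make the naming and MinAvailable rules concrete, here is a minimal Go sketch of how the PodGroups for one JobSet could be derived. It is not the actual plugin code: the `JobSet`, `ReplicatedJob`, and `PodGroupSpec` types and the helper functions are hypothetical, trimmed-down stand-ins, and falling back to 1 when parallelism is unset is an assumption based on the Kubernetes Job default.

```go
package main

import "fmt"

// ReplicatedJob is a hypothetical stand-in for one entry of
// spec.replicatedJobs in a JobSet.
type ReplicatedJob struct {
	Name        string
	Replicas    int32  // number of Job replicas created from the template
	Parallelism *int32 // template.spec.parallelism; nil when unset
}

// JobSet is a hypothetical stand-in holding only the fields used here.
type JobSet struct {
	Name           string
	UID            string
	ReplicatedJobs []ReplicatedJob
}

// PodGroupSpec is a hypothetical summary of the PodGroup to create.
type PodGroupSpec struct {
	Name         string
	MinAvailable int32
}

// podGroupName applies the pattern
// pg-<jobset-name>-<replicatedjob-name>-<job-index>-<jobset-uid>.
func podGroupName(jobSetName, replicatedJobName string, jobIndex int32, uid string) string {
	return fmt.Sprintf("pg-%s-%s-%d-%s", jobSetName, replicatedJobName, jobIndex, uid)
}

// minAvailable returns the parallelism of the Job template.
// Assumption: an unset parallelism is treated as 1, the Kubernetes Job default.
func minAvailable(parallelism *int32) int32 {
	if parallelism == nil {
		return 1
	}
	return *parallelism
}

// podGroupsFor lists one PodGroup per Job replica across all replicated jobs.
func podGroupsFor(js JobSet) []PodGroupSpec {
	var groups []PodGroupSpec
	for _, rj := range js.ReplicatedJobs {
		for i := int32(0); i < rj.Replicas; i++ {
			groups = append(groups, PodGroupSpec{
				Name:         podGroupName(js.Name, rj.Name, i, js.UID),
				MinAvailable: minAvailable(rj.Parallelism),
			})
		}
	}
	return groups
}

func main() {
	workers := int32(4)
	js := JobSet{
		Name: "train",
		UID:  "abc123",
		ReplicatedJobs: []ReplicatedJob{
			{Name: "workers", Replicas: 2, Parallelism: &workers},
		},
	}
	for _, pg := range podGroupsFor(js) {
		fmt.Printf("%s minAvailable=%d\n", pg.Name, pg.MinAvailable)
	}
}
```

Because each Job replica gets its own PodGroup, a replica can be gang-scheduled as soon as its own pods fit, independently of the other replicas in the JobSet.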
Why is this needed?
JobSet is commonly used for distributed training workloads, but KAI Scheduler's PodGrouper does not currently support it. Users have to create PodGroups manually or fall back to the default grouper, which doesn't handle JobSet's gang scheduling needs well. Creating a separate PodGroup per Job replica enables automatic gang scheduling for JobSet workloads and avoids sequencing deadlocks, so users get KAI Scheduler's gang scheduling without managing PodGroups themselves.
cc @romanbaron