Add JobSet Support to PodGrouper #763

@rich7420

What you would like to be added?

Add a PodGrouper plugin for JobSet workloads so that PodGroups are created automatically for gang scheduling. The plugin creates one PodGroup per Job replica in a JobSet, named `pg-<jobset-name>-<replicatedjob-name>-<job-index>-<jobset-uid>`, and sets MinAvailable from `spec.replicatedJobs[].template.spec.parallelism`. The implementation follows the same pattern as the existing PodGrouper plugins and includes unit tests.
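To make the naming and MinAvailable rules concrete, here is a minimal sketch in Go. It is not the proposed plugin code: `replicatedJob`, `podGroupSpec`, `podGroupName`, `minAvailable`, and `podGroupsFor` are hypothetical names, the structs are trimmed-down stand-ins rather than the real JobSet or KAI Scheduler types, and falling back to 1 when parallelism is unset is an assumption that mirrors the Kubernetes Job default.

```go
package main

import "fmt"

// replicatedJob is a hypothetical stand-in for one entry of
// spec.replicatedJobs in a JobSet. It is NOT the real JobSet API type.
type replicatedJob struct {
	Name        string
	Replicas    int32  // number of Jobs created from this template
	Parallelism *int32 // template.spec.parallelism; nil means unset
}

// podGroupSpec is a hypothetical summary of what the grouper would emit.
type podGroupSpec struct {
	Name         string
	MinAvailable int32
}

// podGroupName follows the pattern from the issue:
// pg-<jobset-name>-<replicatedjob-name>-<job-index>-<jobset-uid>
func podGroupName(jobSetName, rjName string, jobIndex int32, uid string) string {
	return fmt.Sprintf("pg-%s-%s-%d-%s", jobSetName, rjName, jobIndex, uid)
}

// minAvailable derives MinAvailable from parallelism.
// Assumption: an unset parallelism is treated as 1.
func minAvailable(parallelism *int32) int32 {
	if parallelism == nil {
		return 1
	}
	return *parallelism
}

// podGroupsFor returns one PodGroup per Job replica of every replicated job.
func podGroupsFor(jobSetName, uid string, jobs []replicatedJob) []podGroupSpec {
	var out []podGroupSpec
	for _, rj := range jobs {
		for i := int32(0); i < rj.Replicas; i++ {
			out = append(out, podGroupSpec{
				Name:         podGroupName(jobSetName, rj.Name, i, uid),
				MinAvailable: minAvailable(rj.Parallelism),
			})
		}
	}
	return out
}

func main() {
	four := int32(4)
	groups := podGroupsFor("train", "abc123", []replicatedJob{
		{Name: "workers", Replicas: 2, Parallelism: &four},
		{Name: "driver", Replicas: 1}, // parallelism unset
	})
	for _, g := range groups {
		fmt.Printf("%s minAvailable=%d\n", g.Name, g.MinAvailable)
	}
}
```

Because each Job replica gets its own PodGroup, a JobSet with two `workers` replicas and one `driver` replica yields three independent gangs instead of one large group.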

Why is this needed?

JobSet is commonly used for distributed training workloads, but KAI Scheduler's PodGrouper does not currently support it. Users must either create PodGroups manually or fall back to the default grouper, which does not handle JobSet's gang scheduling needs well. Creating a separate PodGroup per Job replica enables automatic gang scheduling for JobSet workloads and avoids sequencing deadlocks, so users get KAI Scheduler's gang scheduling without managing PodGroups themselves.

cc @romanbaron

Labels: enhancement
