@Micky-Yang Micky-Yang commented Oct 10, 2025

While following the reference documentation in practice, I ran into the following problems:

  1. Helm chart repository name conflict between the NVIDIA repos
helm repo add nvidia https://nvidia.github.io/k8s-device-plugin
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo add nvidia https://nvidia.github.io/gpu-operator

All three Helm repos are added under the same name, nvidia, which causes a repo name conflict. This PR gives each repo a distinct name.
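As a rough sketch of what distinct names could look like: the repo names below are hypothetical and not necessarily the ones used in this PR, and the function name `add_nvidia_repos` is only for illustration. The URLs are the three listed above.

```shell
# Sketch only: the repo names are assumptions, the URLs come from the docs.
add_nvidia_repos() {
  # One unique name per repository, so no entry clashes with another.
  helm repo add nvidia-device-plugin https://nvidia.github.io/k8s-device-plugin
  helm repo add nvidia-ngc https://helm.ngc.nvidia.com/nvidia
  helm repo add nvidia-gpu-operator https://nvidia.github.io/gpu-operator
  # Refresh the local chart index for all configured repos.
  helm repo update
}
```

After running `add_nvidia_repos`, `helm repo list` should show three separate entries, and charts are then referenced by the repo name chosen here.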

  2. Automatic installation instructions for the NVIDIA device plugin service

  3. coredns and metrics-server Pods stuck in Pending
    The Pods cannot be scheduled because the default gpu-dra-nodes NodeGroup adds taints to its nodes:

kube-system   coredns-7bf648ff5d-4bs45               0/1     Pending   0          12m
kube-system   coredns-7bf648ff5d-n4bwq               0/1     Pending   0          12m
kube-system   metrics-server-7fb96f5556-4cpdl        0/1     Pending   0          12m
kube-system   metrics-server-7fb96f5556-6mbvh        0/1     Pending   0          12m

A base-nodes NodeGroup was added so that these Pods have nodes they can be scheduled on.
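To confirm that taints are the cause, something like the following diagnostic sketch may help. It assumes `kubectl` access to the cluster, and the function name `show_pending_cause` is hypothetical.

```shell
# Diagnostic sketch; needs kubectl access to the cluster.
show_pending_cause() {
  # System Pods that are still waiting for a node.
  kubectl get pods -n kube-system --field-selector=status.phase=Pending
  # Taint keys on every node; a Pod without a matching toleration
  # is not scheduled onto a tainted node.
  kubectl get nodes -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key'
  # Scheduler events that state why a Pod could not be placed.
  kubectl get events -n kube-system --field-selector=reason=FailedScheduling
}
```

If every node shows a taint and the Pending Pods carry no toleration for it, the scheduler has nowhere to place them, which matches the symptom above.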

  4. Pod stuck in Pending because the Pod nodeSelector label does not match the NodeGroup label key

The Pod nodeSelector is NodeGroupType: gpu-dra, but the NodeGroup label is node-type: "gpu-dra", so no node matches the selector. The Pod nodeSelector has been changed to node-type: "gpu-dra".
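The mismatch can be checked against a live cluster with a sketch like the one below. It assumes `kubectl` access, and the function name `check_node_type_label` is hypothetical. The label keys and value are the ones quoted above.

```shell
# Sketch: compare the labels present on nodes with the two selector keys.
check_node_type_label() {
  # Show every node with its node-type label value as an extra column.
  kubectl get nodes -L node-type
  # Nodes matched by the corrected selector.
  kubectl get nodes -l node-type=gpu-dra
  # Nodes matched by the old selector key; per the description above,
  # this is expected to return none.
  kubectl get nodes -l NodeGroupType=gpu-dra
}
```

A Pod is only schedulable onto nodes whose labels contain every key/value pair in its nodeSelector, so an empty result for a selector means Pods using it stay Pending.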

@Micky-Yang Micky-Yang requested a review from a team as a code owner October 10, 2025 09:30