Failed to watch *v1.VolumeAttachment #7663

Open
madchap opened this issue Jan 6, 2025 · 9 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@madchap

madchap commented Jan 6, 2025

Which component are you using?: cluster-autoscaler on AWS

/area cluster-autoscaler

What version of the component are you using?: 9.45

Component version: Helm chart 9.45

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.3-eks-56e63d8

What environment is this in?: AWS EKS

What did you expect to happen?: I am trying to figure out why the autoscaler does not honor my --ok-total-unready-count=0. It seems the node that enters the NotReady state is stuck with many terminating pods, and I observed at the same time the error in the autoscaler log.

The error is the following:

failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope

When looking at the clusterrole created by the helm chart, I am not seeing this particular resource:

$ k describe clusterrole cluster-autoscaler-aws-cluster-autoscaler
Name:         cluster-autoscaler-aws-cluster-autoscaler
Labels:       app.kubernetes.io/instance=cluster-autoscaler
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=aws-cluster-autoscaler
              helm.sh/chart=cluster-autoscaler-9.45.0
Annotations:  meta.helm.sh/release-name: cluster-autoscaler
              meta.helm.sh/release-namespace: kube-system
PolicyRule:
  Resources                            Non-Resource URLs  Resource Names        Verbs
  ---------                            -----------------  --------------        -----
  endpoints                            []                 []                    [create patch]
  events                               []                 []                    [create patch]
  pods/eviction                        []                 []                    [create]
  leases.coordination.k8s.io           []                 []                    [create]
  jobs.extensions                      []                 []                    [get list patch watch]
  endpoints                            []                 [cluster-autoscaler]  [get update]
  leases.coordination.k8s.io           []                 [cluster-autoscaler]  [get update]
  configmaps                           []                 []                    [list watch get]
  pods/status                          []                 []                    [update]
  nodes                                []                 []                    [watch list create delete get update]
  jobs.batch                           []                 []                    [watch list get patch]
  namespaces                           []                 []                    [watch list get]
  persistentvolumeclaims               []                 []                    [watch list get]
  persistentvolumes                    []                 []                    [watch list get]
  pods                                 []                 []                    [watch list get]
  replicationcontrollers               []                 []                    [watch list get]
  services                             []                 []                    [watch list get]
  daemonsets.apps                      []                 []                    [watch list get]
  replicasets.apps                     []                 []                    [watch list get]
  statefulsets.apps                    []                 []                    [watch list get]
  cronjobs.batch                       []                 []                    [watch list get]
  daemonsets.extensions                []                 []                    [watch list get]
  replicasets.extensions               []                 []                    [watch list get]
  csidrivers.storage.k8s.io            []                 []                    [watch list get]
  csinodes.storage.k8s.io              []                 []                    [watch list get]
  csistoragecapacities.storage.k8s.io  []                 []                    [watch list get]
  storageclasses.storage.k8s.io        []                 []                    [watch list get]
  poddisruptionbudgets.policy          []                 []                    [watch list]
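
One way to confirm the missing permission without reading the whole role is to ask the API server directly. The snippet below is a minimal sketch: the service account name is taken from the error message above, the function name is made up for illustration, and it assumes your own user is allowed to impersonate service accounts (needed for `--as`).

```shell
#!/bin/sh
# Service account identity exactly as it appears in the "forbidden" error.
SERVICE_ACCOUNT="system:serviceaccount:kube-system:cluster-autoscaler"
# Resource in resource.group form, as kubectl auth can-i expects it.
RESOURCE="volumeattachments.storage.k8s.io"

# Hypothetical helper: prints "yes" or "no" for each verb the reflector needs.
check_volumeattachment_access() {
  for verb in get list watch; do
    printf '%s: ' "$verb"
    kubectl auth can-i "$verb" "$RESOURCE" --as="$SERVICE_ACCOUNT"
  done
}
```

Calling `check_volumeattachment_access` against the cluster described here would be expected to answer `no` for `list`, matching the log error.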

I am not sure, but given the --ok-total-unready-count=0, I would expect the node which enters the NotReady state to be fairly quickly replaced by a node that can handle things.

What happened instead?:
The NotReady node sticks around for quite some time, with a bunch of pods in the Terminating state. Eventually it goes away, after maybe 30-45 minutes.

How to reproduce it (as minimally and precisely as possible):
Something is causing my node to enter the NotReady state. I suspect heavy overcommitment on the nodes, especially on memory, after which the kubelet bails out.

I am afraid I can't :-/

Anything else we need to know?:

A log iteration where I see the volumeattachment error:

I0106 17:52:52.606768       1 static_autoscaler.go:274] Starting main loop
I0106 17:52:52.609136       1 aws_manager.go:188] Found multiple availability zones for ASG "eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f"; using eu-central-2b for failure-domain.beta.kubernetes.io/zone label
I0106 17:52:52.758096       1 filter_out_schedulable.go:65] Filtering out schedulables
I0106 17:52:52.758116       1 filter_out_schedulable.go:122] 0 pods marked as unschedulable can be scheduled.
I0106 17:52:52.758125       1 filter_out_schedulable.go:85] No schedulable pods
I0106 17:52:52.758130       1 filter_out_daemon_sets.go:47] Filtered out 0 daemon set pods, 0 unschedulable pods left
I0106 17:52:52.758150       1 static_autoscaler.go:532] No unschedulable pods
I0106 17:52:52.758168       1 static_autoscaler.go:555] Calculating unneeded nodes
I0106 17:52:52.758182       1 pre_filtering_processor.go:67] Skipping ip-10-0-12-37.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758204       1 pre_filtering_processor.go:67] Skipping ip-10-0-28-107.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758209       1 pre_filtering_processor.go:67] Skipping ip-10-0-36-38.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758213       1 pre_filtering_processor.go:67] Skipping ip-10-0-36-82.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758473       1 static_autoscaler.go:598] Scale down status: lastScaleUpTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownDeleteTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownFailTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 scaleDownForbidden=false scaleDownInCooldown=true
I0106 17:52:52.759061       1 orchestrator.go:322] ScaleUpToNodeGroupMinSize: NodeGroup eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f, TargetSize 3, MinSize 3, MaxSize 5
I0106 17:52:52.759135       1 orchestrator.go:366] ScaleUpToNodeGroupMinSize: scale up not needed
I0106 17:52:56.201819       1 reflector.go:349] Listing and watching *v1.VolumeAttachment from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
W0106 17:52:56.206308       1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
E0106 17:52:56.206341       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.VolumeAttachment: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User \"system:serviceaccount:kube-system:cluster-autoscaler\" cannot list resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope" logger="UnhandledError"
I0106 17:52:57.975501       1 reflector.go:879] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Watch close - *v1.Node total 29 items received
@madchap madchap added the kind/bug Categorizes issue or PR as related to a bug. label Jan 6, 2025
@Tarasovych

Same issue with chart 9.45.0, image 1.32.0

@eon01

eon01 commented Jan 7, 2025

Same problem with DigitalOcean / chart 9.45.0 / image 1.32.0.

Adding the necessary permissions for volumeattachments to the ClusterRole bound to the autoscaler service account fixes the problem.

Run:

kubectl edit clusterrole <autoscaler-cluster-role-name>

Then update the YAML and add volumeattachments:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
[...]
rules:
[...]
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - csinodes
  - csidrivers
  - csistoragecapacities
  - volumeattachments # <== This
  verbs:
  - watch
  - list
  - get
[...]
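
For anyone who prefers not to edit the role interactively, the same change can probably be applied as a JSON Patch. This is only a sketch: the ClusterRole name is the one from the original report and will differ with other release names, the function name is made up, and it appends a separate rule instead of extending the existing storage.k8s.io rule, which should be equivalent for RBAC purposes.

```shell
#!/bin/sh
# ClusterRole name from the original report; override for your own release.
CLUSTER_ROLE="${CLUSTER_ROLE:-cluster-autoscaler-aws-cluster-autoscaler}"

# JSON Patch appending one rule that grants read access to volumeattachments.
PATCH='[{"op":"add","path":"/rules/-","value":{"apiGroups":["storage.k8s.io"],"resources":["volumeattachments"],"verbs":["watch","list","get"]}}]'

# Hypothetical helper: applies the patch to the live ClusterRole.
grant_volumeattachments() {
  kubectl patch clusterrole "$CLUSTER_ROLE" --type=json -p "$PATCH"
}
```

Run `grant_volumeattachments` once. Since this changes the live object outside the Helm release, it may not survive a reinstall or a later chart upgrade.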

@devops-cafex

Seeing the same in our EKS cluster: the volumeattachments resource is missing from the ClusterRole.

@omers

omers commented Jan 12, 2025

Same for me, with a fresh installed EKS 1.31

@JnMik

JnMik commented Jan 14, 2025

Same issue here

@devops-cafex

Latest chart 9.45.1 on EKS 1.31 has the same issue

@TomKeyte

Same issue :/

@ScottKinder

I am also seeing this issue in the most current Helm chart as of today, 9.45.1.

@TomKeyte

I fixed this by downgrading to 9.44.0 based on this table:

https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#releases
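
A downgrade like this could look roughly as follows. It is a sketch under assumptions: the release name and namespace come from the Helm annotations in the original report, the `autoscaler` repo alias assumes the chart repo was added under that name, and the function name is made up.

```shell
#!/bin/sh
# Release name and namespace as shown in the ClusterRole annotations above.
RELEASE="${RELEASE:-cluster-autoscaler}"
NAMESPACE="${NAMESPACE:-kube-system}"
# Assumes: helm repo add autoscaler https://kubernetes.github.io/autoscaler
CHART="${CHART:-autoscaler/cluster-autoscaler}"
CHART_VERSION="${CHART_VERSION:-9.44.0}"

# Hypothetical helper: pins the release to the older chart, keeping current values.
pin_chart_version() {
  helm upgrade "$RELEASE" "$CHART" \
    --namespace "$NAMESPACE" \
    --version "$CHART_VERSION" \
    --reuse-values
}
```

Check the releases table linked above first, so the chart version you pin matches your cluster's Kubernetes version.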
