ElasticSearch statefulset broken after 1 hour #192
My cluster never recovered. There is some deadlock occurring related to EBS persistent volumes and the autoscaler. I ran out of time and energy investigating and had to press on.
Feel free to close this issue if you think it's a rare event or a StatefulSet bug.
I'm not 100% sure, but I think my problem is fixed in PR kubernetes/kubernetes#46463, which I found via the Kubernetes v1.7.0-beta.1 CHANGELOG. So hopefully the fix is coming on 28/Jun/17.
There were some issues regarding the etcd version as well. The current version, 3.0.10, is problematic. Would you add an entry to update the etcd version too, like this?
I too have the problems with StatefulSets described in #185, but I think that's a Kubernetes problem. Regarding the autoscaler, I had problems with it too, so I just disabled it entirely.
I brought up a Kubernetes cluster with Tack in an existing VPC in us-east-1, and all was good until suddenly the first pod in the ElasticSearch StatefulSet was killed. I confirmed from CloudTrail and the ASG Activity History that the autoscaler had removed a Worker which, by chance, had an ElasticSearch pod on it. I can see that the 25G EBS volume the StatefulSet volumeClaimTemplates had provisioned is now unattached. The second ElasticSearch pod was assigned to a master node, so it is unaffected by scaling events. One solution would be to force both ElasticSearch pods to use master nodes (a sketch follows below).
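A minimal sketch of that workaround: add a nodeSelector to the StatefulSet's pod template so both replicas schedule onto masters. The kube-system namespace and the master node label used below are assumptions and need to match your cluster, and existing pods have to be recreated before the selector takes effect.

```sh
# Sketch only: namespace and master label are assumptions -- verify with
# `kubectl get nodes --show-labels` before applying.
kubectl --namespace kube-system patch statefulset elasticsearch-logging \
  --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/role":"master"}}}}}'
```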
Here we see the StatefulSet is broken:

The elasticsearch-logging-1 pod exists but the elasticsearch-logging-0 pod is missing:

This command explains the cause of the failure, i.e. it's trying to attach an EBS volume to a now non-existent node:
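Roughly, those checks look like the following (the kube-system namespace is an assumption; the resource and pod names are taken from the issue):

```sh
# The StatefulSet reports fewer ready replicas than desired:
kubectl --namespace kube-system get statefulset elasticsearch-logging

# elasticsearch-logging-1 is listed but elasticsearch-logging-0 is not:
kubectl --namespace kube-system get pods | grep elasticsearch-logging

# Recent events include the failed EBS attach against the deleted node:
kubectl --namespace kube-system get events | grep -i attach
```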
This command shows there is some problem deleting the node (even though it does not show up in kubectl get nodes):

(FYI: I think this log spam is a separate issue, fixed in kubernetes/kubernetes#45923.)
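For context, one place this kind of node-cleanup error tends to surface is the kube-controller-manager logs; a sketch, assuming the controller-manager runs as a pod in kube-system (as it does on a static-pod master):

```sh
# Assumption: kube-controller-manager runs as a pod in kube-system.
# Grep its logs for the removed worker's name.
kubectl --namespace kube-system logs \
  "$(kubectl --namespace kube-system get pods -o name | grep controller-manager | head -n1)" \
  | grep ip-10-56-0-138
```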
Checking the autoscaler info also shows there are still 6 registered nodes:
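Two ways this might be checked (both assume the cluster-autoscaler runs in kube-system; the status ConfigMap and the pod label are assumptions that only hold for some autoscaler versions):

```sh
# Newer cluster-autoscaler versions publish a status ConfigMap:
kubectl --namespace kube-system get configmap cluster-autoscaler-status \
  -o jsonpath='{.data.status}'

# Otherwise, the registered-node count appears in the autoscaler's logs
# (the app=cluster-autoscaler label is an assumption):
kubectl --namespace kube-system logs -l app=cluster-autoscaler | grep -i registered
```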
I'm not sure how to tell Kubernetes to truly forget the old ip-10-56-0-138 worker node or to stop trying to mount the volume to an instance that doesn't exist.
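Not a confirmed fix, but a couple of things that could be tried: delete the stale Node object if it still exists in the API, and force-detach the orphaned EBS volume in AWS so it can be attached to whichever node the rescheduled pod lands on. The volume ID below is hypothetical; look it up from the PersistentVolume.

```sh
# Delete the stale node object, if it is still present in the API:
kubectl get nodes
kubectl delete node ip-10-56-0-138   # use the exact name kubectl reports

# Find the EBS volume backing the StatefulSet's PV, then force-detach it
# in AWS (the volume ID here is hypothetical):
kubectl describe pv | grep -i volumeid
aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
```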