Cluster Erroneously Stuck in Failed State #2146

@spjmurray

/kind bug

What steps did you take and what happened:

Just checking the state of things in ArgoCD and noted my cluster was in the red. Boo! On further inspection I can see:

  failureMessage: >-
    Failure detected from referenced resource
    infrastructure.cluster.x-k8s.io/v1beta1, Kind=OpenStackCluster with name
    "cluster-bc3d5fc1": failed to reconcile external network: failed to get
    external network: Get
    "https://compute.sausage.cloud:9696/v2.0/networks/5617d17e-fdc1-4aa1-a14b-b9b5136c65af":
    dial tcp: lookup compute.sausage.cloud on 10.96.1.35:53: server misbehaving
  failureReason: UpdateError
  infrastructureReady: true
  observedGeneration: 2
  phase: Failed

However, there is no such failure message attached to the OSC resource, so I figure CAPO did sort itself out eventually. I'll just edit the resource, says I, and set the phase (didn't Kubernetes deem such things in the API a total fail?) back to Provisioned, and huzzah. But that didn't work: the failure magically reappeared from somewhere. I have no idea how this is even possible, but I digress...

According to kubernetes-sigs/cluster-api#10847, CAPO should only ever set these things if something is terminal, and a DNS failure quite frankly isn't, especially if you are a road warrior living Mad Max style like some Antipodean Adonis, where Wi-Fi is always up and down.

What did you expect to happen:

Treat this error as transient.

Anything else you would like to add:

Just reaching out for discussion before I delve into the code; this may already be known about or fixed. As always, you may have opinions on how this could be fixed. Logically:

var derr *net.DNSError

if errors.As(err, &derr) {
  // handle gracefully
}

should be the simple solution, depending on how well errors are propagated from Gophercloud, which is another story entirely.

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): 0.10.3
  • Cluster-API version: 1.7.2
  • OpenStack version: n/a
  • Minikube/KIND version: n/a
  • Kubernetes version (use kubectl version): n/a
  • OS (e.g. from /etc/os-release): n/a
