[e2e] Flaky e2e_import_gitops_v3 job #976

Open
anmazzotti opened this issue Jan 6, 2025 · 3 comments

Labels: area/testing (indicates an issue related to test), status/waiting-for-upstream

@anmazzotti
Contributor

anmazzotti commented Jan 6, 2025

What steps did you take and what happened?

I have seen two instances of the job failing while waiting for cluster deletion:

   Timeline >>
  STEP: Creating a namespace for hosting the "creategitops-v3" test spec @ 01/03/25 09:24:39.627
  INFO: Creating namespace creategitops-v3-jjycsr
  INFO: Creating event watcher for namespace "creategitops-v3-jjycsr"
  STEP: Create Git repository @ 01/03/25 09:24:42.876
  STEP: Create fleet repository structure @ 01/03/25 09:24:45.011
  STEP: Committing changes to fleet repo and pushing @ 01/03/25 09:24:45.011
  STEP: Applying GitRepo @ 01/03/25 09:24:46.49
  STEP: Creating GitRepo from template @ 01/03/25 09:24:46.49
  STEP: Applying GitRepo @ 01/03/25 09:24:46.49
  STEP: Waiting for the CAPI cluster to appear @ 01/03/25 09:24:46.663
  STEP: Waiting for cluster control plane to be Ready @ 01/03/25 09:25:16.714
  STEP: Waiting for the CAPI cluster to be connectable @ 01/03/25 09:33:17.071
  STEP: Storing the original CAPI cluster kubeconfig @ 01/03/25 09:33:17.132
  STEP: Getting Rancher kubeconfig secret @ 01/03/25 09:33:17.133
  STEP: Loading secret data into kubeconfig @ 01/03/25 09:33:17.144
  STEP: Writing original kubeconfig to temp file /tmp/kubeconfig-original598099136 @ 01/03/25 09:33:17.144
  STEP: Running checks on Rancher cluster @ 01/03/25 09:33:17.145
  STEP: Waiting for the rancher cluster record to appear @ 01/03/25 09:33:17.145
  STEP: Waiting for the rancher cluster to have a deployed agent @ 01/03/25 09:33:17.169
  STEP: Waiting for the rancher cluster to be ready @ 01/03/25 09:34:17.2
  STEP: Waiting for the rancher cluster to be ready @ 01/03/25 09:35:47.25
  STEP: Rancher cluster should have the 'NoCreatorRBAC' annotation @ 01/03/25 09:35:47.25
  STEP: Waiting for the CAPI cluster to be connectable using Rancher kubeconfig @ 01/03/25 09:35:47.26
  STEP: Getting Rancher kubeconfig secret @ 01/03/25 09:35:47.26
  STEP: Loading secret data into kubeconfig @ 01/03/25 09:35:47.273
  STEP: Writing updated kubeconfig to temp file /tmp/kubeconfig3604869300 @ 01/03/25 09:35:47.274
  STEP: Deleting GitRepo from Rancher @ 01/03/25 09:36:55.191
  STEP: Getting GitRepo from cluster @ 01/03/25 09:36:55.191
  STEP: Deleting GitRepo from cluster @ 01/03/25 09:36:55.209
  STEP: Waiting for the rancher cluster record to be removed @ 01/03/25 09:36:55.255
  STEP: Deleting cluster creategitops-v3-jjycsr/clusterv3-gke @ 01/03/25 09:37:25.291
  STEP: Deleting cluster creategitops-v3-jjycsr/clusterv3-gke @ 01/03/25 09:37:25.314
  INFO: Waiting for the Cluster creategitops-v3-jjycsr/clusterv3-gke to be deleted
  STEP: Waiting for cluster creategitops-v3-jjycsr/clusterv3-gke to be deleted @ 01/03/25 09:37:25.343
  [FAILED] in [AfterEach] - /home/ghr/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:180 @ 01/03/25 10:07:25.343
  << Timeline

  [FAILED] Timed out after 1800.000s.
  waiting for cluster deletion timed out
  Expected
      <bool>: false
  to be true
  In [AfterEach] at: /home/ghr/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:180 @ 01/03/25 10:07:25.343

  Full Stack Trace
    sigs.k8s.io/cluster-api/test/framework.WaitForClusterDeleted(***0x2dada70, 0x45bed60***, ***0x2dc56a0, 0xc000972d80***, 0xc0004ad808, ***0x0, 0x0***, ***0xc0009414a0, 0x2, 0x2***)
    	/home/ghr/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:180 +0x298
    sigs.k8s.io/cluster-api/test/framework.DeleteAllClustersAndWait(***0x2dada70, 0x45bed60***, ***0x2dc56a0, 0xc000972d80***, ***0xc0007a8de0, 0x16***, ***0x0, 0x0***, ***0xc0009414a0, 0x2, ...***)
    	/home/ghr/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:292 +0x3f6
    github.com/rancher/***/test/e2e.DumpSpecResourcesAndCleanup(***0x2dada70, 0x45bed60***, ***0x29a1f0d, 0xf***, ***0x2dc41f8, 0xc0001ba310***, ***0x0?, 0x0?***, 0xc0007de580, 0xc0001dba00, ...)
    	/home/ghr/_work/***/***/test/e2e/helpers.go:72 +0x19a
    github.com/rancher/***/test/e2e/specs.CreateMgmtV3UsingGitOpsSpec.func4()
    	/home/ghr/_work/***/***/test/e2e/specs/import_gitops_mgmtv3.go:387 +0x605

Browsing the job artifacts, I can see that the cluster has a deletion timestamp, but something prevents the actual deletion. This does not seem to be a timeout issue, and it can be reproduced fairly often.
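
For anyone checking the artifacts the same way, here is a minimal Go sketch of that inspection: it reads a dumped object and reports which finalizers are still present once a deletion timestamp is set, since remaining finalizers are what keep a Kubernetes object from being removed. It assumes the Cluster object is available as JSON (the artifacts may be in another format). The `blockingFinalizers` helper, the sample timestamp, and the finalizer name are hypothetical and not taken from the job output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// objectMeta mirrors only the standard Kubernetes metadata fields needed here.
type objectMeta struct {
	Name              string   `json:"name"`
	Namespace         string   `json:"namespace"`
	DeletionTimestamp *string  `json:"deletionTimestamp"`
	Finalizers        []string `json:"finalizers"`
}

type object struct {
	Kind     string     `json:"kind"`
	Metadata objectMeta `json:"metadata"`
}

// blockingFinalizers (hypothetical helper) returns the finalizers still set on
// an object that is already marked for deletion. It returns nil when the
// object has no deletion timestamp, i.e. deletion was never requested.
func blockingFinalizers(raw []byte) ([]string, error) {
	var o object
	if err := json.Unmarshal(raw, &o); err != nil {
		return nil, fmt.Errorf("decoding object: %w", err)
	}
	if o.Metadata.DeletionTimestamp == nil {
		return nil, nil
	}
	return o.Metadata.Finalizers, nil
}

func main() {
	// Illustrative dump only: the timestamp and finalizer are made up.
	sample := []byte(`{
	  "kind": "Cluster",
	  "metadata": {
	    "name": "clusterv3-gke",
	    "namespace": "creategitops-v3-jjycsr",
	    "deletionTimestamp": "2025-01-03T09:37:25Z",
	    "finalizers": ["example.com/hypothetical-finalizer"]
	  }
	}`)

	finalizers, err := blockingFinalizers(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range finalizers {
		fmt.Println("still blocking deletion:", f)
	}
}
```

Whichever controller owns a remaining finalizer is the one that has not finished its cleanup, which narrows down where to look in the controller logs.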

@anmazzotti
Contributor Author

Possibly a duplicate of #990

@salasberryfin
Contributor

This is most likely an issue in CAPG. GKE clusters are subject to updates initiated by GCP. The controller fails to identify these updates correctly and tries to perform a different operation on the cluster, which causes an incompatible operation error. This can happen on both creation and deletion. The controller eventually becomes healthy again, but that may take too long.

The issue I'm referring to is kubernetes-sigs/cluster-api-provider-gcp#1363. A fix has already been merged, but we're waiting for it to be released with the next minor version.

@salasberryfin
Contributor

A new CAPG patch release, v1.8.1, is going to be published. Bumping CAPG in Turtles once it is available should help resolve this intermittent issue. I'll take this for the time being, as I'm helping to get the release out.

@salasberryfin salasberryfin self-assigned this Jan 15, 2025
@salasberryfin salasberryfin moved this from Blocked to In Progress (8 max) in CAPI & Hosted Kubernetes providers (EKS/AKS/GKE) Jan 21, 2025