Ensure process groups are removed from the pending restart list if they are stuck in terminating or the process is missing #2325

Open
wants to merge 4 commits into
base: main
from

Conversation

johscheuer
Member

Description

In our e2e tests we have seen a few cases where process groups were not removed from the pending restart set while they were stuck in terminating. If the process is missing or the process group is stuck in terminating, we should remove it from the pending restart set.
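
As a rough illustration of the rule described above, here is a minimal Go sketch. It is not the operator's actual code: the `processGroup` type, its `Terminating` and `ProcessMissing` fields, and the `prunePendingRestarts` helper are hypothetical names chosen for this example.

```go
package main

import "fmt"

// processGroup is a hypothetical, simplified stand-in for a process group
// status entry. The field names are assumptions made for this sketch.
type processGroup struct {
	ID             string
	Terminating    bool // the process group is stuck in terminating
	ProcessMissing bool // no process is reporting for this process group
}

// prunePendingRestarts returns a copy of the pending restart set without the
// process groups that are stuck in terminating or whose process is missing.
// IDs in the pending set that have no matching process group are left as-is.
func prunePendingRestarts(pending map[string]struct{}, groups []processGroup) map[string]struct{} {
	pruned := make(map[string]struct{}, len(pending))
	for id := range pending {
		pruned[id] = struct{}{}
	}

	for _, group := range groups {
		if group.Terminating || group.ProcessMissing {
			// Deleting a key that is not present is a no-op in Go.
			delete(pruned, group.ID)
		}
	}

	return pruned
}

func main() {
	pending := map[string]struct{}{"storage-1": {}, "storage-2": {}, "log-1": {}}
	groups := []processGroup{
		{ID: "storage-1"},
		{ID: "storage-2", Terminating: true},
		{ID: "log-1", ProcessMissing: true},
	}

	for id := range prunePendingRestarts(pending, groups) {
		fmt.Println("still pending restart:", id)
	}
}
```

The sketch copies the set instead of mutating it so the caller can compare before and after; the real change may well update the set in place.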

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Discussion

Testing

Ran the unit tests; CI will run the e2e tests.

Documentation

Follow-up

@johscheuer johscheuer requested a review from nicmorales9 July 15, 2025 07:41
@johscheuer johscheuer added the bug Something isn't working label Jul 15, 2025
@@ -1758,7 +1758,7 @@ var _ = Describe("Operator", Label("e2e", "pr"), func() {
 fdbCluster.GetCluster().Status.ProcessGroups,
 processGroupID,
 )
-}).WithTimeout(5 * time.Minute).WithPolling(5 * time.Second).Should(BeNil())
+}).WithTimeout(10 * time.Minute).WithPolling(5 * time.Second).Should(BeNil())
Member Author

I increased the wait time here. When the global sync mode is enabled, the coordination takes longer, so the actual replacement often takes slightly more than 5 minutes (around 6 minutes in the last 2 failed tests).

@foundationdb-ci
Contributor

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: 794a571
  • Duration 3:59:14
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: 5121ce5
  • Duration 3:39:10
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: f2b5a8b
  • Duration 4:14:57
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this Jul 16, 2025
@johscheuer johscheuer reopened this Jul 16, 2025
@foundationdb-ci
Contributor

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: f2b5a8b
  • Duration 4:08:09
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer force-pushed the fixes-global-coordination branch from f2b5a8b to 19b7b77 Compare July 16, 2025 08:38
@foundationdb-ci
Contributor

Result of fdb-kubernetes-operator-pr on Linux RHEL 9

  • Commit ID: 19b7b77
  • Duration 3:11:21
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Labels
bug Something isn't working
2 participants