Fix controller-manager failures causing CNS to remain incorrectly healthy #3688

Draft
wants to merge 3 commits into
base: master
from


@Copilot Copilot AI commented May 27, 2025

Problem

Currently, CNS retries starting controller-manager indefinitely, even after it encounters terminal errors, so CNS keeps reporting a healthy state when it is not healthy. This affects both controller-manager in CRD mode and multiTenantController in MultiTenantCRD mode.

Solution

This PR adds timeouts for controller startup and proper error handling to ensure CNS fails appropriately when controllers can't be started. Key changes:

  1. Added timeout contexts for both controller-manager and multiTenantController with a 5-minute default
  2. Added new metrics to track timeout occurrences:
    • cns_ctrlmanager_start_timeouts_total
    • cns_multitenantcontroller_start_timeouts_total
  3. Made CNS handle terminal errors more gracefully by propagating errors through channels instead of calling os.Exit directly from goroutines
  4. Simplified goroutine structure for starting controllers and improved error reporting
  5. Extracted common functionality to reusable functions for better maintainability

Before

  • CNS would indefinitely retry controller-manager.Start on failures
  • No timeout was set for controller startup attempts
  • Terminal errors wouldn't cause CNS to fail appropriately

After

  • Controller startup attempts are bounded by a timeout (5 minutes by default)
  • Terminal errors are properly propagated and handled
  • CNS will report unhealthy status when controllers fail to start

These changes ensure CNS will not remain in an incorrectly healthy state when controllers fail to initialize properly.

Fixes #3687.


@Copilot Copilot AI changed the title from "[WIP] Failures in controller-manager leave CNS in an incorrectly healthy state" to "Fix controller-manager failures causing CNS to remain incorrectly healthy" May 27, 2025
@Copilot Copilot AI requested a review from timraymond May 27, 2025 20:07
Copilot finished work on behalf of timraymond May 27, 2025 20:07
Development

Successfully merging this pull request may close these issues.

Failures in controller-manager leave CNS in an incorrectly healthy state