WIP: Improve CAPZ Bootstrapping Extension's error message #5509

Open
willie-yao wants to merge 1 commit into main from improve-bootstrap-extension-logs

Conversation

willie-yao
Contributor

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
This PR adds a more descriptive error message for the CAPZ Bootstrapping Extension so that users can better understand what went wrong when the extension fails to install.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Improve CAPZ Bootstrapping Extension's error message

@k8s-ci-robot added labels on Mar 24, 2025:
  • release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
  • kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt.)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from willie-yao. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XS label (Denotes a PR that changes 0-9 lines, ignoring generated files.) on Mar 24, 2025
codecov bot commented Mar 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.57%. Comparing base (1abf6ab) to head (a6117b3).
Report is 34 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5509   +/-   ##
=======================================
  Coverage   52.57%   52.57%           
=======================================
  Files         272      272           
  Lines       29470    29470           
=======================================
  Hits        15495    15495           
  Misses      13167    13167           
  Partials      808      808           


@@ -112,7 +112,7 @@ const (

 var (
 	// LinuxBootstrapExtensionCommand is the command the VM bootstrap extension will execute to verify Linux nodes bootstrap completes successfully.
-	LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
+	LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then echo 'Error joining node to cluster: kubeadm init failed. To debug, check the cloud-init, kubelet, or other bootstrap logs: https://capz.sigs.k8s.io/self-managed/troubleshooting.html?highlight=kubeadmcontrolplane#checking-cloud-init-logs-ubuntu.'; exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
Contributor

We might want to say "kubeadm init or join", as this error could occur on either the first control plane node (init) or any other node that joins the cluster afterwards.

@willie-yao force-pushed the improve-bootstrap-extension-logs branch from 9cede97 to a6117b3 on March 25, 2025 17:10
@willie-yao
Contributor Author

/retest

@@ -112,7 +112,7 @@ const (

 var (
 	// LinuxBootstrapExtensionCommand is the command the VM bootstrap extension will execute to verify Linux nodes bootstrap completes successfully.
-	LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
+	LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then echo 'Error joining node to cluster: kubeadm init or join failed. To debug, check the cloud-init, kubelet, or other bootstrap logs: https://capz.sigs.k8s.io/self-managed/troubleshooting.html?highlight=kubeadmcontrolplane#checking-cloud-init-logs-ubuntu.'; exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
Member

I am not sure how helpful this message is.

  • Unless the verbosity of kubeadm init/join is increased by --v=<any number greater than 5>, users do not have a way to debug the issue.
  • Where else do we expect to see this message? Is it available on the AzureMachine.Status?

Suggestions:

  • Should we rename #checking-cloud-init-logs-ubuntu to #checking-cloud-init-logs-linux, since Linux is more generic?
  • Can we rephrase the error message from "Error joining node to cluster: kubeadm init or join failed..." to "Error: kubeadm init or join failed. Refer: https://capz.sigs.k8s.io/self-managed/troubleshooting.html?highlight=kubeadmcontrolplane#checking-cloud-init-logs-ubuntu for more guidance on debugging this."?

Contributor Author

Great suggestions! Gonna work on modifying it now...

> Unless the verbosity of kubeadm init/join is increased by --v=<any number greater than 5> users do not have a way to debug the issue.

Is that true? If so, is there a way to change that from CAPZ? I've debugged bootstrap failures previously with these methods.

> Where else do we expect to see this message? Is it available on the AzureMachine.Status?

I think so, but I'm not 100% sure. It should show up in the CAPZ controller logs for sure.

@willie-yao
Contributor Author

As discussed with @jackfrancis, I'll look into changing the script to search the cloud-init logs automatically for errors and display them to the user.

@willie-yao changed the title from "Improve CAPZ Bootstrapping Extension's error message" to "WIP: Improve CAPZ Bootstrapping Extension's error message" on Mar 28, 2025
@k8s-ci-robot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Mar 28, 2025
Labels
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
  • kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt.)
  • release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
  • size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.)
Projects
Status: Todo
4 participants