Multiple failures during bare metal SNO installation #7223

Open
Levovar opened this issue Jan 23, 2025 · 1 comment

Comments


Levovar commented Jan 23, 2025

I'm following the guides https://github.com/openshift/assisted-service/blob/master/docs/hive-integration/kube-api-getting-started.md and https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.12/html/clusters/cluster_mce_overview#create-intro, but I've run into so many different problems that I'm not sure SNO installation is even supposed to work.
I must be doing something fundamentally wrong, so I'd appreciate any input here!

Context:

  • AIS version is whatever is installed by default with MCE 2.7.2
  • OS image is 4.17.2
  • release image is 4.17.13

Note: all the listed issues were present in 2.6.X / 4.16.X as well.

Prerequisites
I create all the required CRs mentioned in the documentation and boot the full discovery image myself (the baremetal operator's Ironic path also doesn't work, but that's a whole other can of worms).
The node boots up and the Agent CR is created in my cluster, so far so good. Then...
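
For reference, the discovery ISO URL can be read straight from the InfraEnv status once the image is generated; something like this (name and namespace are placeholders for my environment) is enough to grab it for manual booting:

# read the discovery ISO download URL from the InfraEnv CR (placeholder name/namespace)
oc -n my-infraenv-ns get infraenv my-infraenv -o jsonpath='{.status.isoDownloadURL}'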

Problems
1. "The host name localhost.localdomain is forbidden"
Installation first fails on this Agent validation: the baseline Ignition config embedded by the agent image service doesn't seem to set a hostname, yet there is a mandatory validation for it in the agent.
Who is supposed to set this? Am I missing some configuration?
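
(A possible workaround from the hub side, since the Agent CRD exposes a spec.hostname field; the agent name, namespace, and hostname below are placeholders, and I'm not claiming this is the intended mechanism:)

# set the hostname on the Agent CR from the hub cluster (placeholder name/namespace/hostname)
oc -n my-infraenv-ns patch agent my-sno-agent --type merge \
  -p '{"spec":{"hostname":"master-0.example.com"}}'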

2. Hostname validation rule is not periodically re-checked
Well, I guess I will set the hostname myself for now to be able to proceed.
However, unlike the NTP synchronization related problems (which do eventually resolve on their own), the hostname validation failure never goes away once triggered.
I waited for an hour. Then I approved the Agent. Then I even tried to manually delete the related status fields from the CR, all to no avail.
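
(For completeness, "setting the hostname myself" on the node over SSH is just the usual hostnamectl call; the hostname here is a placeholder:)

# on the booted node, via SSH (placeholder hostname)
sudo hostnamectl set-hostname master-0.example.com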

3. Bootkube service doesn't seem to inherit HTTP/S proxy settings from InfraEnv
Well, I guess I will "sudo sysctemctl restart agent" through SSH for now to be able to proceed.
Installation progresses to "waiting for bootkube" state, however bootstrapping fails with a timeout. Upon closer inspection it turns out the release images exclusively used by the bootkube service cannot be pulled because the service could not reach quay.io registry. Which is kinda strange considering the agent and the release image service previously both could access the internet.
Upon closely inspecting the bootkube service it seems the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY settings are not inherited by this unit from the InfraEnv CR (neither are these set into Podman's proxy setting). For reference the agent service explicitly includes these settings in its unit file.
Who is responsible for making sure bootkube service is operational behind an HTTP proxy, and where should I put these proxy settings in addition to the InfraEnv CR?
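
(The manual workaround I describe in problem no. 4 below, copying the proxy settings into the bootkube unit, amounts to something like the following systemd drop-in; the proxy endpoint and NO_PROXY list are placeholders for my environment:)

# on the node: add a drop-in so the bootkube unit sees the proxy (placeholder values)
sudo mkdir -p /etc/systemd/system/bootkube.service.d
sudo tee /etc/systemd/system/bootkube.service.d/10-proxy.conf <<'EOF'
[Service]
Environment=HTTP_PROXY=http://proxy.example.com:3128
Environment=HTTPS_PROXY=http://proxy.example.com:3128
Environment=NO_PROXY=localhost,127.0.0.1,.svc,.cluster.local
EOF
sudo systemctl daemon-reload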

4. Bootkube bootstrapping script has some interesting logic for SNO
Well, I guess I will copy the proxy settings from the agent service into the bootkube service file for now to be able to proceed.
Images are successfully pulled and the bootstrapping manifests are generated, so I'm goaded into believing this is it.
But alas, it was not meant to be: the service eventually times out waiting for etcd to come up. Specifically, this part seems to be failing:


# in case of single node, if we removed etcd, there is no point to wait for it on restart
if [ ! -f stop-etcd.done ]
then
    record_service_stage_start "wait-for-etcd"
    # Wait for the etcd cluster to come up.
    wait_for_etcd_cluster
    record_service_stage_success
fi

I think I understand the intention, however this is executed during the initial run as well, right after template generation and right before the following block:

if [ "$BOOTSTRAP_INPLACE" = true ]
then
    REQUIRED_PODS=""
fi
 
echo "Starting cluster-bootstrap..."
run_cluster_bootstrap() {
    record_service_stage_start "cb-bootstrap"
    bootkube_podman_run \

So obviously the etcd health check will perpetually fail, simply because there is no etcd running yet.
If I provision the done file, cluster-bootstrap also fails because REQUIRED_PODS is empty: nothing comes up, the 20 minute timeout waiting for the API server fires, and the service is restarted.

This is where I ultimately gave up. So again, I'd appreciate any suggestions as to what I am missing, because executing the procedure described in the documentation verbatim doesn't seem to produce a working environment on bare metal SNO.


Levovar commented Jan 23, 2025

For the next generation: following https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.12/html/clusters/cluster_mce_overview#config-agent-proxy seems to solve problems no. 3 and no. 4. The documentation in this repository could be updated to mention that the proxy settings must be set in both the InfraEnv and the AgentClusterInstall CRs.
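
(In CR terms, that boils down to having the proxy present in both specs; a minimal sketch with placeholder names, namespace, and proxy values, assuming the spec.proxy fields documented for both CRDs:)

# proxy in the InfraEnv spec (placeholder name/namespace/values)
oc -n my-cluster-ns patch infraenv my-infraenv --type merge \
  -p '{"spec":{"proxy":{"httpProxy":"http://proxy.example.com:3128","httpsProxy":"http://proxy.example.com:3128","noProxy":"localhost,127.0.0.1,.svc,.cluster.local"}}}'

# and the same proxy in the AgentClusterInstall spec
oc -n my-cluster-ns patch agentclusterinstall my-cluster --type merge \
  -p '{"spec":{"proxy":{"httpProxy":"http://proxy.example.com:3128","httpsProxy":"http://proxy.example.com:3128","noProxy":"localhost,127.0.0.1,.svc,.cluster.local"}}}'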

Problems no. 1 and no. 2 still persist.
