Minimise Baremetal footprint, live-iso bootstrap #361
---
title: minimise-baremetal-footprint
authors:
  - "@hardys"
reviewers:
  - "@avishayt"
  - "@beekhof"
  - "@crawford"
  - "@deads2k"
  - "@dhellmann"
  - "@hexfusion"
  - "@mhrivnak"
approvers:
  - "@crawford"
creation-date: "2020-06-04"
last-updated: "2020-06-05"
status: implementable
see-also: compact-clusters
replaces:
superseded-by:
---

# Minimise Baremetal footprint

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Over recent releases OpenShift has improved support for small-footprint
deployments, in particular with the compact-clusters enhancement, which adds
full support for 3-node clusters where the masters are schedulable.

This is a particularly useful deployment option for baremetal PoC
environments, where the amount of physical hardware is often limited, but it
leaves the problem of where to run the installer and bootstrap VM.

The current solution for IPI baremetal is to require a 4th bootstrap host: a
machine physically connected to the 3 master nodes that runs the installer
and/or the bootstrap VM. This effectively means the minimum footprint is 4
nodes, unless you can temporarily connect a provisioning host to the cluster
machines.

A similar constraint exists for UPI baremetal deployments: although a
3-master cluster is possible, you need to run a 4th bootstrap node somewhere
for the duration of the initial installation.

Even in larger deployments, it is not recommended to host the bootstrap or
controlplane services on a host that will later run user workloads, due to
the risk of, for example, stale loadbalancer configuration causing
controlplane traffic to reach that node. In practice this means you always
need an additional node (which may need to be dedicated per-cluster for
production cases).

## Motivation

This proposal outlines a potential approach to avoid the requirement for a
4th node, leveraging the recent etcd-operator improvements and work to enable
a live-iso replacement for the bootstrap VM.

### Goals

* Enable clusters to be deployed on baremetal with exactly 3 nodes
* Avoid the need for additional nodes to run install/bootstrap components
* Simplify the existing IPI baremetal day-1 user experience

### Non-Goals

* Supporting any controlplane topology other than three masters.
* Supporting deployment of a single master or scaling from such a deployment.
* Support for pivoting the bootstrap machine to a worker.

## Proposal

### User Stories | ||
As a user of OpenShift, I should be able to install a fully supportable
3-node cluster in baremetal environments, without the requirement to
temporarily connect a 4th node to host installer/bootstrap services.

As a large-scale production user with multi-cluster deployments, I want to
avoid dedicated provisioning nodes per-cluster in addition to the
controlplane node count, and to have the ability to redeploy in-place for
disaster recovery reasons.

As an existing user of the IPI baremetal platform, I want to simplify my
day-1 experience by booting a live-ISO for the bootstrap services, instead of
a host with a bootstrap VM that hosts those services.

### Risks and Mitigations

This proposal builds on work already completed, e.g. the etcd-operator
improvements, but we need to ensure any change in deployment topology is well
tested and fully supported, to avoid these deployments becoming an unreliable
corner-case.

## Design Details

### Enabling three-node clusters on baremetal

OpenShift now provides a bootable RHCOS-based installer ISO image, which can
be booted on baremetal and adapted to run the components normally deployed on
the bootstrap VM.

This means we can run the bootstrap services in-place on one of the target
hosts, which we can later reboot to become a master (referred to as master-0
below).

While master-0 is running the bootstrap services, the two additional hosts
are provisioned, either with a UPI-like boot-it-yourself method or via a
variation on the current IPI flow where the provisioning components run on
master-0 alongside the bootstrap services (exactly as they do today on the
bootstrap VM).

When the two masters have deployed, they form the initial OpenShift
controlplane, and master-0 then reboots to become a regular master. At this
point it joins the cluster, bootstrapping is complete, and the result is a
full-HA 3-master deployment without any dependency on a 4th provisioning
host.

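Expressed as an install-config, the target topology is the existing
compact-cluster shape: 3 schedulable masters and no dedicated workers. The
sketch below is illustrative only; the domain, cluster name, and baremetal
platform details are placeholder assumptions, and the live-iso bootstrap
mechanism itself is not visible in the config:

```yaml
apiVersion: v1
baseDomain: example.com       # placeholder
metadata:
  name: compact-cluster       # placeholder
compute:
- name: worker
  replicas: 0                 # no dedicated workers; the masters are schedulable
controlPlane:
  name: master
  replicas: 3                 # one of these also hosts the live-iso bootstrap services
platform:
  baremetal: {}               # real deployments need host/BMC details here; elided
```
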
Note that we will not support pivoting the initial node to a worker role:
traffic to e.g. the API VIP should never reach a worker node, and there is a
risk of exactly that happening (for instance, if an external loadbalancer
configuration were not updated) if the bootstrap host were allowed to pivot
to a worker.

### Test Plan

We should test in baremetal (or emulated baremetal) environments with 3-node
clusters, using machines that represent our minimum target, and ensure our
e2e tests operate reliably with this new topology.

We should also add testing of the controlplane scaling/pivot (not necessarily
on baremetal) to ensure this is reliable. It may be that this overlaps with
some existing master-replacement testing.

### Graduation Criteria

TODO

### Upgrade / Downgrade Strategy

This is an install-time variation, so there is no upgrade/downgrade impact.

## Implementation History

TODO: links to existing PoC code/docs/demos

## Drawbacks

The main drawback of this approach is that it requires a deployment topology
and controlplane scaling flow which are not likely to be adopted by any of
the existing cloud platforms; it thus moves away from the well-tested path
and increases the risk of regressions and corner-cases not covered by
existing platform testing.

Relatedly, it seems unlikely that the existing cloud platforms would adopt
this approach, since creating the bootstrap services on a dedicated VM is
easy in a cloud environment, and switching to this solution could potentially
add walltime to the deployment (the additional time for the 3rd master to
pivot/reboot and join the cluster).

## Alternatives

One possible alternative is to have master-0 deploy a single-node
controlplane and then provision the remaining two hosts. This idea has been
rejected, as scaling from 1 to 3 masters is likely riskier than establishing
initial quorum with a 2-node controlplane; the latter should be similar to
the degraded mode entered when any master fails in an HA deployment, and is
thus a more supportable scenario.