Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MGMT-17771: Adds enhancement for FIPS with multiple RHEL installer versions #6290

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

carbonin
Copy link
Member

@carbonin carbonin commented May 8, 2024

In order for an OpenShift cluster to be considered FIPS compliant the installer must be run on a system with FIPS compliant openssl libraries. This means using a dynamically linked openshift-install binary against the openssl libraries present on our container image. Today this is not a problem because all openshift-install binaries in use have been expecting to link to RHEL 8 based openssl libraries, but now OpenShift 4.16 will ship and installer that requires RHEL 9 libraries.

This will require assisted-service to maintain a way to run the openshift-install binary in a compatible environment for multiple openssl versions.

List all the issues related to this PR

https://issues.redhat.com/browse/MGMT-17771

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see [CONTRIBUTING] guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 8, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented May 8, 2024

@carbonin: This pull request references MGMT-17771 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

In order for an OpenShift cluster to be considered FIPS compliant the installer must be run on a system with FIPS compliant openssl libraries. This means using a dynamically linked openshift-install binary against the openssl libraries present on our container image. Today this is not a problem because all openshift-install binaries in use have been expecting to link to RHEL 8 based openssl libraries, but now OpenShift 4.16 will ship and installer that requires RHEL 9 libraries.

This will require assisted-service to maintain a way to run the openshift-install binary in a compatible environment for multiple openssl versions.

List all the issues related to this PR

https://issues.redhat.com/browse/MGMT-17771

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see [CONTRIBUTING] guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 8, 2024
Copy link

openshift-ci bot commented May 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carbonin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2024
## Summary

In order for an OpenShift cluster to be considered FIPS compliant the installer
must be run on a system with FIPS compliant openssl libraries. This means using
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the system running the installer would need to have FIPS mode enabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that's not really relevant for this topic. Do you still want it added?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without fips mode enabled, the binary won't use openssl at all, so I think it's a relevant detail to include.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three parts of the problem are:

  1. you must use the dynamically-linked binary (openshift-baremetal-install/openshift-install-fips) b/c sometimes it will need to do FIPS
  2. if you use a dynamically linked library, you must use a RHEL 9 userspace from 4.16 and may use one for pre-4.16
  3. but if you want to do FIPS pre-4.16 you must use a RHEL 8 userspace

present on our container image. Today this is not a problem because all
`openshift-install` binaries in use have been expecting to link to RHEL 8 based
openssl libraries, but now OpenShift 4.16 will ship and installer that requires
RHEL 9 libraries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we over-emphasizing the details of openssl and dynamic linking in this paragraph? AIUI nothing about how that works is changing. The second sentence seems to imply that there might be a statically-linked binary available, or that it's an option, but is that the case?

Can we boil this down to:

When FIPS mode is enabled on the host running the installer:

  • for openshift < 4.16, the installer must run in a RHEL 8 userspace
  • for openshift >= 4.16, the installer must run in a RHEL 9 userspace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second sentence seems to imply that there might be a statically-linked binary available, or that it's an option, but is that the case?

Yes, there is. Actually I'm working on a bug right now that is solved by using the openshift-install binary from 4.16. https://issues.redhat.com/browse/OCPBUGS-33227

That said, it is that simple. I just figured it would be easier to do this split in any case.
If we also want to consider the static binary for non-fips cases we just add complexity (3 cases instead of just 2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where would someone obtain a statically vs dynamically linked build?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of https://issues.redhat.com/browse/CORS-3418 openshift-install is statically linked and as of https://issues.redhat.com/browse/CORS-3419 it supports baremetal installs.

openshift-baremetal-install is now only intended for use with FIPS use-cases as that's the only time you'd need a dynamically linked binary. https://issues.redhat.com/browse/WRKLDS-1171 added a new command to oc release extract so it could be referenced as openshift-install-fips which more accurately represents its purpose.

assisted-service will take over as usual from there.

Each of these containers should maintain its own cache of installers as
assisted-service does now and should be built with the required packages to run
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What packages are required? Is this just implying that one image will have a RHEL 8 base, and the other a RHEL 9 base, each of which comes with different content? Or are there literally specific packages that need to be installed with dnf in our Containerfiles?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would guess the openshift-client oc CLI would? Or is that something we'd extract from the release image specified?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure. I'll add this to open questions, but I'd expect the required libraries it to be documented with the installer somewhere.


### Open Questions

1. What does the API look like for the runner containers? What data should be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple start would be to have a shared volume among all three containers.

  1. assisted-service would invoke an installer within the correct "runner container" and wait
  2. the installer would generate content and write it to the shared volume
  3. assisted-service would get the signal that the installer run was complete and get a reference to the location for its output

The shared storage could be ephemeral; it just needs to be persistent for the life of the pod.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long do we expect the openshift-install binary to take? It should be quick enough that we could keep an http request open, and write the response to the client when it's done, right? No need for an asynchronous task API?

Maybe the exception would be if the installer binary isn't already in the local cache. Should assisted-service continue managing one cache that gets mounted into each installer-runner container?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cache will be managed by the installer runner container and this container will also upload the resulting artifacts to the required location (either the shared data volume or s3).

I don't expect the operation to take too long. I think we could to a single http request for this. Even if we need to tweak the timeout parameters I'd rather that than build an async system.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also regarding the storage side of this.

The cache and the place where the generated files are stored after the installer is run is already a volume (specifically a PV) and this is what I intended to use for the shared storage, so I think that's a non-issue.

My question here was what exactly will we need to pass over the API call.


## Design Details [optional]

The new runner containers will expose a HTTP server listening only for local
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this look reasonable?

  • A new installer-runner service will be created, written in go
  • The installer-runner will be compiled twice: once in a RHEL 8 builder image, and once in a RHEL 9 builder image, with each resulting binary being placed into a RHEL base image of corresponding version.

the installer. This seems possible, but doesn't make much sense if we want to
run a persistent service rather than a job per cluster. Running a job per
cluster wouldn't scale particularly well and would also be an even larger
architectural change.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requiring k8s to run Jobs would also break compatibility for running assisted-installer with podman.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we support assisted-installer on podman.
Also at some point podman will get it: containers/podman#17011.

The assisted-service can still be persistent it can use k8s jobs for executing the openshift-installer create-ignition manifests commands.
Why using a job is considered "a larger architectural change" ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't offer normal product support for it. But as a project: it works, we invite people to try it, and people use and depend on it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why using a job is considered "a larger architectural change" ?

I suppose it's not that much larger really, but it makes the runtime more complicated.
Running jobs over the kube API feels heavier to me than making an http call to a persistently running container in the same pod (effectively an RPC).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more difficult to arrange for a shared filesystem with a Job.

You also have to account for all of the things that could happen when scheduling the job, including that it might end up on a node that fails, or on a node that gets drained right as it starts, etc. It would definitely require an asynchronous approach to watching and waiting for the work to be completed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do go with Jobs then both ABI and podman would either need to stick with the current method of running the installer on the service container (which might be fine for ABI since it will always be using the matching image, and podman because we'll never support a FIPS cluster that way) or find some other way to spin up the image from the release to run the installer. In any case we end up with needing to support and maintain two methods for running the installer which isn't ideal.

I'm also worried about just how many jobs we might end up running at once.
I don't know exactly what the peak number of installs at a time is on the SaaS, but I know it's at least several hundred simultaneously and several thousand in a few hours when the scale team is testing the on-prem deployment through MCE, @akrzos likely has those numbers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we don't offer normal product support for the podman runtime method, I don't see a reason why someone couldn't use it on a FIPS-enabled system to install a FIPS-enabled cluster.

Copy link
Member

@zaneb zaneb May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we don't offer normal product support for the podman runtime method, I don't see a reason why someone couldn't use it on a FIPS-enabled system to install a FIPS-enabled cluster.

I don't disagree, but the question is more whether we need the upstream project to support installing both pre-4.16 and post-4.16 clusters with FIPS from the same podman deployment and not ask the user to create separate podman deployments. That sounds like the kind of requirement that products have but which upstream projects can dispense with.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth I added some more detail about these alternatives to the actual enhancement

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't build a different assisted-installer service to install one version of openshift vs. another. It's just one service that theoretically covers them all.

In theory we could make two parallel builds available and tell users to run one stack for version <=4.15 and the other for versions >4.16, but that seems like more work than just solving the problem properly for everyone whether they're running the upstream or downstream build.

## Design Details [optional]

The new runner containers will expose a HTTP server listening only for local
connections. assisted-service will POST to this server when it needs manifests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of local connections, it's worth considering a unix socket instead of the network stack. We did that for communicating with ansible-runner from an operator process. It's a simpler approach when all communication is local; no TLS, no auth, no DNS, no network protocols, ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I agree. I think a socket would work well for this.

but that still requires us to publish a new one for the architecture
assisted-service will not be using.

## Design Details [optional]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This service closely matches what we need to implement. It could almost be borrowed as-is, mostly modifying what action it takes when receiving a request. Its purpose is to receive events from ansible-runner on a local unix socket shared between the containers.

https://github.com/operator-framework/ansible-operator-plugins/blob/main/internal/ansible/runner/eventapi/eventapi.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, looks about right to me, thanks for the reference

Comment on lines 43 to 44
- Dynamically determining a given release's RHEL version. The RHEL version will
be a part of configuration tied to the minor release
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, how will this look? Will it be an index we maintain like the default_os/release images files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that, but now I'm rethinking it.

Maybe it would be better to just switch on if the version is 4.16 or later.
@mhrivnak mentioned OKD in this context. What do the cluster versions look like when OKD is in use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image list trimmed:

oc adm release info quay.io/openshift/okd@sha256:46b462be1e4c15ce5ab5fba97e713e8824bbb9f614ac5abe1be41fda916920cc
Name:           4.15.0-0.okd-2024-03-10-010116
Digest:         sha256:46b462be1e4c15ce5ab5fba97e713e8824bbb9f614ac5abe1be41fda916920cc
Created:        2024-03-10T06:53:44Z
OS/Arch:        linux/amd64
Manifests:      707
Metadata files: 1

Pull From: quay.io/openshift/okd@sha256:46b462be1e4c15ce5ab5fba97e713e8824bbb9f614ac5abe1be41fda916920cc

Release Metadata:
  Version:  4.15.0-0.okd-2024-03-10-010116
  Upgrades: <none>
  Metadata:

Component Versions:
  kubernetes 1.28.7        
  machine-os 39.20240210.3 Fedora CoreOS

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we'd still get 4.15.0-0.okd-2024-03-10-010116 in this case.

I mean I suppose if no one is doing FIPS with OKD then we should be able to apply the same rule and it will work out. (4.16 OKD would go to the el9 runner)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not always use fips compliant installer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will, but which FIPS compliant installer we use (rhel8 or rhel9) will be determined by the release version.

Comment on lines 48 to 49
Two additional containers will run alongside assisted-service in the same pod.
These "installer-runner" containers will, when triggered by assisted-service,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these containers expected to always run? Like even during non-fips mode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the idea, yeah.
My intention is to use these containers for both fips and non-fips cases as well since there shouldn't really be a difference and if we were to only use them for fips they would get tested very infrequently.

generated. The runner container will respond with any error that occurred while
generating the manifests.

### Open Questions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the new microservice to have its own repo? That may be a good end-state for the sake of isolated testing and such.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I briefly called this out in New Images under Implementation Details, but if you think it would be good to put somewhere else or just talked about more prominently then we can do that.

I don't think it's an open question though. For this pass I'm going to create the service in assisted-service using the existing code to run the installer. Extracting and splitting everything up would definitely come later.

- Allow for a single installation of assisted-service to install FIPS-compliant
clusters using `openshift-install` binaries built against RHEL 8 or RHEL 9

- Allow for FIPS compliant clusters to be installed from the SaaS offering or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does SaaS requires also apps SRE to chime in and install FIPS compliant clusters?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have an open task for app-sre https://issues.redhat.com/browse/SDE-3692

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, also this comment was more about the solution allowing for this possibility not necessarily that, once implmented, we will be able to install FIPS from the SaaS. The main goal is to make FIPS work for both rhel8 and rhel9 installer releases.

Creating, maintaining, and releasing additional images is a non-trivial amount
of additional work.

## Alternatives
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're saying that <4.16 is not fips complaint anyhow, can we say that for those releases we gonna use the statically linked one and for >=4.16, dynamically linked with rhel9? Can we use single existing container then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're saying that. OpenShift can be installed in FIPS mode today, and that's been the case for a while.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in ACM/SaaS world? or in general?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general and in ACM IIUC

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sources for "OpenShift can be installed in FIPS mode today" include 4.12 docs for FIPS installation. And ROSA FedRAMP docs:

ROSA only uses FIPS-validated modules to process cryptographic libraries.

## Proposal

Two additional containers will run alongside assisted-service in the same pod.
These "installer-runner" containers will, when triggered by assisted-service,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triggered how?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plan right now is for an http API, I'll add it here.

the installer. This seems possible, but doesn't make much sense if we want to
run a persistent service rather than a job per cluster. Running a job per
cluster wouldn't scale particularly well and would also be an even larger
architectural change.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we support assisted-installer on podman.
Also at some point podman will get it: containers/podman#17011.

The assisted-service can still be persistent it can use k8s jobs for executing the openshift-installer create-ignition manifests commands.
Why using a job is considered "a larger architectural change" ?

@romfreiman
Copy link
Contributor

What about ABI? Wont it be affected by this change?

@carbonin
Copy link
Member Author

What about ABI? Wont it be affected by this change?

ABI doesn't actually have this problem since they build assisted-service on the base image that corresponds with the installer, but making this change might affect them if we don't also continue to support the current way we run the installer (in the assisted-service container).

Also it depends on what you mean by "affect". If, after this change, they also deploy assisted as described in the enhancement (with the sidecar containers) then things should just work for them, but this will affect them in that they will likely need to update the deployment code for their assisted-service instance.

cc @bfournie @pawanpinjarkar @zaneb

Copy link
Member

@zaneb zaneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about ABI? Wont it be affected by this change?

We'll want to keep the same implementation, so we'll likely need to deploy the other container in ABI as well.
The actual issue doesn't affect ABI because we only deploy one version of OpenShift from each assisted-service, so we make sure it always matches.

## Summary

In order for an OpenShift cluster to be considered FIPS compliant the installer
must be run on a system with FIPS compliant openssl libraries. This means using
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three parts of the problem are:

  1. you must use the dynamically-linked binary (openshift-baremetal-install/openshift-install-fips) b/c sometimes it will need to do FIPS
  2. if you use a dynamically linked library, you must use a RHEL 9 userspace from 4.16 and may use one for pre-4.16
  3. but if you want to do FIPS pre-4.16 you must use a RHEL 8 userspace

Creating, maintaining, and releasing additional images is a non-trivial amount
of additional work.

## Alternatives
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered installing the RHEL 8 OpenSSL 1.1 into the RHEL 9 container alongside OpenSSL 3.0?
(Note: I haven't tried this to see what conflicts.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but I'm a bit worried about landing this in time and this approach feels like it has more unknown potential pitfalls than the approach described here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did bring up the alternative, which should probably be added to this doc, of extracting an entire rhel 8 userspace within a rhel 9 userspace, and using chroot to run an installer that needs rhel 8. But that brings other complexities around image build and management that probably aren't worthwhile compared to just running two copies of a small process in two different container images.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't need chroot. We can use ldpreload env to point to the relevant path when installing <4.16

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that this could be as simple as installing a separate ssl version somewhere else and setting some env vars when running the installer?

I can give that a try if there's a chance it might work.

Will ssl be the only library I need to override in this way? Everything else will be backward compatible? Will this actually be FIPS compliant? Are there any other requirements we would also need to satisfy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running a rhel8 container (our current released image) the files provided by the older openssl rpm are these:

bash-4.4$ rpm -ql openssl-libs-1.1.1k-12.el8_9.x86_64
/etc/pki/tls
/etc/pki/tls/certs
/etc/pki/tls/ct_log_list.cnf
/etc/pki/tls/misc
/etc/pki/tls/openssl.cnf
/etc/pki/tls/private
/usr/lib/.build-id
/usr/lib/.build-id/14
/usr/lib/.build-id/14/0abbb0c09726652dd61128b74b8bb5010f5542
/usr/lib/.build-id/42
/usr/lib/.build-id/42/9301d8c47d78f29d7a39fe01d8076e7b656e4e
/usr/lib/.build-id/6f
/usr/lib/.build-id/6f/619a1566052cd522eb945d91197ad60852a8f8
/usr/lib/.build-id/bb
/usr/lib/.build-id/bb/0f1c2857f6f9e4f68c7c5dc36a41c441318eca
/usr/lib/.build-id/e1
/usr/lib/.build-id/e1/1d63787a0230c919507d7ebc73e8af7e8e8a2b
/usr/lib64/.libcrypto.so.1.1.1k.hmac
/usr/lib64/.libcrypto.so.1.1.hmac
/usr/lib64/.libssl.so.1.1.1k.hmac
/usr/lib64/.libssl.so.1.1.hmac
/usr/lib64/engines-1.1
/usr/lib64/engines-1.1/afalg.so
/usr/lib64/engines-1.1/capi.so
/usr/lib64/engines-1.1/padlock.so
/usr/lib64/libcrypto.so.1.1
/usr/lib64/libcrypto.so.1.1.1k
/usr/lib64/libssl.so.1.1
/usr/lib64/libssl.so.1.1.1k
/usr/share/licenses/openssl-libs
/usr/share/licenses/openssl-libs/LICENSE

I suppose I could use a multi-stage build to copy those files into a special directory on the final (rhel9) image then try use LD_PRELOAD to push those library versions instead of the ones installed on the system.

I'd need a fully FIPS enabled environment to test if it's working though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LD_PRELOAD (or rather, LD_LIBRARY_PATH which I think is what you're going for here) is not really the issue - libssl.so.1.1 and libssl.so.3.0 are different major versions of the shared library, so AIUI the binary will know to load the right one. This issue is what other files the RPM installs that might conflict. Mostly just those .cnf files by the looks of it?

Copy link
Member Author

@carbonin carbonin May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you can install different versions of the same RPM on the same system.
Also I don't know if the older ssl package is even available in the el9 repos.

My idea was to use a multi-stage container build, install the ssl rpm on an el8 image then copy the files directly into a directory in later stage. Then when we know we need to run the el8 binary we'd set whatever envs we need to allow it to find the old so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully I'll have an environment to test this in this morning and then if it works I'll run it by the FIPS experts to figure out if this process would be considered compliant.

In order for an OpenShift cluster to be considered FIPS compliant the installer
must be run on a system with FIPS mode enabled and with FIPS compliant openssl
libraries installed. This means using a dynamically linked `openshift-install`
binary against the openssl libraries present on our container image. Today this
is not a problem because all `openshift-install` binaries in use have been
expecting to link to RHEL 8 based openssl libraries, but now OpenShift 4.16 will
ship an installer that requires RHEL 9 libraries.

This will require assisted-service to maintain a way to run the
`openshift-install` binary in a compatible environment for multiple openssl
versions. Specifically FIPS-enabled installs for pre-4.16 releases will need to
be run on an el8 image and 4.16 and later releases will need to be run on an
el9 image (regardless of FIPS).

Resolves https://issues.redhat.com/browse/MGMT-17771
@carbonin carbonin force-pushed the multi-rhel-fips-enhancement branch from c1f6adf to 6366631 Compare May 14, 2024 12:37
Comment on lines +205 to +208
1. Create a separate userspace for el8 libraries and chroot when those libraries
are required.
- This seems a bit complicated and it will likely make our image quite a bit
larger than it already is (~1.3G).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this out and was able to get both installers to run in a container on a FIPS-enabled RHEL9 host using chroot for the rhel8 one. I only needed to copy /usr from the el8 image and create the top-level symlinks in the chroot target dir. Based on my tests with the centos images this adds about 200M to the image size.

I still need to get assurance that this would actually result in a FIPS-compliant cluster, but if it does it seems like the most promising option to me.

I think this is worth a POC. @mhrivnak @romfreiman What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this approach cause a problem for image scanning and tracking of packages for CVEs and such?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asked about this in the FIPS-related internal channels and it turns out nothing like this would be considered FIPS-compliant so this disqualifies any approach that doesn't run the installer on a base container image that it is expecting to run on.

I'll add this to the enhancement text.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chroot :)

@carbonin carbonin force-pushed the multi-rhel-fips-enhancement branch from 78e9ee1 to 0b01fd9 Compare May 16, 2024 13:01
@carbonin carbonin force-pushed the multi-rhel-fips-enhancement branch from 0b01fd9 to 9a69ecd Compare May 16, 2024 19:31
image *must* match the installer for the resulting cluster to be considered
FIPS-compliant so none of these multi-library options are valid.

### Publish multiple assisted-service images
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romfreiman does this look like what you were talking about?
Tried to capture it here.


### Publish multiple assisted-service images

It's likely that a user installing in FIPS mode will only be interested in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we think this is true?

Everyone starts with a single version. But existing clusters and their use cases lag behind new clusters and new use cases. I would expect that most users end up with a mix of openshift versions over time.

Keep in mind that users need the ability to re-provision their clusters at any time, for recovery from all kind of unexpected events. And a growing number of customers depend on regular "cluster as a service" provisioning. Telling such a user that they can only (re-)provision a subset of openshift versions would generally be a big disadvantage.

Even a large-scale user with relatively uniform versions is not going to upgrade all of their clusters at once. It would not be ideal to put them in a position where they can only provision clusters for the newest openshift versions in their fleet.

Is there reason to think that a FIPS user would have different patterns and expectations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that this is an assumption on my part, but my experience with even moderately conservative customers is that they have a single version of OCP that is vetted and allowed in the org (at least in production).

My guess was that any user interested in full FIPS compliance would be even more strict than that, but I have no actual users to back any of this up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for upgrade and reprovisioning. I do consider this option a temporary solution that buys us time to properly think through and vet something that would work more generically. Ideally in the version following the one this alternative would be implemented in we'd implement the main body of this enhancement (or something similar).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They start with a single vetted version, then that progresses over time, but upgrades lag.

I think it's still better to implement this enhancement as described, which would fully preserve the existing use cases and user experience, and then optimize it later as needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhrivnak all good. can we find such customer now? we'll have it as an rfe and improve in the next release. I dont think we should restructure the whole service 2 weeks before the release.
for next release, lets reconsider.

choose which to deploy based on configuration (likely an annotation since a more
robust solution would be preferred in the future).

For example, in the case that a user knew they wanted to deploy OCP 4.14 from a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that the user who owns the configuration of the management cluster itself is the same user who is provisioning clusters. But often that role is separate. The "Cluster Creator" would need to work with the cluster admin to re-configure assisted-installer to switch it from el8 mode to el9 mode. Then what if they need to switch back? This could become a real hassle.

It would help to have automation that recognizes when the config doesn't work for a desired install operation, but resolving that could be a pain. Even for cases where it is the same person doing both personas, it's inconvenient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already something that the application admin is choosing to a degree when they set the available OS images, so I don't think it's too much of a stretch for them to also communicate which OCP versions will be in use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. So then the time will come when the cluster creator asks the admin to add a new version that requires removal of all other versions. Then they'll have to negotiate an upgrade plan with all stakeholders to get the whole fleet safely over the hump, etc etc. Doesn't that seem like a pain? If we can avoid creating that burden for our customers, we'll all be better-off.

Copy link

openshift-ci bot commented Dec 5, 2024

@carbonin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/edge-e2e-metal-assisted-cnv-4-16 e87cd02 link true /test edge-e2e-metal-assisted-cnv-4-16
ci/prow/edge-e2e-metal-assisted-odf-4-16 e87cd02 link true /test edge-e2e-metal-assisted-odf-4-16
ci/prow/edge-e2e-metal-assisted-mtv-4-17 e87cd02 link true /test edge-e2e-metal-assisted-mtv-4-17
ci/prow/edge-e2e-metal-assisted-osc-4-17 e87cd02 link true /test edge-e2e-metal-assisted-osc-4-17
ci/prow/edge-e2e-metal-assisted-osc-sno-4-17 e87cd02 link true /test edge-e2e-metal-assisted-osc-sno-4-17

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants