Non-disruptive certificate rotation #4891

Open
guydc opened this issue Dec 11, 2024 · 3 comments
Labels
area/infra-mgr Issues related to the provisioner used for provisioning the managed Envoy Proxy fleet.

Comments

@guydc
Contributor

guydc commented Dec 11, 2024

Description:
Currently, we have mTLS connections for:

  • Envoy Gateway <> Envoy
  • Envoy <> Rate Limit server

Envoy Gateway generates client and server certificates for all of the above components and typically provides them to the relevant pods as mounted secrets. A job that runs in the Helm pre-install and pre-upgrade hooks is responsible for rotation.

It is a common security practice to use short-lived certificates that are rotated frequently. In Envoy Gateway, CA certificates and leaf certificates are handled with the same level of security (storage, access, ...), so both should be rotated.

To support frequent and non-disruptive rotation, the following is required:

  • All components are capable of dynamically reloading certificates when they change.
  • Certificates are rotated in a backwards-compatible manner. For example, a new trusted CA is added, and the old CA is removed only after all components have loaded the latest CA/leaf certificates. This is especially relevant for data-plane components like Rate Limit, where client-facing failures may occur if Envoy and Rate Limit are not both synchronized on the latest CA/leaf certificates.

Currently, these requirements are not met.


@guydc guydc added the area/infra-mgr Issues related to the provisioner used for provisioning the managed Envoy Proxy fleet. label Dec 11, 2024
@zhaohuabing
Member

zhaohuabing commented Dec 18, 2024

Hi @guydc, thanks for explaining the issue. It's very clear!

From my understanding, here's what we need to do:

  • Keep the old CA when rotating certs
  • Envoy Gateway (EG), Envoy, and the Rate Limit server should load both the old and the new CA so that verification does not break during the transition period.
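
The two points above can be sketched in Go with the standard library. This is only an illustration under stated assumptions, not how Envoy Gateway builds its trust store: `newCA`, `issueLeaf`, `trustBundle`, and `trusted` are hypothetical helpers, and the CAs are throwaway in-memory ones. The idea shown is that a trust bundle containing both the old and the new CA accepts leaf certificates issued by either, while a bundle with only the new CA rejects leaves that have not been re-issued yet.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"math/big"
	"time"
)

// newCA creates a throwaway self-signed CA for the demonstration.
func newCA(cn string) (*x509.Certificate, *ecdsa.PrivateKey, []byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, nil, nil, err
	}
	return cert, key, pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

// issueLeaf signs a short-lived leaf certificate with the given CA.
func issueLeaf(ca *x509.Certificate, caKey *ecdsa.PrivateKey, cn string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// trustBundle builds one pool from any number of PEM-encoded CA certificates,
// e.g. the old and the new CA during a rotation.
func trustBundle(caPEMs ...[]byte) (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	for _, p := range caPEMs {
		if !pool.AppendCertsFromPEM(p) {
			return nil, errors.New("no CA certificate found in PEM input")
		}
	}
	return pool, nil
}

// trusted reports whether the leaf chains to any CA in the pool.
func trusted(leaf *x509.Certificate, roots *x509.CertPool) bool {
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}
```

During the transition, each peer would use `trustBundle(oldPEM, newPEM)`; the old CA would be dropped from the bundle only once every component presents a leaf issued by the new CA.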

@kfox1111

SPIRE support (https://github.com/spiffe/spire) may be directly or indirectly related to this. Should it get its own ticket?

SPIRE can push updated certificates to the workload as it rotates them. Typically, the certificates may be replaced each hour and the CAs updated daily.

I would love to be able to use SPIRE with Envoy Gateway.

@guydc
Contributor Author

guydc commented Jan 9, 2025

Hi @kfox1111, I think this deserves its own issue. Are you interested in using SPIRE only for the Gateway itself (Envoy, Envoy Gateway, Rate Limit server, ...), or also for communication with the backends?
