Skip to content

Inject certs for metrics endpoint when cert-manager is enabled #478

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ardaguclu
Copy link
Contributor

@ardaguclu ardaguclu commented Mar 28, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it

ServiceMonitor resource is configured to serve TLS with insecureSkipVerify: true . However, this is not recommended for production use cases. This PR injects the appropriate certificates to the ServiceMonitor to serve from the trusted certificates within the cluster, when cert-manager (or another external certificate management tool) is enabled.

This PR configures kustomize and helm charts according to this.

There is no functional change, when internal cert management is enabled (which is the default setting).

This PR is inspired by kubernetes-sigs/kueue#4385

Does this PR introduce a user-facing change?

Injecting certificates to ServiceMonitor in order to serve securely, when cert-manager is enabled

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 28, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ardaguclu
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 28, 2025
Copy link

netlify bot commented Mar 28, 2025

Deploy Preview for kubernetes-sigs-lws canceled.

Name Link
🔨 Latest commit 21ae63b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-lws/deploys/67f7a3e1a60b3b0008109cd0

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 28, 2025
@yankay
Copy link
Member

yankay commented Apr 1, 2025

I'm not sure, whether additional documentation needs to be supplemented later, like kubernetes-sigs/kueue#4598

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2025
@ardaguclu ardaguclu force-pushed the secure-serving-metrics branch from c5ded7e to c26d738 Compare April 2, 2025 05:10
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2025
@ardaguclu ardaguclu force-pushed the secure-serving-metrics branch from c26d738 to 2fcf1fa Compare April 2, 2025 05:11
@ardaguclu
Copy link
Contributor Author

ardaguclu commented Apr 2, 2025

I'm not sure, whether additional documentation needs to be supplemented later, like kubernetes-sigs/kueue#4598

I think, adding documentation in a followup PR is reasonable and useful.

/cc @kerthcet

@@ -108,6 +108,10 @@ type InternalCertManagement struct {
// Enable controls whether to enable internal cert management or not.
// Defaults to true. If you want to use a third-party management, e.g. cert-manager,
// set it to false. See the user guide for more information.
// This secret is mounted to the lws controller manager pod. The mount
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What secret do you mean here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for dropping your comment. I think, mentioning this behavior in this flag is not relevant. So that I've removed this comment entirely.

@ardaguclu ardaguclu force-pushed the secure-serving-metrics branch from 2fcf1fa to aa1a9c4 Compare April 3, 2025 06:34
Copy link

@atiratree atiratree left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I have opened a similar PR for the jobset controller: kubernetes-sigs/jobset#863

key: ca.crt
cert:
secret:
name: metrics-server-cert
Copy link

@atiratree atiratree Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it is a bit weird to let the prometheus client use the same cert that we use in our server. I have left this up to the user to specify which certificate their prometheus instance is using https://github.com/kubernetes-sigs/jobset/pull/863/files#diff-d1323215bef44ba29d323d647a042eed78e05ce849be1e3f8c067a89fd735359R23-R26

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are hardcoding nearly all fields and in my opinion we can hardcode it too.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it is not entirely correct to let the prometheus use this as a client cert for auth. As mentioned in kubernetes-sigs/jobset#863 (comment) there are many things that have to be configured before running cert manager + prometheus. So I think the best way would be to let user decide if they want to use the server certificate for this. They can easily opt in to it.

I have tried to document this in:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are many things that have to be configured before running cert manager + prometheus.

We only require to uncomment some configurations in kustomize.

If user relies on cert-manager, we can use the same certificates both in client and server because source of truth is the same (i.e. cert-manager) within this cluster.

If user relies on internal certificate management, skipping tls verification in servicemonitor would simply work. If we force users to find and inject a certificate in servicemonitor, users can't inject any arbitrary certificate. This certificate must comply with the one that is managed internally in LWS. I think, this dramatically hurts the user experience. Because I don't think, as a user I can successfully configure this cluster on my own.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the usability perspective, but technically that certificate is wrong. In mutual TLS the domain name would be off.

I am okay, by adding it as a default, but IMO it should be configurable and we should write a disclaimer there.

@@ -23,6 +23,7 @@ resources:
- ../webhook
# [CERTMANAGER] To enable cert-manager, uncomment all sections with 'CERTMANAGER'. 'WEBHOOK' components are required.
- ../internalcert
#- ../certmanager
# [PROMETHEUS] To enable prometheus monitor, uncomment all sections with 'PROMETHEUS'.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw I think it would be good to allow passing certs to metrics even without a cert manager: https://github.com/kubernetes-sigs/jobset/pull/863/files#diff-c0dbb993e3ea0a70d85d76af178d499d9fe0ffa7fc9ad44d58d1acc7875e71d0R91

@atiratree
Copy link

cc @kannon92, the questions can apply also to kubernetes-sigs/kueue#4385

@ardaguclu
Copy link
Contributor Author

/hold
to apply the suggested changes from @atiratree

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 3, 2025
@ardaguclu ardaguclu force-pushed the secure-serving-metrics branch from aa1a9c4 to ec507c1 Compare April 4, 2025 05:27
@ardaguclu
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 4, 2025
@ardaguclu ardaguclu force-pushed the secure-serving-metrics branch from ec507c1 to 34deba8 Compare April 7, 2025 07:51
@ardaguclu
Copy link
Contributor Author

I've tested this PR on a live cluster and it worked without any issue. Certificates are injected correctly and Prometheus successfully scraped the metrics in a secure way.
@kerthcet would you mind taking a look at this PR, when you have a chance?.

spec:
dnsNames:
- lws-controller-manager-metrics-service.{{ .Release.Namespace }}.svc
- lws-controller-manager-metrics-service.{{ .Release.Namespace }}.svc.cluster.local

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if the service name referenced a helm variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, users prefer simply applying these configurations to work with whatever set as default. Probably there is no functional need to configure the service name. I'll defer the decision to approvers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not saying it has to be configurable. I just mean to share a variable so we do not have to harcdode the name in multiple places.

@@ -0,0 +1,18 @@
{{- if .Values.enableCertManager }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this depend both on the cert manager and prometheus?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If cert-manager is enabled, metrics will be configured to run with TLS. That is independent from whether prometheus is enabled or not. If prometheus is not enabled, that means nobody scrapes metrics, but metrics server still runs in secure.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it makes sense to secure the metrics even without prometheus scraping. I will update my PR as well.

{{- if .Values.enableCertManager }}
- name: metrics-cert
secret:
secretName: lws-metrics-server-cert

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better if we could reference a common variable here and in the certificate

@@ -88,3 +93,16 @@ spec:
secret:
defaultMode: 420
secretName: lws-webhook-server-cert
{{- if .Values.enableCertManager }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mount the secret when prometheus is disabled?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -49,3 +49,7 @@ resources:
nodeSelector: {}
tolerations: []
affinity: {}
metrics:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lws-prometheus-k8s rolebinding monitoring namespace should be configurable IMO

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: metrics-cert # this name should match the one appeared in kustomizeconfig.yaml

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

helm review applies here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check any missing piece in helm configurations. But I tested kustomize configurations extensively and it works.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1,5 +1,6 @@
resources:
- certificate.yaml
- certificate-metrics.yaml

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to enable this by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default cert-manager is disabled. If we enable this by default, I think we would get Certificate resource not found error?.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok makes sense, let's secure metrics regardless of prometheus.

@ardaguclu ardaguclu force-pushed the secure-serving-metrics branch 2 times, most recently from 86f5385 to 5e67291 Compare April 8, 2025 10:42
@ardaguclu ardaguclu force-pushed the secure-serving-metrics branch from 5e67291 to db31af0 Compare April 8, 2025 10:50
key: ca.crt
cert:
secret:
name: metrics-server-cert

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the usability perspective, but technically that certificate is wrong. In mutual TLS the domain name would be off.

I am okay, by adding it as a default, but IMO it should be configurable and we should write a disclaimer there.

@@ -0,0 +1,18 @@
{{- if .Values.enableCertManager }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it makes sense to secure the metrics even without prometheus scraping. I will update my PR as well.

spec:
dnsNames:
- lws-controller-manager-metrics-service.{{ .Release.Namespace }}.svc
- lws-controller-manager-metrics-service.{{ .Release.Namespace }}.svc.cluster.local

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not saying it has to be configurable. I just mean to share a variable so we do not have to harcdode the name in multiple places.

@@ -88,3 +93,16 @@ spec:
secret:
defaultMode: 420
secretName: lws-webhook-server-cert
{{- if .Values.enableCertManager }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -15,7 +15,23 @@ spec:
port: https
scheme: https
tlsConfig:
insecureSkipVerify: true
{{- if .Values.enableCertManager }}
serverName: {{ include "lws.fullname" . }}-controller-manager-metrics-service.{{ .Release.Namespace }}.svc

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could use a variable here

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: metrics-cert # this name should match the one appeared in kustomizeconfig.yaml

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1,5 +1,6 @@
resources:
- certificate.yaml
- certificate-metrics.yaml

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok makes sense, let's secure metrics regardless of prometheus.

@ardaguclu ardaguclu mentioned this pull request Apr 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants