-
Notifications
You must be signed in to change notification settings - Fork 68
Inject certs for metrics endpoint when cert-manager is enabled #478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ardaguclu The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
✅ Deploy Preview for kubernetes-sigs-lws canceled.
|
I'm not sure, whether additional documentation needs to be supplemented later, like kubernetes-sigs/kueue#4598 |
c5ded7e
to
c26d738
Compare
c26d738
to
2fcf1fa
Compare
I think, adding documentation in a followup PR is reasonable and useful. /cc @kerthcet |
@@ -108,6 +108,10 @@ type InternalCertManagement struct { | |||
// Enable controls whether to enable internal cert management or not. | |||
// Defaults to true. If you want to use a third-party management, e.g. cert-manager, | |||
// set it to false. See the user guide for more information. | |||
// This secret is mounted to the lws controller manager pod. The mount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What secret do you mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for dropping your comment. I think, mentioning this behavior in this flag is not relevant. So that I've removed this comment entirely.
2fcf1fa
to
aa1a9c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I have opened a similar PR for the jobset controller: kubernetes-sigs/jobset#863
key: ca.crt | ||
cert: | ||
secret: | ||
name: metrics-server-cert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it is a bit weird to let the prometheus client use the same cert that we use in our server. I have left this up to the user to specify which certificate their prometheus instance is using https://github.com/kubernetes-sigs/jobset/pull/863/files#diff-d1323215bef44ba29d323d647a042eed78e05ce849be1e3f8c067a89fd735359R23-R26
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are hardcoding nearly all fields and in my opinion we can hardcode it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it is not entirely correct to let the prometheus use this as a client cert for auth. As mentioned in kubernetes-sigs/jobset#863 (comment) there are many things that have to be configured before running cert manager + prometheus. So I think the best way would be to let user decide if they want to use the server certificate for this. They can easily opt in to it.
I have tried to document this in:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are many things that have to be configured before running cert manager + prometheus.
We only require to uncomment some configurations in kustomize.
If user relies on cert-manager, we can use the same certificates both in client and server because source of truth is the same (i.e. cert-manager) within this cluster.
If user relies on internal certificate management, skipping tls verification in servicemonitor would simply work. If we force users to find and inject a certificate in servicemonitor, users can't inject any arbitrary certificate. This certificate must comply with the one that is managed internally in LWS. I think, this dramatically hurts the user experience. Because I don't think, as a user I can successfully configure this cluster on my own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the usability perspective, but technically that certificate is wrong. In mutual TLS the domain name would be off.
I am okay, by adding it as a default, but IMO it should be configurable and we should write a disclaimer there.
@@ -23,6 +23,7 @@ resources: | |||
- ../webhook | |||
# [CERTMANAGER] To enable cert-manager, uncomment all sections with 'CERTMANAGER'. 'WEBHOOK' components are required. | |||
- ../internalcert | |||
#- ../certmanager | |||
# [PROMETHEUS] To enable prometheus monitor, uncomment all sections with 'PROMETHEUS'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw I think it would be good to allow passing certs to metrics even without a cert manager: https://github.com/kubernetes-sigs/jobset/pull/863/files#diff-c0dbb993e3ea0a70d85d76af178d499d9fe0ffa7fc9ad44d58d1acc7875e71d0R91
cc @kannon92, the questions can apply also to kubernetes-sigs/kueue#4385 |
/hold |
aa1a9c4
to
ec507c1
Compare
/hold cancel |
ec507c1
to
34deba8
Compare
I've tested this PR on a live cluster and it worked without any issue. Certificates are injected correctly and Prometheus successfully scraped the metrics in a secure way. |
spec: | ||
dnsNames: | ||
- lws-controller-manager-metrics-service.{{ .Release.Namespace }}.svc | ||
- lws-controller-manager-metrics-service.{{ .Release.Namespace }}.svc.cluster.local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better if the service name referenced a helm variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, users prefer simply applying these configurations to work with whatever set as default. Probably there is no functional need to configure the service name. I'll defer the decision to approvers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not saying it has to be configurable. I just mean to share a variable so we do not have to harcdode the name in multiple places.
@@ -0,0 +1,18 @@ | |||
{{- if .Values.enableCertManager }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this depend both on the cert manager and prometheus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If cert-manager is enabled, metrics will be configured to run with TLS. That is independent from whether prometheus is enabled or not. If prometheus is not enabled, that means nobody scrapes metrics, but metrics server still runs in secure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it makes sense to secure the metrics even without prometheus scraping. I will update my PR as well.
{{- if .Values.enableCertManager }} | ||
- name: metrics-cert | ||
secret: | ||
secretName: lws-metrics-server-cert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be better if we could reference a common variable here and in the certificate
@@ -88,3 +93,16 @@ spec: | |||
secret: | |||
defaultMode: 420 | |||
secretName: lws-webhook-server-cert | |||
{{- if .Values.enableCertManager }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mount the secret when prometheus is disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not relevant anymore: https://github.com/kubernetes-sigs/lws/pull/478/files#r2032112051
charts/lws/values.yaml
Outdated
@@ -49,3 +49,7 @@ resources: | |||
nodeSelector: {} | |||
tolerations: [] | |||
affinity: {} | |||
metrics: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lws-prometheus-k8s rolebinding monitoring namespace should be configurable IMO
apiVersion: cert-manager.io/v1 | ||
kind: Certificate | ||
metadata: | ||
name: metrics-cert # this name should match the one appeared in kustomizeconfig.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
helm review applies here as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check any missing piece in helm configurations. But I tested kustomize configurations extensively and it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant mostly this https://github.com/kubernetes-sigs/lws/pull/478/files#r2032107950
@@ -1,5 +1,6 @@ | |||
resources: | |||
- certificate.yaml | |||
- certificate-metrics.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to enable this by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default cert-manager is disabled. If we enable this by default, I think we would get Certificate
resource not found error?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok makes sense, let's secure metrics regardless of prometheus.
86f5385
to
5e67291
Compare
5e67291
to
db31af0
Compare
key: ca.crt | ||
cert: | ||
secret: | ||
name: metrics-server-cert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the usability perspective, but technically that certificate is wrong. In mutual TLS the domain name would be off.
I am okay, by adding it as a default, but IMO it should be configurable and we should write a disclaimer there.
@@ -0,0 +1,18 @@ | |||
{{- if .Values.enableCertManager }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it makes sense to secure the metrics even without prometheus scraping. I will update my PR as well.
spec: | ||
dnsNames: | ||
- lws-controller-manager-metrics-service.{{ .Release.Namespace }}.svc | ||
- lws-controller-manager-metrics-service.{{ .Release.Namespace }}.svc.cluster.local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not saying it has to be configurable. I just mean to share a variable so we do not have to harcdode the name in multiple places.
@@ -88,3 +93,16 @@ spec: | |||
secret: | |||
defaultMode: 420 | |||
secretName: lws-webhook-server-cert | |||
{{- if .Values.enableCertManager }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not relevant anymore: https://github.com/kubernetes-sigs/lws/pull/478/files#r2032112051
@@ -15,7 +15,23 @@ spec: | |||
port: https | |||
scheme: https | |||
tlsConfig: | |||
insecureSkipVerify: true | |||
{{- if .Values.enableCertManager }} | |||
serverName: {{ include "lws.fullname" . }}-controller-manager-metrics-service.{{ .Release.Namespace }}.svc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could use a variable here
apiVersion: cert-manager.io/v1 | ||
kind: Certificate | ||
metadata: | ||
name: metrics-cert # this name should match the one appeared in kustomizeconfig.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant mostly this https://github.com/kubernetes-sigs/lws/pull/478/files#r2032107950
@@ -1,5 +1,6 @@ | |||
resources: | |||
- certificate.yaml | |||
- certificate-metrics.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok makes sense, let's secure metrics regardless of prometheus.
What type of PR is this?
/kind feature
What this PR does / why we need it
ServiceMonitor resource is configured to serve TLS with
insecureSkipVerify: true
. However, this is not recommended for production use cases. This PR injects the appropriate certificates to the ServiceMonitor to serve from the trusted certificates within the cluster, when cert-manager (or another external certificate management tool) is enabled.This PR configures kustomize and helm charts according to this.
There is no functional change, when internal cert management is enabled (which is the default setting).
This PR is inspired by kubernetes-sigs/kueue#4385
Does this PR introduce a user-facing change?