
Conversation

karthikvetrivel

This PR adds unit tests for the ServiceMonitor and Service controller functions, which manage Prometheus ServiceMonitor custom resources and Kubernetes Service resources for GPU telemetry components.

Changes

  1. Added TestServiceMonitor to validate ServiceMonitor lifecycle management
  2. Added TestService to validate Service resource management

Test Coverage

  • Correct state transitions based on component enablement
  • Handling of missing ServiceMonitor CRD
  • Resource creation, updates, and deletion
  • Different component states (dcgm-exporter, node-status-exporter, operator-metrics)

Test coverage increases from 18.7% to 19.7%.


copy-pr-bot bot commented Sep 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@karthikvetrivel force-pushed the test/add-service-servicemonitor-tests branch from 02d09d7 to bf8398e on September 18, 2025 19:12
Contributor

@cdesiniotis left a comment


@karthikvetrivel this is a great start! I made a first pass and left some feedback on the TestServiceMonitor method.

Comment on lines +1098 to +1108
testCases := []struct {
    description       string
    stateName         string
    crdPresent        bool
    dcgmEnabled       *bool
    nodeStatusEnabled *bool
    dcgmSMEnabled     *bool
    withEdits         bool
    wantState         gpuv1.State
    extraAssert       assertsFn
}{

Instead of separate fields for the various ClusterPolicy options, why not just define one field that contains the ClusterPolicy spec for the test case? For example, what about reducing this to:

Suggested change
testCases := []struct {
    description       string
    stateName         string
    crdPresent        bool
    dcgmEnabled       *bool
    nodeStatusEnabled *bool
    dcgmSMEnabled     *bool
    withEdits         bool
    wantState         gpuv1.State
    extraAssert       assertsFn
}{

testCases := []struct {
    description string
    stateName   string
    crdPresent  bool
    cpSpec      gpuv1.ClusterPolicySpec
    wantState   gpuv1.State
    extraAssert assertsFn
}{

This would eliminate the need for the code you have later on that updates the ClusterPolicy object for each test case.

Comment on lines +1177 to +1191
// Base ClusterPolicy
cp := &gpuv1.ClusterPolicy{Spec: gpuv1.ClusterPolicySpec{}}
// Configure enables
if tc.dcgmEnabled != nil {
    cp.Spec.DCGMExporter.Enabled = tc.dcgmEnabled
}
if tc.nodeStatusEnabled != nil {
    cp.Spec.NodeStatusExporter.Enabled = tc.nodeStatusEnabled
}
// Configure DCGM SM
if tc.stateName == "state-dcgm-exporter" {
    if tc.dcgmSMEnabled != nil {
        cp.Spec.DCGMExporter.ServiceMonitor = &gpuv1.DCGMExporterServiceMonitorConfig{Enabled: tc.dcgmSMEnabled}
    }
}

As noted above, I think we could remove the need for this code if we just defined the ClusterPolicy spec for each test case.

Comment on lines +1201 to +1210
// If edits are requested, seed CP config for edits
if tc.withEdits {
    cp.Spec.DCGMExporter.ServiceMonitor = &gpuv1.DCGMExporterServiceMonitorConfig{
        Enabled:          truePtr,
        Interval:         promv1.Duration("15s"),
        HonorLabels:      truePtr,
        AdditionalLabels: map[string]string{"a": "b"},
        Relabelings:      []*promv1.RelabelConfig{{Action: "keep"}},
    }
}

As noted above, I think we could remove the need for this code if we just embedded this configuration in the test case itself.

Comment on lines +1193 to +1200
// Build the ServiceMonitor resource template
sm := promv1.ServiceMonitor{
    ObjectMeta: metav1.ObjectMeta{Name: "test-sm", Labels: map[string]string{}},
    Spec: promv1.ServiceMonitorSpec{
        NamespaceSelector: promv1.NamespaceSelector{MatchNames: []string{"FILLED BY THE OPERATOR"}},
        Endpoints:         []promv1.Endpoint{{}},
    },
}

nit -- since this ServiceMonitor template does not differ between test cases, do we want to define this variable outside of the for loop?

Comment on lines +1169 to +1175
// Build fake client, optionally seed CRD existence by registering type only
b := fake.NewClientBuilder().WithScheme(scheme)
if tc.crdPresent {
    crd := &apiextensionsv1.CustomResourceDefinition{ObjectMeta: metav1.ObjectMeta{Name: ServiceMonitorCRDName}}
    b = b.WithObjects(crd)
}
k8sClient := b.Build()

Not critical, but we could consider adding a client field for each test case and building the fake client in-line for each test case. For example, for test cases where the ServiceMonitor CRD does not exist we would have:

client: fake.NewClientBuilder().WithScheme(scheme).Build()

while for the test cases where the ServiceMonitor CRD does exist we would have:

client: fake.NewClientBuilder().WithScheme(scheme).WithObjects(serviceMonitorCRD).Build()

Comment on lines +1153 to +1164
extraAssert: func(t *testing.T, c client.Client, name, ns string) {
    // Verify object created with edits
    found := &promv1.ServiceMonitor{}
    err := c.Get(context.TODO(), client.ObjectKey{Namespace: ns, Name: name}, found)
    require.NoError(t, err)
    require.Equal(t, promv1.Duration("15s"), found.Spec.Endpoints[0].Interval)
    require.Equal(t, true, found.Spec.Endpoints[0].HonorLabels)
    require.Equal(t, "b", found.Labels["a"])
    require.NotNil(t, found.Spec.Endpoints[0].RelabelConfigs)
    require.Equal(t, 1, len(found.Spec.Endpoints[0].RelabelConfigs))
},
},

Instead of defining custom assert logic here, why don't we define the expected ServiceMonitor object here and compare it to the actual object that gets created when executing the test case?

Comment on lines +1228 to +1230
if tc.extraAssert != nil {
tc.extraAssert(t, k8sClient, "test-sm", "test-ns")
}

As suggested in a prior comment, for test cases where we expect a ServiceMonitor object to be created, would it be possible to just compare the two service monitor objects (expected vs actual)?

dcgmSMEnabled: truePtr,
withEdits:     true,
wantState:     gpuv1.Ready,
extraAssert: func(t *testing.T, c client.Client, name, ns string) {

Instead of defining a closure for additional assert blocks, can we just define a separate test case that asserts the extra fields? That would make it easier to read.

Personally speaking, I'd avoid closures and stick to more procedural code here; it keeps test cases easier to follow.

wantState: gpuv1.Ready,
extraAssert: func(t *testing.T, c client.Client, name, ns string) {
    found := &corev1.Service{}
    err := c.Get(context.TODO(), client.ObjectKey{Namespace: ns, Name: name}, found)

We only use context.TODO() when we intend to pass in a ctx object later on. Since this most likely isn't a TODO, I would suggest using context.Background() instead.
