
Conversation

karthikvetrivel

This PR adds unit tests for the ServiceMonitor and Service controller functions, which manage Prometheus ServiceMonitor custom resources and Kubernetes Service resources for GPU telemetry components.

Changes

  1. Added TestServiceMonitor to validate ServiceMonitor lifecycle management
  2. Added TestService to validate Service resource management

Test Coverage

  • Correct state transitions based on component enablement
  • Handling of missing ServiceMonitor CRD
  • Resource creation, updates, and deletion
  • Different component states (dcgm-exporter, node-status-exporter, operator-metrics)

Test coverage increases from 18.7% to 19.7%.


copy-pr-bot bot commented Sep 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@karthikvetrivel force-pushed the test/add-service-servicemonitor-tests branch from 02d09d7 to bf8398e on September 18, 2025 19:12
Contributor

@cdesiniotis left a comment


@karthikvetrivel this is a great start! I made a first pass and left some feedback on the TestServiceMonitor method.

Comment on lines +1098 to +1108
testCases := []struct {
    description       string
    stateName         string
    crdPresent        bool
    dcgmEnabled       *bool
    nodeStatusEnabled *bool
    dcgmSMEnabled     *bool
    withEdits         bool
    wantState         gpuv1.State
    extraAssert       assertsFn
}{

Instead of separate fields for the various ClusterPolicy options, why not just define one field that contains the ClusterPolicy spec for the test case? For example, what about reducing this to:

Suggested change
testCases := []struct {
    description       string
    stateName         string
    crdPresent        bool
    dcgmEnabled       *bool
    nodeStatusEnabled *bool
    dcgmSMEnabled     *bool
    withEdits         bool
    wantState         gpuv1.State
    extraAssert       assertsFn
}{

testCases := []struct {
    description string
    stateName   string
    crdPresent  bool
    cpSpec      gpuv1.ClusterPolicySpec
    wantState   gpuv1.State
    extraAssert assertsFn
}{

This would eliminate the need for the code you have later on that updates the ClusterPolicy object for each test case.

Comment on lines +1177 to +1191
// Base ClusterPolicy
cp := &gpuv1.ClusterPolicy{Spec: gpuv1.ClusterPolicySpec{}}
// Configure enables
if tc.dcgmEnabled != nil {
    cp.Spec.DCGMExporter.Enabled = tc.dcgmEnabled
}
if tc.nodeStatusEnabled != nil {
    cp.Spec.NodeStatusExporter.Enabled = tc.nodeStatusEnabled
}
// Configure DCGM SM
if tc.stateName == "state-dcgm-exporter" {
    if tc.dcgmSMEnabled != nil {
        cp.Spec.DCGMExporter.ServiceMonitor = &gpuv1.DCGMExporterServiceMonitorConfig{Enabled: tc.dcgmSMEnabled}
    }
}

As noted above, I think we could remove the need for this code if we just defined the ClusterPolicy spec for each test case.

Comment on lines +1201 to +1210
// If edits are requested, seed CP config for edits
if tc.withEdits {
    cp.Spec.DCGMExporter.ServiceMonitor = &gpuv1.DCGMExporterServiceMonitorConfig{
        Enabled:          truePtr,
        Interval:         promv1.Duration("15s"),
        HonorLabels:      truePtr,
        AdditionalLabels: map[string]string{"a": "b"},
        Relabelings:      []*promv1.RelabelConfig{{Action: "keep"}},
    }
}

As noted above, I think we could remove the need for this code if we just embedded this configuration in the test case itself.

Comment on lines +1193 to +1200
// Build the ServiceMonitor resource template
sm := promv1.ServiceMonitor{
    ObjectMeta: metav1.ObjectMeta{Name: "test-sm", Labels: map[string]string{}},
    Spec: promv1.ServiceMonitorSpec{
        NamespaceSelector: promv1.NamespaceSelector{MatchNames: []string{"FILLED BY THE OPERATOR"}},
        Endpoints:         []promv1.Endpoint{{}},
    },
}

nit -- since this ServiceMonitor template does not differ between test cases, do we want to define this variable outside of the for loop?

Comment on lines +1169 to +1175
// Build fake client, optionally seed CRD existence by registering type only
b := fake.NewClientBuilder().WithScheme(scheme)
if tc.crdPresent {
    crd := &apiextensionsv1.CustomResourceDefinition{ObjectMeta: metav1.ObjectMeta{Name: ServiceMonitorCRDName}}
    b = b.WithObjects(crd)
}
k8sClient := b.Build()

Not critical, but we could consider adding a client field for each test case and building the fake client in-line for each test case. For example, for test cases where the ServiceMonitor CRD does not exist we would have:

client: fake.NewClientBuilder().WithScheme(scheme).Build()

while for the test cases where the ServiceMonitor CRD does exist we would have:

client: fake.NewClientBuilder().WithScheme(scheme).WithObjects(serviceMonitorCRD).Build()

Comment on lines +1153 to +1164
extraAssert: func(t *testing.T, c client.Client, name, ns string) {
    // Verify object created with edits
    found := &promv1.ServiceMonitor{}
    err := c.Get(context.TODO(), client.ObjectKey{Namespace: ns, Name: name}, found)
    require.NoError(t, err)
    require.Equal(t, promv1.Duration("15s"), found.Spec.Endpoints[0].Interval)
    require.Equal(t, true, found.Spec.Endpoints[0].HonorLabels)
    require.Equal(t, "b", found.Labels["a"])
    require.NotNil(t, found.Spec.Endpoints[0].RelabelConfigs)
    require.Equal(t, 1, len(found.Spec.Endpoints[0].RelabelConfigs))
},
},

Instead of defining custom assert logic here, why don't we define the expected ServiceMonitor object here and compare it to the actual object that gets created when executing the test case?

Comment on lines +1228 to +1230
if tc.extraAssert != nil {
tc.extraAssert(t, k8sClient, "test-sm", "test-ns")
}

As suggested in a prior comment, for test cases where we expect a ServiceMonitor object to be created, would it be possible to just compare the two service monitor objects (expected vs actual)?

dcgmSMEnabled: truePtr,
withEdits:     true,
wantState:     gpuv1.Ready,
extraAssert: func(t *testing.T, c client.Client, name, ns string) {

Instead of defining a closure for additional assert blocks, can we just define a separate test case that asserts the extra fields? That would make it easier to read.

Personally speaking, I'd avoid closures and stick to more procedural code here; it keeps test cases easier to follow.

wantState: gpuv1.Ready,
extraAssert: func(t *testing.T, c client.Client, name, ns string) {
    found := &corev1.Service{}
    err := c.Get(context.TODO(), client.ObjectKey{Namespace: ns, Name: name}, found)

We only use context.TODO() when we intend to pass in a ctx object later on. Since this most likely isn't a TODO, I would suggest using context.Background() instead.
