feat: add tagging controller delays and work queue size metrics #1116

shvbsle · 2025-03-14T20:38:23Z

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind feature
/kind flake

What this PR does / why we need it:
Adds a tagging controller delay metric. It measures the delay between node creation and tagging of the EC2 instance. The metric should show up like so:

# HELP tagging_controller_node_tagging_delay_seconds [ALPHA] Number of seconds after node creation when TaggingController successfully tagged or untagged the node resources.
# TYPE tagging_controller_node_tagging_delay_seconds histogram
tagging_controller_node_tagging_delay_seconds_bucket{le="1"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="4"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="16"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="64"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="256"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="1024"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="+Inf"} 0
tagging_controller_node_tagging_delay_seconds_sum 0
tagging_controller_node_tagging_delay_seconds_count 0
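
For illustration only: the bucket bounds above (1, 4, 16, 64, 256, 1024) are consistent with an exponential series starting at 1 with factor 4 and 6 buckets. The PR's actual bucket definition is not shown in this thread, so the sketch below is an assumption; `exponentialBuckets` and `cumulativeCounts` are hypothetical stdlib-only helpers that show how observations would land in cumulative `le` buckets.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step. Hypothetical helper, not the PR's code.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	v := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, v)
		v *= factor
	}
	return buckets
}

// cumulativeCounts returns one count per bound (observations <= bound),
// followed by a final entry for the +Inf bucket, which counts everything.
func cumulativeCounts(bounds, observations []float64) []int {
	counts := make([]int, len(bounds)+1)
	for _, o := range observations {
		for i, b := range bounds {
			if o <= b {
				counts[i]++
			}
		}
		counts[len(bounds)]++
	}
	return counts
}

func main() {
	bounds := exponentialBuckets(1, 4, 6)
	// Made-up delays in seconds, purely for demonstration.
	delays := []float64{0.5, 3, 20, 2000}
	counts := cumulativeCounts(bounds, delays)
	for i, b := range bounds {
		fmt.Printf("bucket{le=\"%g\"} %d\n", b, counts[i])
	}
	fmt.Printf("bucket{le=\"+Inf\"} %d\n", counts[len(bounds)])
}
```

Any delay above 1024 seconds is only counted in the `+Inf` bucket, which matters for the re-tag discussion further down the thread.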

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

k8s-ci-robot · 2025-03-14T20:38:26Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-03-14T20:38:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign olemarkus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-03-14T20:38:32Z

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


k8s-ci-robot · 2025-03-14T20:38:33Z

Hi @shvbsle. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


cartermckinnon · 2025-03-14T23:22:52Z

pkg/controllers/tagging/metrics.go

+			StabilityLevel: metrics.ALPHA,
+		},
+	)
+	workQueueSize = metrics.NewGauge(


I think we should already have an equivalent metric from the workqueue itself:

cloud-provider-aws/pkg/controllers/tagging/tagging_controller.go

Line 36 in 9358055

_ "k8s.io/component-base/metrics/prometheus/workqueue" // enable prometheus provider for workqueue metrics

they're defined here: https://github.com/kubernetes/client-go/blob/master/util/workqueue/metrics.go

and the metrics should be emitted with the Name of the workqueue (probably a prefix?):

cloud-provider-aws/pkg/controllers/tagging/tagging_controller.go

Line 147 in 9358055

Name: TaggingControllerClientName,

Confirmed that we already have an equivalent metric for work queue size. It shows up with the name label set to tagging-controller:

workqueue_depth{name="tagging-controller"} 0

Will remove workQueueSize

cartermckinnon · 2025-03-14T23:30:10Z

pkg/controllers/tagging/tagging_controller.go

@@ -269,6 +274,7 @@ func (tc *Controller) process() bool {
 		}

 		tc.workqueue.Forget(obj)
+		nodeTaggingDelay.Observe(time.Since(currentNode.CreationTimestamp.Time).Seconds())


I think we want this within tagEc2Resource, around:

cloud-provider-aws/pkg/controllers/tagging/tagging_controller.go

Line 343 in 9358055

klog.Infof("Successfully labeled node %s with %v.", node.GetName(), labels)

This is because we only want to record it when we're tagging (not un-tagging), and only once tags have actually been applied.
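
A minimal sketch of that placement, under stated assumptions: `tagNode`, `node`, `recorder`, `errTagFailed`, and `nodeCreatedSecondsAgo` are hypothetical stand-ins (not the controller's real types), and the only point shown is that the observation happens after the tagging call has succeeded and never on the error path.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// delayObserver is a stand-in for a histogram's Observe method.
type delayObserver interface {
	Observe(seconds float64)
}

// recorder collects observations in memory so the sketch is self-contained.
type recorder struct{ samples []float64 }

func (r *recorder) Observe(s float64) { r.samples = append(r.samples, s) }

// node is a hypothetical, reduced view of a Kubernetes node.
type node struct {
	name    string
	created time.Time
}

// nodeCreatedSecondsAgo builds a node whose creation time lies in the past.
func nodeCreatedSecondsAgo(name string, seconds int) node {
	return node{name: name, created: time.Now().Add(-time.Duration(seconds) * time.Second)}
}

var errTagFailed = errors.New("tagging failed")

// tagNode applies tags via applyTags and records the delay since node
// creation only when tagging succeeded.
func tagNode(n node, applyTags func() error, obs delayObserver) error {
	if err := applyTags(); err != nil {
		return err // nothing is recorded for failed attempts
	}
	obs.Observe(time.Since(n.created).Seconds())
	return nil
}

func main() {
	rec := &recorder{}
	n := nodeCreatedSecondsAgo("node-a", 30)

	_ = tagNode(n, func() error { return errTagFailed }, rec)
	fmt.Println("samples after failure:", len(rec.samples))

	_ = tagNode(n, func() error { return nil }, rec)
	fmt.Println("samples after success:", len(rec.samples))
}
```

Because the untagging path never calls `tagNode` in this sketch, it contributes no samples either.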


kmala · 2025-03-20T07:19:52Z

pkg/controllers/tagging/tagging_controller.go

@@ -342,6 +342,7 @@ func (tc *Controller) tagEc2Instance(node *v1.Node) error {

 	klog.Infof("Successfully labeled node %s with %v.", node.GetName(), labels)

+	nodeTaggingDelay.Observe(time.Since(node.CreationTimestamp.Time).Seconds())


Why node creation time? There can also be an update to the tags, which would cause a retag, right?
If we only want to know the latency for the current iteration, we already have the work queue metrics.

The intent is to observe tagging delays, especially during node startup and in clusters with a large number of nodes. Since the emitted metric is a histogram, a re-tag event would fall in the +Inf bucket. The histogram would still give us a reliable p90 for delays during node startup.
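
To make the p90 argument concrete, here is a rough sketch with made-up counts. `histogramQuantile` is a hypothetical helper that estimates a quantile from cumulative bucket counts by linear interpolation, in the spirit of Prometheus's `histogram_quantile` rather than a copy of it. It assumes that re-tag events are rare and happen on nodes older than the largest finite bucket.

```go
package main

import (
	"fmt"
	"math"
)

// histogramQuantile estimates quantile q (0 < q <= 1) from cumulative counts.
// cumulative has one entry per bound plus a final entry for the +Inf bucket.
func histogramQuantile(q float64, bounds, cumulative []float64) float64 {
	total := cumulative[len(cumulative)-1]
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, c := range cumulative {
		if c < rank {
			continue
		}
		if i == len(bounds) {
			// The quantile falls in +Inf: report the highest finite bound.
			return bounds[len(bounds)-1]
		}
		lower, prev := 0.0, 0.0
		if i > 0 {
			lower, prev = bounds[i-1], cumulative[i-1]
		}
		if c == prev {
			return bounds[i]
		}
		// Interpolate linearly inside the bucket that contains the rank.
		return lower + (bounds[i]-lower)*(rank-prev)/(c-prev)
	}
	return math.NaN()
}

func main() {
	bounds := []float64{1, 4, 16, 64, 256, 1024}

	// 100 startup tagging events (invented numbers).
	startupOnly := []float64{10, 60, 95, 100, 100, 100, 100}
	// The same events plus 5 re-tags on old nodes, counted only in +Inf.
	withRetags := []float64{10, 60, 95, 100, 100, 100, 105}

	fmt.Printf("p90 startup only: %.2f\n", histogramQuantile(0.9, bounds, startupOnly))
	fmt.Printf("p90 with re-tags: %.2f\n", histogramQuantile(0.9, bounds, withRetags))
}
```

With these numbers both estimates stay inside the same 4 to 16 second bucket. If re-tags made up more than about 10% of all observations, the p90 would drift into the upper buckets, so the argument depends on re-tags being infrequent.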

How would this be different from the work queue metrics? The only additional thing this might add is the time it takes to add the item to the workqueue, which should be mostly quick/immediate. Otherwise this is mostly the sum of the work queue latency and work duration metrics, right?
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/metrics/prometheus/workqueue/metrics.go#L55-L63

Currently we are only interested in tagging delays. IIUC, workqueue_queue_duration_seconds_bucket does not provide a way to distinguish between a tagging, an untagging or an error event, which makes it an unreliable proxy for what we want to measure.

feat: add tagging controller delays and work queue size metrics

3a1825f

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 14, 2025

k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Mar 14, 2025

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 14, 2025

k8s-ci-robot requested review from cartermckinnon and olemarkus March 14, 2025 20:38

k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 14, 2025

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 14, 2025

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 14, 2025

clean up

ab2a55b

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 14, 2025

cartermckinnon reviewed Mar 14, 2025


shvbsle added 2 commits March 20, 2025 05:27

Removed redundante work queue size metric and moved the measurement o…

1397aa6

…f tagging delay inside tagEc2Instance

added back log lines

72f918a

shvbsle marked this pull request as ready for review March 20, 2025 05:29

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 20, 2025

k8s-ci-robot requested review from hakman and kmala March 20, 2025 05:29

kmala reviewed Mar 20, 2025


shvbsle requested a review from kmala March 27, 2025 17:42
