feat: add tagging controller delays and work queue size metrics #1116

Open
wants to merge 4 commits into
base: master
from

Conversation

shvbsle

@shvbsle shvbsle commented Mar 14, 2025

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind feature
/kind flake

What this PR does / why we need it:
Adds a tagging controller delay metric that measures the delay between node creation and tagging of the EC2 instance. The metric should show up like so:

# HELP tagging_controller_node_tagging_delay_seconds [ALPHA] Number of seconds after node creation when TaggingController successfully tagged or untagged the node resources.
# TYPE tagging_controller_node_tagging_delay_seconds histogram
tagging_controller_node_tagging_delay_seconds_bucket{le="1"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="4"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="16"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="64"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="256"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="1024"} 0
tagging_controller_node_tagging_delay_seconds_bucket{le="+Inf"} 0
tagging_controller_node_tagging_delay_seconds_sum 0
tagging_controller_node_tagging_delay_seconds_count 0
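
The bucket bounds in the output above (1, 4, 16, 64, 256, 1024) form an exponential series with start 1 and factor 4. The following is a minimal, stdlib-only sketch of how such bounds and the cumulative `le` counts relate. It does not use the component-base metrics API; the helper names and the sample delays are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step. Hypothetical helper for illustration.
func exponentialBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, 0, count)
	for v, i := start, 0; i < count; i, v = i+1, v*factor {
		bounds = append(bounds, v)
	}
	return bounds
}

// cumulativeCounts returns one counter per bound (observations <= bound)
// plus a final counter for the +Inf bucket, which counts every observation.
func cumulativeCounts(bounds, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, o := range observations {
		for i, b := range bounds {
			if o <= b {
				counts[i]++
			}
		}
		counts[len(bounds)]++
	}
	return counts
}

func main() {
	bounds := exponentialBuckets(1, 4, 6)
	// Hypothetical tagging delays in seconds.
	delays := []float64{0.5, 3, 20, 2000}
	counts := cumulativeCounts(bounds, delays)
	for i, b := range bounds {
		fmt.Printf("le=%g %d\n", b, counts[i])
	}
	fmt.Printf("le=+Inf %d\n", counts[len(bounds)])
}
```

Because the buckets are cumulative, an observation larger than 1024 seconds is counted only in the `+Inf` bucket.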

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 14, 2025
@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Mar 14, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign olemarkus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 14, 2025
@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 14, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 14, 2025
@k8s-ci-robot
Contributor

Hi @shvbsle. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 14, 2025
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 14, 2025
StabilityLevel: metrics.ALPHA,
},
)
workQueueSize = metrics.NewGauge(
Contributor

I think we should already have an equivalent metric from the workqueue itself:

_ "k8s.io/component-base/metrics/prometheus/workqueue" // enable prometheus provider for workqueue metrics

they're defined here: https://github.com/kubernetes/client-go/blob/master/util/workqueue/metrics.go

and the metrics should be emitted with the Name of the workqueue (probably a prefix?):

Name: TaggingControllerClientName,

Author

Confirmed that we already have an equivalent metric for work queue size. It shows up with the label name="tagging-controller":

workqueue_depth{name="tagging-controller"} 0

Will remove workQueueSize

@@ -269,6 +274,7 @@ func (tc *Controller) process() bool {
}

tc.workqueue.Forget(obj)
nodeTaggingDelay.Observe(time.Since(currentNode.CreationTimestamp.Time).Seconds())
Contributor

I think we want this within tagEc2Instance, around:

klog.Infof("Successfully labeled node %s with %v.", node.GetName(), labels)

We only want to record it when we're tagging (not untagging), and only when the tags are actually applied.
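
A minimal sketch of that placement, assuming a simplified tagging path: the observation is made only after the tag call succeeds, and the untag path never observes. The `delayRecorder` type, `tagNode`, `untagNode`, and `runScenario` are hypothetical stand-ins for illustration, not the controller's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// delayRecorder is a hypothetical stand-in for the histogram's Observe method.
type delayRecorder struct {
	observed []float64
}

func (r *delayRecorder) Observe(seconds float64) {
	r.observed = append(r.observed, seconds)
}

// tagNode records the delay since node creation only after applyTags
// (a stand-in for the EC2 tagging call) returns without error.
func tagNode(created, now time.Time, applyTags func() error, rec *delayRecorder) error {
	if err := applyTags(); err != nil {
		return err // failed attempts are not observed
	}
	rec.Observe(now.Sub(created).Seconds())
	return nil
}

// untagNode deliberately takes no recorder: untagging is never observed.
func untagNode(removeTags func() error) error {
	return removeTags()
}

// runScenario simulates one failed tag attempt, one successful tag
// attempt 5 seconds after creation, and one untag.
func runScenario() []float64 {
	rec := &delayRecorder{}
	created := time.Unix(0, 0)

	_ = tagNode(created, created.Add(2*time.Second), func() error {
		return errors.New("simulated tagging failure")
	}, rec)
	_ = tagNode(created, created.Add(5*time.Second), func() error { return nil }, rec)
	_ = untagNode(func() error { return nil })

	return rec.observed
}

func main() {
	fmt.Println(runScenario())
}
```

In this sketch only the successful tag attempt contributes an observation, so errors and untag events do not skew the histogram.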

@shvbsle shvbsle marked this pull request as ready for review March 20, 2025 05:29
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 20, 2025
@k8s-ci-robot k8s-ci-robot requested review from hakman and kmala March 20, 2025 05:29
@@ -342,6 +342,7 @@ func (tc *Controller) tagEc2Instance(node *v1.Node) error {

klog.Infof("Successfully labeled node %s with %v.", node.GetName(), labels)

nodeTaggingDelay.Observe(time.Since(node.CreationTimestamp.Time).Seconds())
Member

Why node creation time? An update to the tags could also cause a retag, right?
If we only want the latency of the current iteration, we already have the work queue metrics.

Author

The intent is to observe tagging delays, especially during node startup and in clusters with a large number of nodes. Since the emitted metric is a histogram, a re-tag event would fall in the +Inf bucket. The histogram would still let us derive a reliable p90 for delays during node startup.
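
To make the p90 argument concrete, here is a stdlib-only sketch that estimates a quantile from cumulative bucket counts by linear interpolation, in the style of Prometheus's `histogram_quantile`. The bucket counts are hypothetical (95 startup taggings in finite buckets, 5 late re-tags in `+Inf`), and `estimateQuantile` and `exampleQuantiles` are illustrative helpers, not part of this PR.

```go
package main

import (
	"fmt"
	"math"
)

// estimateQuantile interpolates linearly inside the first finite bucket whose
// cumulative count reaches the target rank. cumulative has one entry per
// bound plus a final +Inf entry holding the total count. If the rank falls in
// the +Inf bucket, the largest finite bound is returned.
func estimateQuantile(q float64, bounds []float64, cumulative []uint64) float64 {
	total := float64(cumulative[len(cumulative)-1])
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, upper := range bounds {
		count := float64(cumulative[i])
		if count < rank {
			continue
		}
		lower, prev := 0.0, 0.0
		if i > 0 {
			lower, prev = bounds[i-1], float64(cumulative[i-1])
		}
		if count == prev {
			return upper // empty bucket, only reachable when rank is 0
		}
		return lower + (upper-lower)*(rank-prev)/(count-prev)
	}
	return bounds[len(bounds)-1]
}

// exampleQuantiles returns p90 and p99 for a hypothetical distribution.
func exampleQuantiles() (p90, p99 float64) {
	bounds := []float64{1, 4, 16, 64, 256, 1024}
	// Hypothetical cumulative counts; the last entry is the +Inf bucket.
	cumulative := []uint64{10, 50, 80, 95, 95, 95, 100}
	return estimateQuantile(0.90, bounds, cumulative),
		estimateQuantile(0.99, bounds, cumulative)
}

func main() {
	p90, p99 := exampleQuantiles()
	fmt.Printf("p90=%g p99=%g\n", p90, p99)
}
```

With these made-up numbers the p90 estimate stays inside the finite buckets, while p99 falls in `+Inf` and is clamped to 1024. The p90 is only unaffected as long as re-tag observations make up less than 10% of the total.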

Member

@kmala kmala Mar 20, 2025

How would this differ from the work queue metrics? The only additional thing it might add is the time it takes to add the item to the workqueue, which should be mostly quick/immediate. Otherwise this is mostly the sum of the work queue latency and work duration metrics, right?
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/metrics/prometheus/workqueue/metrics.go#L55-L63

Author

Currently we are only interested in tagging delays. IIUC, workqueue_queue_duration_seconds_bucket does not provide a way to distinguish between a tagging, an untagging or an error event, which makes it an unreliable proxy for what we want to measure.

@shvbsle shvbsle requested a review from kmala March 27, 2025 17:42
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
4 participants