feat: add tagging controller delays and work queue size metrics #1116

Open · wants to merge 4 commits into base: master
15 changes: 15 additions & 0 deletions pkg/controllers/tagging/metrics.go
@@ -20,6 +20,10 @@ import (
 	"k8s.io/component-base/metrics/legacyregistry"
 )
 
+const (
+	metricsSubsystem = "tagging_controller"
+)
+
 var register sync.Once
 
 var (
@@ -30,12 +34,23 @@ var (
 			StabilityLevel: metrics.ALPHA,
 		},
 		[]string{"error_type", "instance_id"})
+
+	nodeTaggingDelay = metrics.NewHistogram(
+		&metrics.HistogramOpts{
+			Subsystem:      metricsSubsystem,
+			Name:           "node_tagging_delay_seconds",
+			Help:           "Number of seconds after node creation when TaggingController successfully tagged or untagged the node resources.",
+			Buckets:        metrics.ExponentialBuckets(1, 4, 6), // 1s -> ~17m
+			StabilityLevel: metrics.ALPHA,
+		},
+	)
 )
 
 // registerMetrics registers tagging-controller metrics.
 func registerMetrics() {
 	register.Do(func() {
 		legacyregistry.MustRegister(workItemError)
+		legacyregistry.MustRegister(nodeTaggingDelay)
 	})
 }
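
For reference, the bucket boundaries produced by the call above can be checked with a short standalone sketch (illustrative only, not part of this PR):

package main

import (
	"fmt"

	"k8s.io/component-base/metrics"
)

func main() {
	// Prints [1 4 16 64 256 1024]: six bucket boundaries, each 4x the
	// previous one, starting at 1s and topping out at 1024s (~17 minutes),
	// which matches the "1s -> ~17m" comment in the diff.
	fmt.Println(metrics.ExponentialBuckets(1, 4, 6))
}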

1 change: 1 addition & 0 deletions pkg/controllers/tagging/tagging_controller.go
@@ -342,6 +342,7 @@ func (tc *Controller) tagEc2Instance(node *v1.Node) error {
 
 	klog.Infof("Successfully labeled node %s with %v.", node.GetName(), labels)
 
+	nodeTaggingDelay.Observe(time.Since(node.CreationTimestamp.Time).Seconds())
Member:
Why node creation time? There can also be an update to the tags, which would cause a re-tag, right?
If we only want to know about the current iteration, we already have the work queue metrics.

Author:
The intent is to observe tagging delays, especially during node startup and in clusters with a large number of nodes. Since the emitted metric is a histogram, a re-tag event would fall into the +Inf bucket, so the histogram would still give us a reliable p90 for observing delays during node startup.
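
To illustrate the point about the +Inf bucket, here is a minimal sketch (the 90-second and one-week delays are assumed values for illustration, not figures from this PR):

package main

import (
	"fmt"
	"time"
)

func main() {
	const topBucket = 1024.0 // largest explicit bucket boundary, in seconds

	// A node tagged shortly after creation lands inside the explicit buckets.
	startupDelay := (90 * time.Second).Seconds()
	// A re-tag of a week-old node exceeds every explicit bucket, so it is
	// counted only toward +Inf (and the histogram's sum/count).
	retagDelay := (7 * 24 * time.Hour).Seconds()

	fmt.Println(startupDelay <= topBucket) // true
	fmt.Println(retagDelay <= topBucket)   // false
}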

kmala (Member) commented on Mar 20, 2025:
How would this be different from the work queue metrics? The only additional thing this might capture is the time it takes to add the item to the workqueue, which should be mostly quick/immediate. Otherwise this is mostly the sum of the work queue latency and work duration metrics, right?
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/metrics/prometheus/workqueue/metrics.go#L55-L63

Author:

Currently we are only interested in tagging delays. IIUC, workqueue_queue_duration_seconds_bucket does not provide a way to distinguish between a tagging, an untagging, or an error event, which makes it an unreliable proxy for what we want to measure.
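
For context, workqueue metrics are labeled only by the queue's name when the Prometheus workqueue provider (linked above) is registered; a sketch of why the event types collapse into one series (the queue name here is an assumption, not taken from this repo):

package main

import (
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Every workqueue metric (depth, adds_total, queue_duration_seconds,
	// work_duration_seconds, ...) carries just this queue name as its label,
	// so tag, untag, and error items are indistinguishable in those series.
	queue := workqueue.NewNamedRateLimitingQueue(
		workqueue.DefaultControllerRateLimiter(),
		"tagging-controller", // assumed name, for illustration only
	)
	defer queue.ShutDown()

	queue.Add("node-a") // all item kinds share the same metric series
}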

 	return nil
 }
