Adds a TTL Cache for DescribeInstanceTopology API call response #1117

shvbsle · 2025-03-19T23:34:53Z

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind bug
/kind feature

What this PR does / why we need it:
cloud-provider's node-controller attempts to update node status every 5 minutes. If an instance that supports topology labels doesn't already have them, the controller calls the DescribeInstanceTopology API again on every update. This happens because UpdateNodeStatus in cloud-provider does not actually apply additionalLabels.

This change caches the DescribeInstanceTopology API response per instance ID, which reduces the volume of DescribeInstanceTopology API calls. If a node is terminated, the TTL cache eventually evicts its cached response, so the cache does not grow unbounded.

Related PR:
kubernetes/kubernetes#130888

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

k8s-ci-robot · 2025-03-19T23:34:56Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-03-19T23:35:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wongma7 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-03-19T23:35:02Z

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

k8s-ci-robot · 2025-03-19T23:35:04Z

Hi @shvbsle. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

kmala · 2025-03-20T16:58:35Z

pkg/resourcemanagers/topology.go

@@ -85,12 +102,20 @@ func NewInstanceTopologyManager(ec2 services.Ec2SdkV2, cfg *config.CloudConfig)
 		supportedTopologyInstanceTypePattern: supportedTopologyInstanceTypePattern,
 		// These should change very infrequently, if ever, so checking once a day sounds fair.
 		unsupportedKeyStore: cache.NewTTLStore(topStringKeyFunc, instanceTopologyManagerCacheTimeout),
+		// In this cache we store the response made to DescribeInstanceTopology API for an instanceID
+		instanceTopologyAPIResponseCache: cache.NewTTLStore(topologyCacheKeyFunc, instanceTopologyAPIResponseCacheTimeout),


Won't this cause a memory usage issue if there is some node churn within a day? Since we are storing the topology struct, can we estimate how much memory is used per instance? That should help justify whether it's okay. Also, is it not possible to remove entries from the cache on node deletion?

Note that we only cache for instance types that support topology labels. The instance topology struct contains NetworkNodes, which is a list of strings. Most supported nodes have three entries in this list, so for 100k unique instances the memory footprint of the cache comes out to ~15 MB.

Also, is it not possible to remove from cache on node deletion?

Correct me if I'm wrong, but I don't think we are listening for node-deletion events. We'd have to loop through the cache and check whether the nodes are deleted, which would require additional API calls. The TTL for a cache entry is 24 hours, so in the worst case a deleted node's cache entry would be flushed 24 hours later.
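As a rough sanity check on the ~15 MB figure above, the per-entry cost can be approximated from string and slice sizes. This is a back-of-envelope sketch, not the calculation used in the reply: the ID lengths (19 bytes for an instance ID, 20 bytes for a network node ID) are assumptions, and it ignores cache bookkeeping, timestamps, and any other fields on the struct.

```go
package main

import "fmt"

const (
	stringHeaderBytes = 16 // pointer + length on a 64-bit platform
	sliceHeaderBytes  = 24 // pointer + length + capacity on a 64-bit platform
)

// estimateEntryBytes approximates the memory held by one cached entry:
// the instance-ID key plus a slice of network node ID strings.
// All lengths are caller-supplied assumptions.
func estimateEntryBytes(instanceIDLen, networkNodes, networkNodeIDLen int) int {
	key := stringHeaderBytes + instanceIDLen
	nodes := sliceHeaderBytes + networkNodes*(stringHeaderBytes+networkNodeIDLen)
	return key + nodes
}

func main() {
	// Assumed: 19-byte instance ID, three network nodes, 20-byte node IDs.
	perEntry := estimateEntryBytes(19, 3, 20)
	total := perEntry * 100_000
	fmt.Printf("per entry: %d bytes, 100k entries: %.1f MB\n", perEntry, float64(total)/1e6)
}
```

Under these assumptions the estimate lands in the same range as the figure quoted above, before accounting for map and cache overhead.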

shiv-amz · 2025-03-20T21:02:28Z

pkg/resourcemanagers/topology.go

-const instanceTopologyManagerCacheTimeout = 24 * time.Hour
+const (
+	instanceTopologyManagerCacheTimeout     = 24 * time.Hour
+	instanceTopologyAPIResponseCacheTimeout = 24 * time.Hour


The 24 * time.Hour cache timeout will cause the cache entries for all existing nodes to expire at the same time. Can we have a per-record timeout with jitter to smooth out the call pattern?

Can we add unit tests that validate the lifecycle of items in the cache? For example: a cache miss leads to a nodeTopology call and the response being saved, cache expiry works, etc.

cartermckinnon · 2025-03-20T21:12:17Z

/hold

I don’t think this is the right solution to this issue, let’s chat about it

Adds a TTL Cache for DescribeInstanceTopology API call response

4f9dc21

k8s-ci-robot requested review from kishorj and kmala March 19, 2025 23:35

k8s-ci-robot added the needs-kind (Indicates a PR lacks a `kind/foo` label and requires one.) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Mar 19, 2025

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 19, 2025

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 19, 2025

shvbsle marked this pull request as ready for review March 19, 2025 23:35

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2025

k8s-ci-robot requested a review from dims March 19, 2025 23:35

kmala reviewed Mar 20, 2025

shiv-amz reviewed Mar 20, 2025

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 20, 2025

shvbsle requested a review from kmala March 27, 2025 17:42
