Adds a TTL Cache for DescribeInstanceTopology API call response #1117

Open: wants to merge 1 commit into base: master

@shvbsle shvbsle commented Mar 19, 2025

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind bug
/kind feature

What this PR does / why we need it:
cloud-provider's node-controller attempts to update node status every 5 minutes. If an instance that supports topology labels doesn't already have them, the controller keeps calling the DescribeInstanceTopology API on every update. This happens because UpdateNodeStatus in cloud-provider does not actually apply additionalLabels.

This change caches the DescribeInstanceTopology API response per instance ID and thus reduces the volume of DescribeInstanceTopology API calls. If a node is terminated, the TTL cache eventually evicts its cached response, so the cache does not grow unbounded.

Related PR:
kubernetes/kubernetes#130888

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, and cncf-cla: yes labels Mar 19, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wongma7 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested review from kishorj and kmala March 19, 2025 23:35
@k8s-ci-robot added the needs-kind and needs-triage labels Mar 19, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

@k8s-ci-robot added the needs-ok-to-test label Mar 19, 2025
@k8s-ci-robot
Contributor

Hi @shvbsle. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot added the size/M label Mar 19, 2025
@shvbsle marked this pull request as ready for review March 19, 2025 23:35
@k8s-ci-robot removed the do-not-merge/work-in-progress label Mar 19, 2025
@k8s-ci-robot requested a review from dims March 19, 2025 23:35
@@ -85,12 +102,20 @@ func NewInstanceTopologyManager(ec2 services.Ec2SdkV2, cfg *config.CloudConfig)
supportedTopologyInstanceTypePattern: supportedTopologyInstanceTypePattern,
// These should change very infrequently, if ever, so checking once a day sounds fair.
unsupportedKeyStore: cache.NewTTLStore(topStringKeyFunc, instanceTopologyManagerCacheTimeout),
// In this cache we store the response made to DescribeInstanceTopology API for an instanceID
instanceTopologyAPIResponseCache: cache.NewTTLStore(topologyCacheKeyFunc, instanceTopologyAPIResponseCacheTimeout),
Member

Won't this cause an issue with memory usage if there is some node churn in a day? Since we are storing the topology struct, can we estimate how much memory is used per instance? That should help justify whether it's okay or not. Also, is it not possible to remove entries from the cache on node deletion?

Author

Note that we will only cache for instance types that support topology labels. The instance-topology struct contains NetworkNodes, which is a list of strings. Most supported nodes have three entries in this list, so for 100k unique instances the memory footprint of the cache would come out to ~15 MB.

> Also, is it not possible to remove from cache on node deletion?

Correct me if I'm wrong, but I don't think we are listening for node-deletion events. We'd have to loop through the cache and check whether the nodes are deleted, which would require additional API calls. The TTL for a cache entry is 24 hours, so in the worst case a deleted node's cache entry would be flushed 24 hours later.
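
The ~15 MB figure in the reply above can be sanity-checked with a back-of-envelope calculation. The sketch below is only an estimate under stated assumptions, not a measurement: it assumes a 64-bit Go runtime, IDs of roughly 20 bytes each, and that a cached entry holds just an instance ID plus the NetworkNodes list. Cache bookkeeping (timestamps, map overhead) and any other struct fields are ignored.

```go
package main

import "fmt"

// All sizes are assumptions for a 64-bit Go runtime, not measurements.
const (
	stringHeaderBytes = 16 // pointer + length
	sliceHeaderBytes  = 24 // pointer + length + capacity
	idBytes           = 20 // assumed length of an instance ID or network node ID
	networkNodes      = 3  // per the reply: most supported nodes have three entries
)

// perInstanceBytes estimates the payload of one cached entry:
// an instance ID string plus a slice of network node ID strings.
func perInstanceBytes() int {
	instanceID := stringHeaderBytes + idBytes
	nodes := sliceHeaderBytes + networkNodes*(stringHeaderBytes+idBytes)
	return instanceID + nodes
}

func main() {
	const instances = 100_000
	total := perInstanceBytes() * instances
	// prints "168 bytes per instance, 16.8 MB for 100000 instances"
	fmt.Printf("%d bytes per instance, %.1f MB for %d instances\n",
		perInstanceBytes(), float64(total)/1e6, instances)
}
```

Under these assumptions the payload is on the order of 17 MB for 100k instances, the same order of magnitude as the estimate given in the reply.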

-const instanceTopologyManagerCacheTimeout = 24 * time.Hour
+const (
+	instanceTopologyManagerCacheTimeout     = 24 * time.Hour
+	instanceTopologyAPIResponseCacheTimeout = 24 * time.Hour
Contributor

  1. With a 24 * time.Hour cache timeout, the cache entries for all existing nodes will expire at about the same time. Can we have a per-record timeout with jitter to smooth out the call pattern?
  2. Can we add unit tests that validate the lifecycle of items in the cache? For example: a cache miss leads to a nodeTopology call and saves the response, cache expiry works, etc.

@cartermckinnon
Contributor

/hold

I don’t think this is the right solution to this issue. Let’s chat about it.

@k8s-ci-robot added the do-not-merge/hold label Mar 20, 2025
@shvbsle requested a review from kmala March 27, 2025 17:42
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels.)
needs-kind (Indicates a PR lacks a `kind/foo` label and requires one.)
needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
5 participants