Adds a TTL Cache for DescribeInstanceTopology API call response #1117
base: master
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
This issue is currently awaiting triage. If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @shvbsle. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```diff
@@ -85,12 +102,20 @@ func NewInstanceTopologyManager(ec2 services.Ec2SdkV2, cfg *config.CloudConfig)
 		supportedTopologyInstanceTypePattern: supportedTopologyInstanceTypePattern,
 		// These should change very infrequently, if ever, so checking once a day sounds fair.
 		unsupportedKeyStore: cache.NewTTLStore(topStringKeyFunc, instanceTopologyManagerCacheTimeout),
+		// In this cache we store the response made to DescribeInstanceTopology API for an instanceID
+		instanceTopologyAPIResponseCache: cache.NewTTLStore(topologyCacheKeyFunc, instanceTopologyAPIResponseCacheTimeout),
```
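For context, here is a minimal sketch of how a client-go TTL store keyed by instance ID can be used. The `cachedTopology` type and the key-function body are illustrative assumptions, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// cachedTopology is an illustrative stand-in for the cached
// DescribeInstanceTopology response (assumed shape, not the PR's type).
type cachedTopology struct {
	InstanceID   string
	NetworkNodes []string
}

// topologyCacheKeyFunc keys cache entries by instance ID (assumed behavior).
func topologyCacheKeyFunc(obj interface{}) (string, error) {
	ct, ok := obj.(*cachedTopology)
	if !ok {
		return "", fmt.Errorf("unexpected object type %T", obj)
	}
	return ct.InstanceID, nil
}

func main() {
	// Entries lazily expire 24h after being added; expired entries are
	// dropped on read, so the cache cannot grow unbounded.
	store := cache.NewTTLStore(topologyCacheKeyFunc, 24*time.Hour)

	_ = store.Add(&cachedTopology{
		InstanceID:   "i-0123456789abcdef0",
		NetworkNodes: []string{"nn-1", "nn-2", "nn-3"},
	})

	if item, exists, _ := store.GetByKey("i-0123456789abcdef0"); exists {
		fmt.Println(item.(*cachedTopology).NetworkNodes)
	}
}
```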
Won't this cause memory-usage issues if there is some churn of nodes within a day? Since we are storing the topology struct, can we estimate how much memory will be used per instance? That should help justify whether it's okay or not. Also, is it not possible to remove entries from the cache on node deletion?
Note that we will only cache for instance types that support topology labels. The instance-topology struct contains NetworkNodes, which is a list of strings. Most supported nodes have three entries in this list, so for 100k unique instances the memory footprint of caching comes out to ~15 MB (a back-of-envelope version of this estimate follows below).

> Also, is it not possible to remove from cache on node deletion?

Correct me if I'm wrong, but I don't think we are listening for node-deletion events. We'd have to loop through the cache and check whether the nodes were deleted, which would require additional API calls. The TTL for a cache entry is 24 hours, so in the worst case a deleted node's cache entry would be flushed 24 hours after it was added.
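For reference, a back-of-envelope version of that estimate; all per-entry byte counts are rough assumptions for illustration, not measured values:

```go
package main

import "fmt"

func main() {
	// Rough sizing of the cache at 100k entries. Every per-entry size here
	// is an assumption, not a measurement.
	const (
		instances      = 100_000
		nodesPerEntry  = 3  // typical NetworkNodes length for supported types
		bytesPerString = 40 // one network-node ID plus Go string header
		overhead       = 30 // instance-ID key, slice header, cache bookkeeping
	)
	total := instances * (nodesPerEntry*bytesPerString + overhead)
	fmt.Printf("~%d bytes (~%.0f MB)\n", total, float64(total)/1e6) // ~15 MB
}
```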
```diff
-const instanceTopologyManagerCacheTimeout = 24 * time.Hour
+const (
+	instanceTopologyManagerCacheTimeout     = 24 * time.Hour
+	instanceTopologyAPIResponseCacheTimeout = 24 * time.Hour
+)
```
- The 24 * time.Hour cache timeout will cause the cached entries for all existing nodes to expire at the same time. Can we have a per-record timeout with jitter, which would smooth the call pattern? (One possible shape is sketched after this list.)
- Can we add unit tests that validate the lifecycle of items in the cache? E.g. a cache miss leads to a DescribeInstanceTopology call and the response being saved, cache expiry works, etc.
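One possible shape for the jitter suggestion, sketched with client-go's expiration cache. This assumes a client-go version where `cache.NewExpirationStore`, `cache.ExpirationPolicy`, and `cache.TimestampedEntry` are exported, reuses `topologyCacheKeyFunc` from the earlier sketch, and is not code from this PR:

```go
package main

import (
	"math/rand"
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/utils/clock"
)

// jitteredTTLPolicy expires each entry after baseTTL plus a per-entry random
// offset of up to maxJitter, so entries added together do not all expire
// (and trigger refresh calls) at the same instant.
type jitteredTTLPolicy struct {
	baseTTL   time.Duration
	maxJitter time.Duration
	clock     clock.Clock
}

func (p *jitteredTTLPolicy) IsExpired(entry *cache.TimestampedEntry) bool {
	// Derive the jitter deterministically from the entry's insertion time so
	// the deadline is stable across repeated expiry checks.
	r := rand.New(rand.NewSource(entry.Timestamp.UnixNano()))
	jitter := time.Duration(r.Int63n(int64(p.maxJitter)))
	return p.clock.Since(entry.Timestamp) > p.baseTTL+jitter
}

func main() {
	// Drop-in replacement for cache.NewTTLStore with a jittered deadline.
	_ = cache.NewExpirationStore(topologyCacheKeyFunc, &jitteredTTLPolicy{
		baseTTL:   24 * time.Hour,
		maxJitter: 2 * time.Hour,
		clock:     clock.RealClock{},
	})
}
```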
/hold I don't think this is the right solution to this issue; let's chat about it.
What type of PR is this?
What this PR does / why we need it:
cloud-provider's node-controller attempts to update node status every 5 minutes. If an instance that supports topology labels doesn't already have them, the controller keeps calling the DescribeInstanceTopology API repeatedly. This happens because `UpdateNodeStatus` in cloud-provider does not actually apply `additionalLabels`. This change caches the DescribeInstanceTopology API response per instance ID, which reduces the volume of DescribeInstanceTopology API calls. If a node is terminated, the TTL cache eventually evicts the cached response, ensuring the cache does not grow unbounded. (A sketch of the resulting read-through lookup is below.)
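To make the call-volume reduction concrete, here is a minimal sketch of the read-through lookup this enables, reusing the `cachedTopology` type from the sketch above. The manager shape, field names, and error handling are assumptions for illustration, not the PR's actual code:

```go
import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"k8s.io/client-go/tools/cache"
)

// ec2API is the slice of the AWS SDK v2 EC2 client this sketch needs.
type ec2API interface {
	DescribeInstanceTopology(ctx context.Context, in *ec2.DescribeInstanceTopologyInput,
		optFns ...func(*ec2.Options)) (*ec2.DescribeInstanceTopologyOutput, error)
}

type instanceTopologyManager struct {
	ec2                              ec2API
	instanceTopologyAPIResponseCache cache.Store // TTL store keyed by instance ID
}

// getInstanceTopology serves from the cache when possible and only falls
// back to the DescribeInstanceTopology API on a miss, caching the response.
func (m *instanceTopologyManager) getInstanceTopology(ctx context.Context, instanceID string) (*cachedTopology, error) {
	if item, exists, err := m.instanceTopologyAPIResponseCache.GetByKey(instanceID); err == nil && exists {
		return item.(*cachedTopology), nil // cache hit: no API call made
	}
	out, err := m.ec2.DescribeInstanceTopology(ctx, &ec2.DescribeInstanceTopologyInput{
		InstanceIds: []string{instanceID},
	})
	if err != nil {
		return nil, err
	}
	if len(out.Instances) == 0 {
		return nil, fmt.Errorf("no topology returned for %s", instanceID)
	}
	ct := &cachedTopology{InstanceID: instanceID, NetworkNodes: out.Instances[0].NetworkNodes}
	_ = m.instanceTopologyAPIResponseCache.Add(ct) // entry expires after the store's TTL
	return ct, nil
}
```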
Related PR:
kubernetes/kubernetes#130888
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: