
feat(watcher): introduce ListWatchParallel mode to handle large-scale resources #1748


Open
wants to merge 1 commit into main

Conversation

xiaohuanshu

Motivation

When monitoring large-scale Kubernetes resources, the traditional ListWatch pattern requires waiting for the full list to complete before starting watch operations. This creates a significant risk of the resource_version expiring during long list operations, resulting in 410 Gone errors that force a full re-list. The problem intensifies in clusters with 10,000+ resources, where the initial list operation may take minutes.

Solution

Introduces a new ListWatchParallel strategy that:

- Initiates the watch immediately after the first list pagination
- Processes subsequent list paginations and watch events in parallel
- Guarantees ordered delivery of list data before cached watch events
- Maintains version consistency using the first pagination's resource version

- Added an InitialListStrategy::ListWatchParallel variant
- Extended the state machine with WatchedInitPage (parallel processing) and WatchedInitListed (cached event drain) states
- Implemented event buffering via a stream_events queue
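
A minimal usage sketch of the proposed mode, assuming the new variant is selected through watcher::Config's public initial_list_strategy field (the same way the existing strategies are chosen) and the kube runtime feature plus tokio/anyhow; the event handler is just a placeholder:

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{watcher, watcher::InitialListStrategy},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::all(client);

    // Opt into the proposed strategy: the watch starts as soon as the first
    // list page returns, while later pages are processed and incoming watch
    // events are buffered in parallel.
    let cfg = watcher::Config {
        initial_list_strategy: InitialListStrategy::ListWatchParallel,
        ..watcher::Config::default()
    };

    watcher(pods, cfg)
        .try_for_each(|event| async move {
            // Init/InitApply/InitDone cover the paginated list phase;
            // Apply/Delete are the live watch events delivered afterwards.
            println!("{event:?}");
            Ok(())
        })
        .await?;
    Ok(())
}
```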

… resources

Initiate Watch immediately after the first List pagination completes and cache events to prevent resource version expiration when listing large-scale resources.

- Add InitialListStrategy::ListWatchParallel strategy
- Implement parallel List pagination and Watch event processing with caching
- Introduce WatchedInitPage and WatchedInitListed state machine states

Signed-off-by: xiaohuanshu <[email protected]>
@clux (Member) left a comment

Hey there. That's a high-scale use-case you have there 😄
Thanks for raising this.

This is a complex solution (to a problem I didn't know existed), and therefore I have a couple of questions/concerns about this, and a potential alternate approach.

Basically:

  1. Do you think this is temporary?
  2. Is there precedent for this in the go ecosystem?

More specifically:

  1. Would this problem be avoided with StreamingList? Streaming should give you bookmarks in between InitApply events and might make this solution unnecessary. Possibly you are blocked on that feature not yet being widely available? It is still beta in 1.33.
  2. We are probably not the first runtime library to run into this situation of cluster size and large lists. Do you know if controller-runtime / client-go has setups for this? I'm curious because we have guarantees that watcher needs to uphold for reflector, and this technically creates a watcher mode that breaks these guarantees (as well as pushing more complexity into the state machinery of reflector).

Potential idea

It feels like the parallelism could be achieved by merging two streams with an abstraction above watcher; something that starts a list and gets a bookmark, then starts a watcher from that bookmark and merges the stream. That way we might avoid making watcher much more complicated than it already is. This might require exposing some more parameters to watcher though.
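
For illustration only, a rough sketch of what such a composition might look like, written directly against Api::list / Api::watch rather than watcher itself (the helper name is made up; error handling and relist-on-410 are omitted):

```rust
use futures::{channel::mpsc, stream, Stream, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams, WatchEvent, WatchParams};

// Hypothetical helper: list the first page, start a watch from that page's
// collection resourceVersion right away, buffer its events while the remaining
// pages are fetched, then chain everything into one stream.
async fn list_then_watch(
    pods: Api<Pod>,
) -> kube::Result<impl Stream<Item = kube::Result<WatchEvent<Pod>>>> {
    // The first page establishes the collection resourceVersion, which stays
    // constant across subsequent pages.
    let mut lp = ListParams::default().limit(500);
    let first = pods.list(&lp).await?;
    let rv = first.metadata.resource_version.clone().unwrap_or_default();

    // Start watching immediately and buffer events into a channel.
    let mut watch = pods.watch(&WatchParams::default(), &rv).await?.boxed();
    let (tx, rx) = mpsc::unbounded();
    tokio::spawn(async move {
        while let Some(ev) = watch.next().await {
            if tx.unbounded_send(ev).is_err() {
                break; // receiver dropped
            }
        }
    });

    // Fetch the remaining pages while watch events accumulate in the buffer.
    let mut items = first.items;
    let mut cont = first.metadata.continue_.unwrap_or_default();
    while !cont.is_empty() {
        lp = lp.continue_token(&cont);
        let page = pods.list(&lp).await?;
        items.extend(page.items);
        cont = page.metadata.continue_.unwrap_or_default();
    }

    // Deliver the listed objects first, then the buffered and live watch events.
    let listed = stream::iter(items.into_iter().map(|p| Ok(WatchEvent::Added(p))));
    Ok(listed.chain(rx))
}
```

This yields raw WatchEvent values rather than watcher::Event, so it would not by itself uphold the guarantees reflector expects from watcher; it is only meant to show where the parallelism and the merge could live outside watcher's state machine.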

The complexity of this PR (at the moment) would make watcher hard to maintain, and hard to test.

}
// Attempt to asynchronously fetch events from the Watch Stream and cache them.
// If an error occurs at this stage, restart the list operation.
loop {
@clux (Member) commented Apr 21, 2025

This is unfortunately awkward and complex 😢
The step function (which was intended to perform a single step) can now potentially loop for a long time.

@xiaohuanshu (Author) replied

This step only buffers incremental events from the Watch stream into memory cache. The polling operation terminates immediately when no new events are available. Since the I/O operations are asynchronous and non-blocking, their overhead is negligible compared to the inherent latency of list operations. Therefore, this does not significantly increase the total list+watch processing time.


codecov bot commented Apr 21, 2025

Codecov Report

Attention: Patch coverage is 13.20755% with 92 lines in your changes missing coverage. Please review.

Project coverage is 75.2%. Comparing base (d171d26) to head (212083c).

Files with missing lines | Patch % | Lines
kube-runtime/src/watcher.rs | 13.3% | 92 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1748     +/-   ##
=======================================
- Coverage   76.2%   75.2%   -0.9%     
=======================================
  Files         84      84             
  Lines       7852    7953    +101     
=======================================
+ Hits        5977    5978      +1     
- Misses      1875    1975    +100     
Files with missing lines | Coverage Δ
kube-runtime/src/watcher.rs | 31.0% <13.3%> (-13.6%) ⬇️

InitialListStrategy::ListWatchParallel => {
// start watch
match api
.watch(&wc.to_watch_params(), &last_bookmark.clone().unwrap())
Contributor

Is this unwrap fallible?

Author

There's a preceding last_bookmark.is_none() check, so this unwrap is guaranteed to be safe here.

Contributor

The preceding check only fails if continue is also none.

Author

You are right. I will fix it later.

{
Ok(stream) => (None, State::WatchedInitPage {
continue_token,
objects: list.items.into_iter().collect(),
@SOF3 (Contributor) commented Apr 22, 2025

VecDeque<T> implements From<Vec<T>>

@SOF3 (Contributor) commented Apr 22, 2025

Are you listing with resourceVersion=0 or not in the first place?

  • If you set resourceVersion="0", apiserver ignores the limit param and always returns a full list, because the apiserver local cache doesn't support pagination. You can verify this by running kubectl get --raw /api/v1/pods?limit=10\&resourceVersion=0; it will return way more than 10 items.
  • If you set resourceVersion="", apiserver has to penetrate the local cache and list from etcd directly, which is exactly the reason why your list is so slow.

@xiaohuanshu (Author) commented Apr 22, 2025


Thank you for the thoughtful review!

While StreamingList is indeed the better and recommended solution, it's still in beta, and even after its official release it may take several years for existing clusters to upgrade.

After reviewing the client-go code, I can confirm there's currently no logic to start the watch after the first pagination.
However, the Kubernetes API documentation on retrieving large result sets in chunks (https://kubernetes.io/docs/reference/using-api/api-concepts/#retrieving-large-results-sets-in-chunks) notes: "Notice that the resourceVersion of the collection remains constant across each request… This allows you to break large requests into smaller chunks and then perform a watch operation on the full set without missing any updates."
Based on this guidance, I developed this feature. I have a cluster with 25,000+ pods where the resource version consistently expires during list operations, resulting in 410 Gone errors on every subsequent watch. This approach was previously implemented by hand in a Python SDK project and has been running stably. While recently migrating to Rust and kube-rs, I noticed the functionality was missing and submitted this PR.
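
For reference, the request sequence the docs describe looks roughly like this (token and version values are placeholders):

```sh
# First chunk: the response carries the collection's resourceVersion and a continue token
kubectl get --raw '/api/v1/pods?limit=500'

# Later chunks: pass the continue token; the collection resourceVersion stays the same
kubectl get --raw '/api/v1/pods?limit=500&continue=<token-from-previous-response>'

# Watch the full set from the resourceVersion returned with the first chunk
# (this PR starts this step right after the first chunk instead of after the last one)
kubectl get --raw '/api/v1/pods?watch=1&resourceVersion=<rv-from-first-chunk>'
```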

Perhaps my large-cluster scenario is relatively uncommon, or maybe it's because kube-rs currently uses JSON instead of protobuf, which likely makes list deserialization slower. In any case, I'm mainly providing a problem-solving approach for reference.

@xiaohuanshu (Author) commented


Thank you for the suggestion. I'm using resourceVersion="" and have verified the pagination logic is correctly implemented. One potential factor I've observed is that kube-rs currently uses JSON instead of Protobuf, which likely causes poor performance during list deserialization. This manifests as a 100% failure rate on my cluster with 25,000+ pods.

@SOF3 (Contributor) commented Apr 22, 2025

If your cluster is already that large, penetrating the apiserver cache would be a bad idea, since it can easily result in apiserver OOM, or even etcd OOM, when for example many controllers relist concurrently.

Some alternative solutions:

  • increase watch cache size in your apiserver configuration
  • start multiple reflectors with separate label/namespace selectors, at the cost of loss of inter-object strong consistency (which you apparently do not need either)
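
A minimal sketch of the second option, assuming namespace-based sharding with the standard watcher (namespace names are illustrative):

```rust
use futures::{stream, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{runtime::watcher, Api, Client};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;

    // One watcher per namespace keeps each initial list small, at the cost of
    // not having a single consistent snapshot across all pods.
    let namespaces = ["team-a", "team-b", "team-c"];

    let shards = namespaces.into_iter().map(|ns| {
        let pods: Api<Pod> = Api::namespaced(client.clone(), ns);
        watcher(pods, watcher::Config::default()).boxed()
    });

    // Merge the per-namespace event streams into one.
    let mut merged = stream::select_all(shards);
    while let Some(event) = merged.next().await {
        println!("{:?}", event?);
    }
    Ok(())
}
```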

@xiaohuanshu (Author) commented


Strong consistency must be guaranteed. The first-pagination-initiated watch approach has been running stably for over 2 years in my Python-based production system, demonstrating production-grade reliability.
