
feat(watcher): introduce ListWatchParallel mode to handle large-scale resources #1748


Open
wants to merge 1 commit into main

Conversation

xiaohuanshu

Motivation

When monitoring large-scale Kubernetes resources, the traditional ListWatch pattern requires waiting for the full list to complete before starting watch operations. This creates a significant risk of the resource_version expiring during long list operations, resulting in 410 Gone errors that force a full re-list. The problem intensifies in clusters with 10,000+ resources, where the initial list operation may take minutes.

Solution

Introduces a new ListWatchParallel strategy that:

- Initiates the watch immediately after the first list pagination
- Processes subsequent list paginations and watch events in parallel
- Guarantees ordered delivery of list data before cached watch events
- Maintains version consistency using the first pagination's resource version

- Added an InitialListStrategy::ListWatchParallel variant
- Extended the state machine with WatchedInitPage (parallel processing) and WatchedInitListed (cached event drain) states
- Implemented event buffering via a stream_events queue
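
A minimal usage sketch of the proposed mode, assuming the new variant is selected through watcher::Config's public initial_list_strategy field (the same way the existing strategies are chosen) and the kube runtime feature plus tokio/anyhow; the event handler is just a placeholder:

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{watcher, watcher::InitialListStrategy},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::all(client);

    // Opt into the proposed strategy: the watch starts as soon as the first
    // list page returns, while later pages are processed and incoming watch
    // events are buffered in parallel.
    let cfg = watcher::Config {
        initial_list_strategy: InitialListStrategy::ListWatchParallel,
        ..watcher::Config::default()
    };

    watcher(pods, cfg)
        .try_for_each(|event| async move {
            // Init/InitApply/InitDone cover the paginated list phase;
            // Apply/Delete are the live watch events delivered afterwards.
            println!("{event:?}");
            Ok(())
        })
        .await?;
    Ok(())
}
```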

… resources

Initiate Watch immediately after the first List pagination completes and cache events to prevent resource version expiration when listing large-scale resources.

- Add InitialListStrategy::ListWatchParallel strategy
- Implement parallel List pagination and Watch event processing with caching
- Introduce WatchedInitPage and WatchedInitListed state machine states

Signed-off-by: xiaohuanshu <[email protected]>
@clux (Member) left a comment

Hey there. That's a high-scale use-case you have there 😄
Thanks for raising this.

This is a complex solution (to a problem I didn't know existed), and therefore I have a couple of questions/concerns about this, and a potential alternate approach.

Basically:

  1. Do you think this is temporary?
  2. Is there precedent for this in the go ecosystem?

More specifically:

  1. Would this problem be avoided with StreamingList? Streaming should give you bookmarks in between InitApply events and might make this solution unnecessary. Possibly you are blocked on that feature not yet being widely available? It is still beta in 1.33.
  2. We are probably not the first runtime library to run into this situation of cluster size and large lists. Do you know if controller-runtime / client-go has setups for this? I'm curious because we have guarantees that watcher needs to uphold for reflector, and this technically creates a watcher mode that breaks these guarantees (as well as pushing more complexity into the state machinery of reflector).

Potential idea

It feels like the parallelism could be achieved by merging two streams with an abstraction above watcher; something that starts a list and gets a bookmark, then starts a watcher from that bookmark and merges the stream. That way we might avoid making watcher much more complicated than it already is. This might require exposing some more parameters to watcher though.
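
For illustration only, a rough sketch of what such a composition might look like, written directly against Api::list / Api::watch rather than watcher itself (the helper name is made up; error handling and relist-on-410 are omitted):

```rust
use futures::{channel::mpsc, stream, Stream, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams, WatchEvent, WatchParams};

// Hypothetical helper: list the first page, start a watch from that page's
// collection resourceVersion right away, buffer its events while the remaining
// pages are fetched, then chain everything into one stream.
async fn list_then_watch(
    pods: Api<Pod>,
) -> kube::Result<impl Stream<Item = kube::Result<WatchEvent<Pod>>>> {
    // The first page establishes the collection resourceVersion, which stays
    // constant across subsequent pages.
    let mut lp = ListParams::default().limit(500);
    let first = pods.list(&lp).await?;
    let rv = first.metadata.resource_version.clone().unwrap_or_default();

    // Start watching immediately and buffer events into a channel.
    let mut watch = pods.watch(&WatchParams::default(), &rv).await?.boxed();
    let (tx, rx) = mpsc::unbounded();
    tokio::spawn(async move {
        while let Some(ev) = watch.next().await {
            if tx.unbounded_send(ev).is_err() {
                break; // receiver dropped
            }
        }
    });

    // Fetch the remaining pages while watch events accumulate in the buffer.
    let mut items = first.items;
    let mut cont = first.metadata.continue_.unwrap_or_default();
    while !cont.is_empty() {
        lp = lp.continue_token(&cont);
        let page = pods.list(&lp).await?;
        items.extend(page.items);
        cont = page.metadata.continue_.unwrap_or_default();
    }

    // Deliver the listed objects first, then the buffered and live watch events.
    let listed = stream::iter(items.into_iter().map(|p| Ok(WatchEvent::Added(p))));
    Ok(listed.chain(rx))
}
```

This yields raw WatchEvent values rather than watcher::Event, so it would not by itself uphold the guarantees reflector expects from watcher; it is only meant to show where the parallelism and the merge could live outside watcher's state machine.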

The complexity of this PR (at the moment) would make watcher hard to maintain, and hard to test.

}
// Attempt to asynchronously fetch events from the Watch Stream and cache them.
// If an error occurs at this stage, restart the list operation.
loop {
@clux (Member) commented Apr 21, 2025

This is unfortunately awkward and complex 😢
The step function (which was intended to perform a single step) can now potentially loop for a long time.

@xiaohuanshu (Author) replied

This step only buffers incremental events from the Watch stream into memory cache. The polling operation terminates immediately when no new events are available. Since the I/O operations are asynchronous and non-blocking, their overhead is negligible compared to the inherent latency of list operations. Therefore, this does not significantly increase the total list+watch processing time.


codecov bot commented Apr 21, 2025

Codecov Report

Attention: Patch coverage is 13.20755% with 92 lines in your changes missing coverage. Please review.

Project coverage is 75.2%. Comparing base (d171d26) to head (212083c).

Files with missing lines | Patch % | Lines
kube-runtime/src/watcher.rs | 13.3% | 92 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1748     +/-   ##
=======================================
- Coverage   76.2%   75.2%   -0.9%     
=======================================
  Files         84      84             
  Lines       7852    7953    +101     
=======================================
+ Hits        5977    5978      +1     
- Misses      1875    1975    +100     
Files with missing lines | Coverage Δ
kube-runtime/src/watcher.rs | 31.0% <13.3%> (-13.6%) ⬇️

InitialListStrategy::ListWatchParallel => {
// start watch
match api
.watch(&wc.to_watch_params(), &last_bookmark.clone().unwrap())
Contributor

Is this unwrap fallible?

Author

There's a preceding last_bookmark.is_none() check, so this unwrap is guaranteed to be safe here.

Contributor

The preceding check only fails if continue is also none.

Author

You are right. I will fix it later.

{
Ok(stream) => (None, State::WatchedInitPage {
continue_token,
objects: list.items.into_iter().collect(),
@SOF3 (Contributor) commented Apr 22, 2025

VecDeque<T> implements From<Vec<T>>

@SOF3 (Contributor) commented Apr 22, 2025

Are you listing with resourceVersion=0 or not in the first place?

  • If you set resourceVersion="0", apiserver ignores the limit param and always returns a full list, because the apiserver local cache doesn't support pagination. You can verify this by running kubectl get --raw /api/v1/pods?limit=10\&resourceVersion=0; it will return way more than 10 items.
  • If you set resourceVersion="", apiserver has to penetrate the local cache and list from etcd directly, which is exactly the reason why your list is so slow.

@xiaohuanshu (Author) commented Apr 22, 2025


Thank you for the thoughtful review!

While StreamingList is indeed the better and recommended solution, it's still in beta, and even after its official release it may take several years for existing clusters to upgrade.

After reviewing the client-go code, I can confirm there's currently no logic to start the watch after the first pagination.
However, the Kubernetes API documentation on retrieving large result sets in chunks (https://kubernetes.io/docs/reference/using-api/api-concepts/#retrieving-large-results-sets-in-chunks) notes: "Notice that the resourceVersion of the collection remains constant across each request… This allows you to break large requests into smaller chunks and then perform a watch operation on the full set without missing any updates."
Based on this guidance, I developed this feature. I have a cluster with 25,000+ pods where the resource version consistently expires during list operations, resulting in 410 Gone errors on every subsequent watch. This approach was previously implemented by hand in a Python SDK project and has been running stably. While recently migrating to Rust and kube-rs, I noticed the functionality was missing and submitted this PR.
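
For reference, the request sequence the docs describe looks roughly like this (token and version values are placeholders):

```sh
# First chunk: the response carries the collection's resourceVersion and a continue token
kubectl get --raw '/api/v1/pods?limit=500'

# Later chunks: pass the continue token; the collection resourceVersion stays the same
kubectl get --raw '/api/v1/pods?limit=500&continue=<token-from-previous-response>'

# Watch the full set from the resourceVersion returned with the first chunk
# (this PR starts this step right after the first chunk instead of after the last one)
kubectl get --raw '/api/v1/pods?watch=1&resourceVersion=<rv-from-first-chunk>'
```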

Perhaps my large-cluster scenario is relatively uncommon, or maybe it's because kube-rs currently uses JSON instead of protobuf, which likely makes list deserialization slower. In any case, I'm mainly providing a problem-solving approach for reference.

@xiaohuanshu (Author) commented


Thank you for the suggestion. I'm using resourceVersion="" and have verified the pagination logic is correctly implemented. One potential factor I've observed is that kube-rs currently uses JSON instead of Protobuf, which likely causes poor performance during list deserialization. This manifests as a 100% failure rate on my cluster with 25,000+ pods.

@SOF3 (Contributor) commented Apr 22, 2025

If your cluster is already that large, penetrating the apiserver cache would be a bad idea, since it can easily result in apiserver OOM, or even etcd OOM, when for example many controllers relist concurrently.

Some alternative solutions:

  • increase watch cache size in your apiserver configuration
  • start multiple reflectors with separate label/namespace selectors, at the cost of loss of inter-object strong consistency (which you apparently do not need either)
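
A minimal sketch of the second option, assuming namespace-based sharding with the standard watcher (namespace names are illustrative):

```rust
use futures::{stream, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{runtime::watcher, Api, Client};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;

    // One watcher per namespace keeps each initial list small, at the cost of
    // not having a single consistent snapshot across all pods.
    let namespaces = ["team-a", "team-b", "team-c"];

    let shards = namespaces.into_iter().map(|ns| {
        let pods: Api<Pod> = Api::namespaced(client.clone(), ns);
        watcher(pods, watcher::Config::default()).boxed()
    });

    // Merge the per-namespace event streams into one.
    let mut merged = stream::select_all(shards);
    while let Some(event) = merged.next().await {
        println!("{:?}", event?);
    }
    Ok(())
}
```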

@xiaohuanshu (Author) commented


Strong consistency must be guaranteed. The first-pagination-initiated watch approach has been running stably for over 2 years in my Python-based production system, demonstrating production-grade reliability.
