feat(watcher): introduce ListWatchParallel mode to handle large-scale resources #1748
base: main
Conversation
… resources

Initiate Watch immediately after the first List pagination completes and cache events to prevent resource version expiration when listing large-scale resources.

- Add InitialListStrategy::ListWatchParallel strategy
- Implement parallel List pagination and Watch event processing with caching
- Introduce WatchedInitPage and WatchedInitListed state machine states

Signed-off-by: xiaohuanshu <[email protected]>
Hey there. That's a high-scale use-case you have there 😄
Thanks for raising this.
This is a complex solution (to a problem I didn't know existed), and therefore I have a couple of questions/concerns about this, and a potential alternate approach.
Basically:

- Do you think this is temporary?
- Is there precedent for this in the Go ecosystem?

More specifically:

- Would this problem be avoided with `StreamingList`? Streaming should give you bookmarks in between `InitApply` events and might make this solution unnecessary. Possibly you are blocked on the feature not yet being widely available? It is still beta in 1.33.
- We are probably not the first runtime library to run into this situation of cluster size and large lists. Do you know if controller-runtime / client-go has setups for this? I'm curious because we have guarantees that `watcher` needs to uphold for `reflector`, and this technically creates a `watcher` mode that breaks these guarantees (as well as pushing more complexity into the state machinery of `reflector`).
Potential idea

It feels like the parallelism could be achieved by merging two streams with an abstraction above `watcher`; something that starts a list and gets a bookmark, then starts a watcher from that bookmark and merges the stream. That way we might avoid making `watcher` much more complicated than it already is. This might require exposing some more parameters to `watcher`, though.
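For illustration, a rough sketch of that shape using only kube's public `Api::list` / `Api::watch` (Pod as a placeholder resource, `list_then_watch` a hypothetical helper; pagination, relist-on-410, and error handling are all omitted):

```rust
use futures::{stream, Stream, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams, WatchEvent, WatchParams};

// List once, remember the list's resourceVersion, then start a raw watch
// from that version and chain both into a single event stream.
async fn list_then_watch(
    api: Api<Pod>,
) -> kube::Result<impl Stream<Item = kube::Result<WatchEvent<Pod>>>> {
    // Initial list; its resourceVersion anchors the subsequent watch.
    let list = api.list(&ListParams::default()).await?;
    let rv = list.metadata.resource_version.clone().unwrap_or_default();

    // Surface the listed objects as synthetic Added events...
    let initial = stream::iter(
        list.items
            .into_iter()
            .map(|p| Ok::<_, kube::Error>(WatchEvent::Added(p))),
    );

    // ...then continue with live events from a watch started at `rv`.
    let live = api.watch(&WatchParams::default(), &rv).await?;
    Ok(initial.chain(live))
}
```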
The complexity of this PR (at the moment) would make watcher hard to maintain, and hard to test.
}
// Attempt to asynchronously fetch events from the Watch Stream and cache them.
// If an error occurs at this stage, restart the list operation.
loop {
This is unfortunately awkward and complex 😢
The `step` function (which was intended to do a single step) can now potentially loop for a long time.
This step only buffers incremental events from the Watch stream into memory cache. The polling operation terminates immediately when no new events are available. Since the I/O operations are asynchronous and non-blocking, their overhead is negligible compared to the inherent latency of list operations. Therefore, this does not significantly increase the total list+watch processing time.
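As a rough illustration of that non-blocking draining (a generic sketch, not the PR's code; `drain_ready` is a hypothetical name):

```rust
use futures::{FutureExt, Stream, StreamExt};
use std::collections::VecDeque;

// Pull items from the stream only while one is immediately ready, so the
// call returns as soon as the stream would otherwise block.
fn drain_ready<S>(stream: &mut S, buffer: &mut VecDeque<S::Item>)
where
    S: Stream + Unpin,
{
    // `now_or_never` resolves the `next()` future only if an item is already
    // available; a pending (or exhausted) stream ends the loop.
    while let Some(Some(event)) = stream.next().now_or_never() {
        buffer.push_back(event);
    }
}
```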
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
##            main   #1748     +/-   ##
=======================================
- Coverage   76.2%   75.2%    -0.9%
=======================================
  Files         84      84
  Lines       7852    7953    +101
=======================================
+ Hits        5977    5978      +1
- Misses      1875    1975    +100
InitialListStrategy::ListWatchParallel => {
    // start watch
    match api
        .watch(&wc.to_watch_params(), &last_bookmark.clone().unwrap())
Is this unwrap fallible?
There's a preceding `last_bookmark.is_none()` check, so this unwrap is guaranteed to be safe here.
The preceding check only fails if `continue` is also none.
You are right. I will fix it later.
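For reference, one possible shape for that fix (a sketch under assumed types, not necessarily how this PR will resolve it): accept the bookmark as an `Option` and let the caller fall back to re-listing when it is absent, instead of unwrapping.

```rust
use futures::Stream;
use kube::api::{Api, WatchEvent, WatchParams};
use serde::de::DeserializeOwned;
use std::fmt::Debug;

// Returns None when no bookmark is available yet, signalling the caller to
// restart the list rather than panicking on an unwrap.
async fn watch_from_bookmark<K>(
    api: &Api<K>,
    wp: &WatchParams,
    last_bookmark: Option<&str>,
) -> Option<kube::Result<impl Stream<Item = kube::Result<WatchEvent<K>>>>>
where
    K: Clone + DeserializeOwned + Debug,
{
    let rv = last_bookmark?;
    Some(api.watch(wp, rv).await)
}
```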
{
    Ok(stream) => (None, State::WatchedInitPage {
        continue_token,
        objects: list.items.into_iter().collect(),
`VecDeque<T>` implements `From<Vec<T>>`.

Are you listing with resourceVersion=0 or not in the first place?
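A minimal illustration of that `From` impl (standard library only; `to_queue` is just an example name):

```rust
use std::collections::VecDeque;

// `VecDeque<T>` implements `From<Vec<T>>`, so no explicit collect is needed.
fn to_queue<T>(items: Vec<T>) -> VecDeque<T> {
    VecDeque::from(items)
}
```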
Thank you for the thoughtful review! While StreamingList is indeed a better and more recommended solution, it's still in beta. Even after its official release, it may take several years for existing clusters to complete upgrades. After reviewing the client-go code, I can confirm there's currently no logic to start a watch after the first pagination. Perhaps my large-cluster scenario is relatively uncommon, or maybe it's because kube-rs currently uses JSON instead of protobuf, which might make list deserialization slower. In any case, I'm mainly providing a problem-solving approach for reference.
Thank you for the suggestion. I'm using resourceVersion="" and have verified the pagination logic is correctly implemented. One potential factor I've observed is that kube-rs currently uses JSON instead of Protobuf, which likely causes poor performance during list deserialization. This manifests as a 100% failure rate on my cluster with 25,000+ pods.
If your cluster is already so large, penetrating the apiserver cache would be a bad idea, since this could easily result in apiserver OOM or even etcd OOM when, for example, many controllers are relisting concurrently. Some alternative solutions:
Strong consistency must be guaranteed. The first-pagination-initiated watch approach has been running stably for over 2 years in my Python-based production system, demonstrating production-grade reliability.
Motivation
When monitoring large-scale Kubernetes resources, the traditional ListWatch pattern requires waiting for full list completion before starting watch operations. This creates a significant risk of the resource_version becoming expired during long list operations, resulting in 410 Gone errors that force full re-list operations. The problem intensifies in clusters with 10,000+ resources where initial list operations may take minutes.
Solution
Introduces a new ListWatchParallel strategy that:

- Initiates watch immediately after the first list pagination
- Processes subsequent list paginations and watch events in parallel
- Guarantees ordered delivery of list data before cached watch events
- Maintains version consistency using the first pagination's resource version

Implementation:

- Added InitialListStrategy::ListWatchParallel variant
- Extended state machine with WatchedInitPage (parallel processing) and WatchedInitListed (cached event drain) states
- Implemented event buffering via a stream_events queue
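For context, a hedged sketch of how the proposed strategy might be selected from user code, mirroring how the existing `InitialListStrategy` variants are configured today (the `ListWatchParallel` variant exists only in this PR; the `initial_list_strategy` field name assumes the current `watcher::Config`):

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{watcher, WatchStreamExt},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::all(client);

    // Proposed behaviour: start the watch right after the first list page and
    // buffer watch events until the paged list has been fully delivered.
    let cfg = watcher::Config {
        initial_list_strategy: watcher::InitialListStrategy::ListWatchParallel,
        ..Default::default()
    };

    watcher(pods, cfg)
        .applied_objects()
        .try_for_each(|p| async move {
            println!("saw {}", p.metadata.name.as_deref().unwrap_or_default());
            Ok(())
        })
        .await?;
    Ok(())
}
```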