Watch: implement list-then-watch pattern for resilient resource_version management#2519
Watch: implement list-then-watch pattern for resilient resource_version management#2519
Conversation
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…: stale resourceVersion after 410) Co-authored-by: brendandburns <5751682+brendandburns@users.noreply.github.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Copilot The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
…cluded_params set Co-authored-by: brendandburns <5751682+brendandburns@users.noreply.github.com>
|
/assign |
| retry_after_410 = False | ||
| deserialize = kwargs.pop('deserialize', True) | ||
|
|
||
| # If no resource_version is specified for a watch (not a log follow), |
There was a problem hiding this comment.
When Watch.stream() is called without a resource_version, the stored self.resource_version reflects the last individual object event seen (e.g., a namespace last modified weeks ago). After 30–60 min of inactivity, this stale version falls outside the API server's watch cache, causing a 410 on retry and an unrecoverable ApiException.
I agree with the problem statement. Without this PR, when the problem happens, a user needs to carefully decide how to resume watching.
- Watching with
resourceVersion: Noneis obviously not reliable - First doing a LIST, then watching with the RV from the LIST is preferable. This is also how watch should be used at the very first place.
I'm concerned about the behavior change in this PR.
- For users who know how to list and watch, their applications probably already perform list-and-watch upon start, and also when 410 happens. Basically, this PR won't affect these users.
- For users who don't do list and watch in their application, I'm concerned if their application would tolerate this new behavior. I'm not sure what use cases there are when calling watch with
resourceVersion: None, and I'm not sure if this PR would break their use cases... I almost feel if we should have validated thatresourceVersioncannot beNone. @brendandburns WDYT?
When
Watch.stream()is called without aresource_version, the storedself.resource_versionreflects the last individual object event seen (e.g., a namespace last modified weeks ago). After 30–60 min of inactivity, this stale version falls outside the API server's watch cache, causing a 410 on retry and an unrecoverableApiException.Changes
watch.py— list-then-watch pattern: Before entering the watch loop, when noresource_versionis specified, performs an initial list call (watch=False) and extractsmetadata.resource_versionfrom the list response. This collection-level version is always current and valid for watch restarts. Parameters irrelevant to listing (watch,_preload_content,allow_watch_bookmarks,timeout_seconds) are excluded from the list call.Initial items yielded as
ADDEDevents: Existing items from the list are emitted before the watch stream begins, maintaining backward-compatible behavior. Respects thedeserializeflag.watch_test.py— updated assertions + new test: Six existing tests updated to account for the two-call pattern (list + watch). New testtest_watch_with_initial_list_resource_versionvalidates end-to-end: existing items from the list appear first asADDEDevents, the watch starts from the list'sresource_version, and subsequent events are delivered correctly.Example
Original prompt
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.