Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop watches on controller stop #2983

Open
basti1302 opened this issue Oct 14, 2024 · 5 comments
Open

Stop watches on controller stop #2983

basti1302 opened this issue Oct 14, 2024 · 5 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@basti1302
Copy link

basti1302 commented Oct 14, 2024

When creating a controller with controller.Start(ctx); and later stopping it by cancelling the provided context, watches defined on the controller will continue to trigger their event handler. Since the watch is owned by the controller, I would expect all watches defined on the controller to be terminated once the controller is terminated.

A reproduction repository is here: https://github.com/dash0hq/controller-runtime-reproducer/tree/main

This can be reproduced with this test script contained in the repository, which makes sure to continuously create events that will trigger the watch (if it is active).

This produces the following output:

2024-10-14T14:48:39Z    INFO    setup    successfully created a new watch
2024-10-14T14:48:39Z    INFO    setup    starting manager
2024-10-14T14:48:39Z    INFO    Starting EventSource    {"controller": "example_controller", "source": "kind source: *v1.Pod"}
2024-10-14T14:48:39Z    INFO    Starting Controller    {"controller": "example_controller"}
2024-10-14T14:48:39Z    INFO    starting server    {"name": "health probe", "addr": ":8081"}
2024-10-14T14:48:39Z    INFO    controller-runtime.metrics    Starting metrics server
2024-10-14T14:48:39Z    INFO    controller-runtime.metrics    Serving metrics server    {"bindAddress": ":8080", "secure": false}
2024-10-14T14:48:39Z    INFO    received create event
...
2024-10-14T14:48:39Z    INFO    received create event
2024-10-14T14:48:39Z    INFO    Starting workers    {"controller": "example_controller", "worker count": 1}
2024-10-14T14:48:39Z    INFO    received update event
2024-10-14T14:48:49Z    INFO    received update event
2024-10-14T14:48:53Z    INFO    received update event
2024-10-14T14:49:09Z    INFO    setup    stopping controller/cancelling controller context
2024-10-14T14:49:09Z    INFO    setup    controller context has been cancelled
2024-10-14T14:49:09Z    INFO    Shutdown signal received, waiting for all workers to finish    {"controller": "example_controller"}
2024-10-14T14:49:09Z    INFO    All workers finished    {"controller": "example_controller"}
2024-10-14T14:49:09Z    INFO    setup    controller has been stopped
2024-10-14T14:49:23Z    INFO    received update event
2024-10-14T14:49:24Z    INFO    received update event
2024-10-14T14:49:24Z    INFO    received update event
2024-10-14T14:49:24Z    INFO    received delete event
2024-10-14T14:49:29Z    INFO    received create event
2024-10-14T14:49:29Z    INFO    received update event
2024-10-14T14:49:29Z    INFO    received update event
...

As you can see, the event handler receives updates after the controller has been stopped.


In case you are curious about the wider context: My actual use case for this is stopping/removing a watch dynamically. I want to handle create/update events for third party resource types (monitoring.coreos.com.PrometheusRule for example). I do not know in advance whether the third party CRD is deployed or not. Thus I have a reconciler watching apiextensionsv1.CustomResourceDefinition with a filter predicate. If the CRD in question is created, I start a new controller/reconciler watching that resource type. If the CRD is deleted later, I really would like to stop or remove the watch. Otherwise an error message is emitted to the logs every couple of seconds: "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch monitoring.coreos.com/v1, Kind=PrometheusRule: the server could not find the requested resource" logger="UnhandledError". This apparently happens within controller-runtime.

So far I have not found any way to stop or remove a watch.

These previous issues seem to be related, but it does not seem that any of them ever resulted in something that allows stopping watches.

@alvaroaleman
Copy link
Member

If the CRD is deleted later, I really would like to stop or remove the watch

This should be possible by calling RemoveInformer. It is your responsibility to make sure this is actually unused though.

Since the watch is owned by the controller, I would expect all watches defined on the controller to be terminated once the controller is terminated.

No, this is not correct. The controller owns the source but it does not own the watch, because the watch is shared among possibly many sources. It also looks like we don't properly stop the source, but it seems what you want is the watch to be stopped (which implicitly stops all sources that use the watch) and not only the source, so fixing that wouldn't actually solve your issue.

/retitle Stop watches on controller stop

@k8s-ci-robot k8s-ci-robot changed the title Watch event handler continues to be triggered after controller has been stopped Stop watches on controller stop Oct 14, 2024
@basti1302
Copy link
Author

Thanks for the fast response and the pointer to RemoveInformer. I was able to start and stop watching a third-party resource with this now. (For people having a similar requirement, the code is here).

Unfortunately, this only seems to work cleanly once. To be able to react correctly to arbitrary sequences of "third-party CRD is installed", "third-party CRD is deleted", "third-party CRD is installed", etc. there would need to be a way to revert the RemoveInformer call, that is, add the informer back and start it again I suppose.

@alvaroaleman
Copy link
Member

Re-Adding the informer is no problem, but the issue is that the source references the now-removed informer. You will have to re-construct the controller, starting the one you previously stopped will not work.

@basti1302
Copy link
Author

basti1302 commented Oct 27, 2024

Thanks a lot for the guidance, that works nicely! 👍

For other people with a similar use case, the updated code is here and a small test script is here.

Just for the record, this approach requires to use the SkipNameValidation option when (re-)creating the controller. I think the controller name is not removed from the set of used controller names when the controller is stopped by cancelling its context. Maybe that is by design? If it is not by design, do you want me to create a separate issue for that?

@alvaroaleman FYI I'm fine with closing this issue, since my actual use case is solved. But I assume you would like to keep it open for tracking the more general issue (stopping watches when the controller is stopped)?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

4 participants