doc-controller pod keeps dying #143

Open
boldandbusted opened this issue Feb 21, 2025 · 5 comments

boldandbusted commented Feb 21, 2025

Howdy. I'm using this repo's controller template to learn more about the K8s Operator pattern, so apologies in advance if I don't report the right info first time around. This is in minikube, run with the following options:

$ minikube start -n 2 --memory=16G --cni='cilium' --cpus=4 -c='containerd'

I've gone through the README's instructions, and I see that the pod starts, stays alive for a bit, then dies. Here's the pod's log:

2025-02-21T01:44:06.726554Z  INFO actix_server::builder: starting 1 workers
2025-02-21T01:44:06.726730Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-02-21T01:44:06.726735Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:8080", workers: 1, listening on: 0.0.0.0:8080
2025-02-21T01:44:06.727620Z DEBUG HTTP: kube_client::client::builder: requesting http.method=GET http.url=https://10.96.0.1/apis/kube.rs/v1/documents?&limit=1 otel.name="list" otel.kind="client"
2025-02-21T01:44:36.728858Z ERROR HTTP: kube_client::client::builder: failed with error client error (Connect) http.method=GET http.url=https://10.96.0.1/apis/kube.rs/v1/documents?&limit=1 otel.name="list" otel.kind="client" otel.status_code="ERROR"
2025-02-21T01:44:36.728883Z ERROR controller::controller: CRD is not queryable; Service(hyper_util::client::legacy::Error(Connect, Custom { kind: TimedOut, error: Elapsed(()) })). Is the CRD installed?
2025-02-21T01:44:36.728888Z  INFO controller::controller: Installation: cargo run --bin crdgen | kubectl apply -f -
Stream closed EOF for default/doc-controller-8449856bdc-9wr7j (doc-controller)

I've verified that I've got the CRDs installed.

$ kubectl get crds -A | xclip
NAME                                         CREATED AT
alertmanagerconfigs.monitoring.coreos.com    2025-02-20T23:57:55Z
alertmanagers.monitoring.coreos.com          2025-02-20T23:57:56Z
ciliumcidrgroups.cilium.io                   2025-02-20T23:41:10Z
ciliumclusterwidenetworkpolicies.cilium.io   2025-02-20T23:41:11Z
ciliumendpoints.cilium.io                    2025-02-20T23:41:10Z
ciliumexternalworkloads.cilium.io            2025-02-20T23:41:10Z
ciliumidentities.cilium.io                   2025-02-20T23:41:10Z
ciliuml2announcementpolicies.cilium.io       2025-02-20T23:41:10Z
ciliumloadbalancerippools.cilium.io          2025-02-20T23:41:10Z
ciliumnetworkpolicies.cilium.io              2025-02-20T23:41:11Z
ciliumnodeconfigs.cilium.io                  2025-02-20T23:41:10Z
ciliumnodes.cilium.io                        2025-02-20T23:41:10Z
ciliumpodippools.cilium.io                   2025-02-20T23:41:10Z
documents.kube.rs                            2025-02-20T23:43:19Z
podmonitors.monitoring.coreos.com            2025-02-20T23:57:56Z
probes.monitoring.coreos.com                 2025-02-20T23:57:56Z
prometheusagents.monitoring.coreos.com       2025-02-20T23:57:56Z
prometheuses.monitoring.coreos.com           2025-02-20T23:57:56Z
prometheusrules.monitoring.coreos.com        2025-02-20T23:57:56Z
scrapeconfigs.monitoring.coreos.com          2025-02-20T23:57:56Z
servicemonitors.monitoring.coreos.com        2025-02-20T23:57:56Z
thanosrulers.monitoring.coreos.com           2025-02-20T23:57:57Z

I can see the page via port-forwarding. (see screenshot)

Image

Even /metrics:

# HELP doc_ctrl_reconcile_duration_seconds reconcile duration.
# TYPE doc_ctrl_reconcile_duration_seconds histogram
# UNIT doc_ctrl_reconcile_duration_seconds seconds
doc_ctrl_reconcile_duration_seconds_sum 0.0
doc_ctrl_reconcile_duration_seconds_count 0
doc_ctrl_reconcile_duration_seconds_bucket{le="0.01"} 0
doc_ctrl_reconcile_duration_seconds_bucket{le="0.1"} 0
doc_ctrl_reconcile_duration_seconds_bucket{le="0.25"} 0
doc_ctrl_reconcile_duration_seconds_bucket{le="0.5"} 0
doc_ctrl_reconcile_duration_seconds_bucket{le="1.0"} 0
doc_ctrl_reconcile_duration_seconds_bucket{le="5.0"} 0
doc_ctrl_reconcile_duration_seconds_bucket{le="15.0"} 0
doc_ctrl_reconcile_duration_seconds_bucket{le="60.0"} 0
doc_ctrl_reconcile_duration_seconds_bucket{le="+Inf"} 0
# HELP doc_ctrl_reconcile_failures reconciliation errors.
# TYPE doc_ctrl_reconcile_failures counter
# HELP doc_ctrl_reconcile_runs reconciliations.
# TYPE doc_ctrl_reconcile_runs counter
doc_ctrl_reconcile_runs_total 0
# EOF

Do please let me know if there's anything I can do to aid in troubleshooting this. Thanks!

@boldandbusted
Author

Turning DEBUG on for every match didn't add much - just one line from tower::buffer::worker:

2025-02-21T02:36:40.711562Z  INFO actix_server::builder: starting 1 workers
2025-02-21T02:36:40.711775Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
2025-02-21T02:36:40.711777Z DEBUG tower::buffer::worker: service.ready=true processing request
2025-02-21T02:36:40.711797Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:8080", workers: 1, listening on: 0.0.0.0:8080
2025-02-21T02:36:40.712528Z DEBUG HTTP: kube_client::client::builder: requesting http.method=GET http.url=https://10.96.0.1/apis/kube.rs/v1/documents?&limit=1 otel.name="list" otel.kind="client"
2025-02-21T02:36:40.712598Z DEBUG HTTP: hyper_util::client::legacy::connect::http: connecting to 10.96.0.1:443 http.method=GET http.url=https://10.96.0.1/apis/kube.rs/v1/documents?&limit=1 otel.name="list" otel.kind="client"
2025-02-21T02:37:10.713383Z ERROR HTTP: kube_client::client::builder: failed with error client error (Connect) http.method=GET http.url=https://10.96.0.1/apis/kube.rs/v1/documents?&limit=1 otel.name="list" otel.kind="client" otel.status_code="ERROR"
2025-02-21T02:37:10.713420Z ERROR controller::controller: CRD is not queryable; Service(hyper_util::client::legacy::Error(Connect, Custom { kind: TimedOut, error: Elapsed(()) })). Is the CRD installed?
2025-02-21T02:37:10.713425Z  INFO controller::controller: Installation: cargo run --bin crdgen | kubectl apply -f -
Stream closed EOF for default/doc-controller-6449788d9c-cpwfs (doc-controller)

@boldandbusted
Author

And, FWIW, I enabled Hubble to see if there's any communication being blocked and I see none. No other odd Events on the cluster. Calling it a night on this. Thanks for the attention! :)

@clux
Member

clux commented Feb 21, 2025

hm, the error should originate from the api.list call on the crd, so if it actually is installed, then it's likely network policies. Timeout is also indicative of this.

Maybe try turning off the netpol if you're installing it with that option. The netpol that I wrote is not really tuned for cilium because it's doing the awkward thing of fetching the apiserver IP (edit: that's CI only, it should work for cilium with 0.0.0.0). I wrote some notes on this in the netpol guide before. Basically, you should be able to use toEntities: [ kube-apiserver ] with cilium netpols.
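
For reference, the toEntities suggestion above could be sketched as a CiliumNetworkPolicy along these lines. This is a minimal sketch, not the template's own policy: the policy name and the app: doc-controller label are assumptions (check the labels on the actual pod), and the default namespace is taken from the pod name in the logs. No port restriction is set, because Cilium matches on the API server's real backend port rather than the 443 service port seen in the logs.

```yaml
# Hypothetical policy: name and pod label are assumptions, not from the repo.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: doc-controller-allow-apiserver
  namespace: default
spec:
  # Select the controller pod; adjust to match its real labels.
  endpointSelector:
    matchLabels:
      app: doc-controller
  egress:
    # Allow egress to the Kubernetes API server via Cilium's built-in entity,
    # instead of hard-coding the API server IP.
    - toEntities:
        - kube-apiserver
```

Once an egress rule selects the pod, other outbound traffic it needs (DNS, for example) would likely need its own allow rules as well.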

@boldandbusted
Author

Thank you, @clux! I will see about converting the Network Policy to the Cilium schema and see how much I break. :D

@boldandbusted
Author

P.S. If there's something around this I can contribute back, I'm happy to, if the contribution is welcome.
