What is the issue?
If Argo workflow pods are injected with linkerd-proxy, viz Prometheus will still attempt to scrape metrics from them once they reach a completed state, resulting in a high rate of 504s:
{"caller":"scrape.go:1400","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"linkerd-proxy","target":"http://10.3.136.62:4191/metrics","ts":"2024-11-19T01:47:24.385Z"}
The linkerd-viz Prometheus should be smart enough not to attempt to scrape metrics from completed pods. The Argo server can be configured to keep a number of completed workflow pods around before they are deleted, which is desirable for troubleshooting, for example.
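For reference, plain Prometheus can already express this filter: its Kubernetes service discovery exposes the pod phase as a meta label that relabeling can drop on. A minimal sketch, assuming the bundled viz Prometheus discovers proxies via kubernetes_sd_configs with the pod role (the linkerd-proxy scrape_pool in the log above suggests it does); this is standard Prometheus relabel_configs, not a Linkerd-specific setting:

# sketch: added to the linkerd-proxy scrape job in the viz Prometheus config
relabel_configs:
  # drop targets whose pod has reached a terminal phase, so completed
  # (Succeeded) or Failed workflow pods are no longer scraped
  - source_labels: [__meta_kubernetes_pod_phase]
    action: drop
    regex: Succeeded|Failed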
How can it be reproduced?
Create a meshed Argo workflow pod. When it completes, Prometheus will try to scrape metrics from the now-unresponsive pod and get a 504.
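A rough sketch of those steps (the namespace, workflow file, and deploy/prometheus container names are illustrative and assume a default linkerd-viz install):

# mesh the workflow namespace so new workflow pods get the proxy injected
kubectl annotate namespace argo-workflows linkerd.io/inject=enabled
# run any short workflow and wait for it to finish
argo submit -n argo-workflows --watch short-workflow.yaml
# once the workflow pod shows Completed, look for failed scrapes;
# these lines only appear at debug log level, as in the log above
kubectl logs -n linkerd-viz deploy/prometheus -c prometheus -f | grep -i "scrape failed"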
Logs, error output, etc
See above.
output of linkerd check -o short
% linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
unsupported version channel: stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane and cli versions match
control plane running edge-24.11.3 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
linkerd-destination-5ddc58f9bc-5x9nh running edge-24.11.3 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints
linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints
linkerd-viz
-----------
‼ viz extension proxies and cli versions match
metrics-api-5789bcc5d-2zdck running edge-24.11.3 but cli running stable-2.14.10
see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
Server Version: v1.29.8-eks-a737599
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
Thanks for raising this, @bwmetcalf. Linkerd-viz is bundled with an off-the-shelf Prometheus instance, which does the scraping. Do you know if Prometheus supports configuration not to scrape completed pods?
@adleong I did find prometheus-operator/prometheus-operator#5049, which appears to already be in the release we are running. I originally thought that change was not filtering out pods in Completed status. However, for a completed Argo workflow pod, kubectl get pod shows Completed, while the actual status in the output of kubectl describe pod is Succeeded.
% k get pods -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315
NAME                                                   READY   STATUS      RESTARTS   AGE
downlink-umbra09-tr17-bn5ol-parse-downlink-132249315   0/3     Completed   0          16m
% k describe pod -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315|grep Status:
Status: Succeeded
The PR I referenced does indeed filter out Succeeded pods, so it's not clear whether something else is going on or whether Completed needs to be added to the filter.
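For what it's worth, the Completed shown by kubectl get pods is display logic derived from the container states; the field that Prometheus's Kubernetes service discovery copies into __meta_kubernetes_pod_phase is .status.phase, which for these pods is Succeeded, so a Succeeded-based filter should in principle match them:

% kubectl get pod -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315 -o jsonpath='{.status.phase}{"\n"}'
Succeeded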
Let me dig into this a bit more and update here. I'm happy to work on a fix if I can correctly identify the issue.
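One way to dig is to ask the viz Prometheus targets API directly whether the completed pod is still an active target in the linkerd-proxy pool (a sketch; the deployment name assumes a default linkerd-viz install):

# port-forward the bundled Prometheus, then query its targets endpoint
kubectl -n linkerd-viz port-forward deploy/prometheus 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.scrapePool == "linkerd-proxy")
      | {instance: .labels.instance, health, lastError}'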