
Linkerd viz prometheus attempts to scrape metrics from completed argo workflow pods #13346

Open
bwmetcalf opened this issue Nov 19, 2024 · 3 comments

@bwmetcalf

What is the issue?

If Argo Workflows pods are injected with linkerd-proxy, then once they go into a completed state, the viz Prometheus will still attempt to scrape metrics from them, resulting in a high rate of 504s:

{"caller":"scrape.go:1400","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"linkerd-proxy","target":"http://10.3.136.62:4191/metrics","ts":"2024-11-19T01:47:24.385Z"}

The linkerd-viz Prometheus should be smart enough not to attempt to scrape metrics from completed pods. The Argo server can be configured to keep a number of completed workflow pods around before they are deleted, which is desirable for troubleshooting, for example.
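For reference, Prometheus can drop such targets at service-discovery time with a relabel rule keyed on the pod phase. Below is a minimal sketch of what such a rule could look like for the linkerd-proxy scrape job; the job name and surrounding layout are illustrative and may not match the config that linkerd-viz actually ships.

# Hypothetical scrape job: drop pods whose phase is Succeeded or
# Failed so completed workflow pods are never scraped.
- job_name: linkerd-proxy
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_phase]
      action: drop
      regex: Succeeded|Failed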

How can it be reproduced?

Create a meshed Argo workflow pod; when it completes, Prometheus will try to scrape metrics from the now-unresponsive pod and log a 504. A minimal example workflow is sketched below.
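The following is a minimal sketch of such a workflow, assuming the proxy is injected via the linkerd.io/inject pod annotation; the namespace, names, and image are illustrative.

# Hypothetical workflow: the pod is meshed via podMetadata annotations
# and exits quickly, leaving a Completed pod behind as a scrape target.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: meshed-hello-
  namespace: argo-workflows
spec:
  entrypoint: hello
  podMetadata:
    annotations:
      linkerd.io/inject: enabled
  templates:
    - name: hello
      container:
        image: busybox
        command: ["echo", "hello from a meshed workflow pod"]

Submit it (for example with argo submit) and watch the linkerd-viz Prometheus logs once the pod shows Completed.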

Logs, error output, etc

See above.

output of linkerd check -o short

% linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane and cli versions match
    control plane running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
    linkerd-destination-5ddc58f9bc-5x9nh running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints

linkerd-viz
-----------
‼ viz extension proxies and cli versions match
    metrics-api-5789bcc5d-2zdck running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Server Version: v1.29.8-eks-a737599

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

@bwmetcalf bwmetcalf added the bug label Nov 19, 2024

stale bot commented Feb 17, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 17, 2025
@adleong
Member

adleong commented Feb 26, 2025

Thanks for raising this, @bwmetcalf. Linkerd-viz is bundled with an off-the-shelf Prometheus instance, which does the scraping. Do you know if Prometheus supports configuration to not scrape completed pods?

@stale stale bot removed the wontfix label Feb 26, 2025
@bwmetcalf
Author

@adleong I did find prometheus-operator/prometheus-operator#5049, which appears to already be in the release we are running. I originally thought that this change was not filtering out pods in Completed status. However, for a completed Argo workflow pod, kubectl get pod shows a STATUS of Completed, but the actual status in the output of kubectl describe pod is Succeeded.

% k get pods -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315
NAME                                                   READY   STATUS      RESTARTS   AGE
downlink-umbra09-tr17-bn5ol-parse-downlink-132249315   0/3     Completed   0          16m
% k describe pod -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315|grep Status:
Status:           Succeeded

The PR I referenced does indeed filter out Succeeded pods, so it's not clear whether something else is going on or whether Completed needs to be added to the filter.
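For reference, the STATUS column from kubectl get pods is a display value; Prometheus's Kubernetes service discovery populates __meta_kubernetes_pod_phase from the pod's .status.phase, which can be checked directly (pod name reused from the output above):

% k get pod -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315 -o jsonpath='{.status.phase}'
Succeeded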

Let me dig into this a bit more and update here. I'm happy to work on a fix if I can correctly identify the issue.

Thanks!
