
Linkerd viz prometheus attempts to scrape metrics from completed argo workflow pods #13346

Open
bwmetcalf opened this issue Nov 19, 2024 · 3 comments

@bwmetcalf

What is the issue?

If Argo Workflows pods are injected with linkerd-proxy, then once they go into a completed state, the viz Prometheus will still attempt to scrape metrics from them, resulting in a high rate of 504s:

{"caller":"scrape.go:1400","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"linkerd-proxy","target":"http://10.3.136.62:4191/metrics","ts":"2024-11-19T01:47:24.385Z"}

The linkerd-viz Prometheus should be smart enough not to attempt to scrape metrics from completed pods. The Argo server can be configured to keep a number of completed workflow pods around before they are deleted, which is desirable for troubleshooting, for example.
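For reference, Prometheus can drop such targets at service-discovery time with a relabel rule keyed on the pod phase. Below is a minimal sketch of what such a rule could look like for the linkerd-proxy scrape job; the job name and surrounding layout are illustrative and may not match the config that linkerd-viz actually ships.

# Hypothetical scrape job: drop pods whose phase is Succeeded or
# Failed so completed workflow pods are never scraped.
- job_name: linkerd-proxy
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_phase]
      action: drop
      regex: Succeeded|Failed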

How can it be reproduced?

Create a meshed Argo workflow pod; when it completes, Prometheus will try to scrape metrics from the now-unresponsive pod and log a 504. A minimal example workflow is sketched below.
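The following is a minimal sketch of such a workflow, assuming the proxy is injected via the linkerd.io/inject pod annotation; the namespace, names, and image are illustrative.

# Hypothetical workflow: the pod is meshed via podMetadata annotations
# and exits quickly, leaving a Completed pod behind as a scrape target.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: meshed-hello-
  namespace: argo-workflows
spec:
  entrypoint: hello
  podMetadata:
    annotations:
      linkerd.io/inject: enabled
  templates:
    - name: hello
      container:
        image: busybox
        command: ["echo", "hello from a meshed workflow pod"]

Submit it (for example with argo submit) and watch the linkerd-viz Prometheus logs once the pod shows Completed.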

Logs, error output, etc

See above.

output of linkerd check -o short

% linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane and cli versions match
    control plane running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
    linkerd-destination-5ddc58f9bc-5x9nh running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints

linkerd-viz
-----------
‼ viz extension proxies and cli versions match
    metrics-api-5789bcc5d-2zdck running edge-24.11.3 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Server Version: v1.29.8-eks-a737599

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

@bwmetcalf bwmetcalf added the bug label Nov 19, 2024

stale bot commented Feb 17, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 17, 2025
@adleong
Member

adleong commented Feb 26, 2025

Thanks for raising this, @bwmetcalf. Linkerd-viz is bundled with an off-the-shelf Prometheus instance, which does the scraping. Do you know if Prometheus supports configuration to not scrape completed pods?

@stale stale bot removed the wontfix label Feb 26, 2025
@bwmetcalf
Author

@adleong I did find prometheus-operator/prometheus-operator#5049, which appears to already be in the release we are running. I originally thought that this change was not filtering out pods in Completed status. However, for a completed Argo workflow pod, kubectl get pod shows a STATUS of Completed, but the actual status in the output of kubectl describe pod is Succeeded.

% k get pods -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315
NAME                                                   READY   STATUS      RESTARTS   AGE
downlink-umbra09-tr17-bn5ol-parse-downlink-132249315   0/3     Completed   0          16m
% k describe pod -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315|grep Status:
Status:           Succeeded

The PR I referenced does indeed filter out Succeeded pods, so it's not clear whether something else is going on or whether Completed needs to be added to the filter.
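For reference, the STATUS column from kubectl get pods is a display value; Prometheus's Kubernetes service discovery populates __meta_kubernetes_pod_phase from the pod's .status.phase, which can be checked directly (pod name reused from the output above):

% k get pod -n argo-workflows downlink-umbra09-tr17-bn5ol-parse-downlink-132249315 -o jsonpath='{.status.phase}'
Succeeded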

Let me dig into this a bit more and update here. I'm happy to work on a fix if I can correctly identify the issue.

Thanks!
