oauth-proxy's CPU limit is far too low #62
fcami added a commit to fcami/modelmesh-serving that referenced this issue (Feb 2, 2023):
With a CPU limit of 100m, oauth-proxy seems unable to cope with the load associated with its own liveness probes, leading to the model mesh pods being restarted every so often once a certain number of routes (inference services) are created.

Raise the CPU limit from 100m to 2.
Raise the CPU request from 100m to 0.5.

Related-to: opendatahub-io#62
Related-to: opendatahub-io#16
Signed-off-by: François Cami <[email protected]>
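For reference, the change amounts to editing the resources stanza of the oauth-proxy container. A minimal sketch of the resulting values, assuming a standard Kubernetes container spec (the exact manifest location within modelmesh-serving may differ):

```yaml
# Sketch only: oauth-proxy container resources after the change.
# Only the CPU fields are shown; memory settings are left untouched.
containers:
  - name: oauth-proxy
    resources:
      requests:
        cpu: 500m   # raised from 100m (0.5 CPU)
      limits:
        cpu: "2"    # raised from 100m
```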
heyselbi added this to Internal tracking, ODH Feature Tracking and ODH Model Serving Planning (Oct 4, 2023).
@fcami is this still an issue? What's the reason for a higher limit for oauth-proxy?

The reason is explained in the original post.

Maybe we can increase the limit a little bit, to about 500m.
Describe the bug
oauth-proxy's CPU limit is far too low for it to answer its liveness probes in a timely fashion once a certain number of routes are created.
To Reproduce
Steps to reproduce the behavior:
Deploy about 80-120 inference models with routes.
A config that ALWAYS reproduces the problem: 6 namespaces, each with 2 model mesh pods, and 800 inference models per namespace (a hedged reproduction sketch follows the config below):
MINIO_MODEL_COUNT=800
NS_COUNT=6
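A hypothetical sketch of the reproduction loop, assuming the models have already been uploaded to MinIO and a working storage-config secret exists in each namespace. The namespace naming scheme, model format, and storage URI here are illustrative, not the exact scripts used for this report:

```bash
#!/usr/bin/env bash
# Sketch: create NS_COUNT namespaces' worth of InferenceServices,
# MINIO_MODEL_COUNT per namespace, all routed through oauth-proxy.
NS_COUNT=6
MINIO_MODEL_COUNT=800

for ns in $(seq 1 "$NS_COUNT"); do
  for m in $(seq 1 "$MINIO_MODEL_COUNT"); do
    cat <<EOF | kubectl apply -n "modelmesh-ns-${ns}" -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-${m}
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn            # illustrative format
      storageUri: s3://models/sklearn/model-${m}   # illustrative path
EOF
  done
done
```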
Expected behavior
All model mesh pods deployed, all inference services routes created, everything ready to service inference requests.
Actual behavior
oauth-proxy liveness probes are missed, which leads to the oauth-proxy container being killed and restarted, and hence to the model mesh pods restarting every so often. In fact, all model mesh instances (pods) are unstable, due to the oauth-proxy container failing its liveness probes.
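The probe failures and restarts should be observable with standard kubectl commands; the namespace and pod names below are placeholders:

```bash
# Watch restart counts climb on the modelmesh pods (placeholder namespace).
kubectl get pods -n modelmesh-ns-1 -w

# Liveness probe failures surface as Unhealthy events on the affected pod.
kubectl describe pod <modelmesh-pod> -n modelmesh-ns-1 | grep -A2 Unhealthy
kubectl get events -n modelmesh-ns-1 --field-selector reason=Unhealthy
```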
Environment (please complete the following information):
ODH
Additional context
The CPU limit (100m) is probably set far too low for an SSL-terminated endpoint.
Since flooding the endpoint might be enough to cause system instability and therefore a DoS, setting a much higher limit (or no CPU limit at all) might be better; a sketch of the no-limit option follows.
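One way to express the no-limit option while keeping scheduling predictable is to set a CPU request but omit the CPU limit. A sketch, with an illustrative request value:

```yaml
# Sketch only: request guarantees scheduling capacity; with no CPU limit,
# the container may burst to whatever is free on the node under load spikes.
containers:
  - name: oauth-proxy
    resources:
      requests:
        cpu: 500m   # illustrative value
      # no cpu limit set
```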