oauth-proxy's CPU limit is far too low #62
fcami added a commit to fcami/modelmesh-serving that referenced this issue (Feb 2, 2023):
With a CPU limit of 100m, oauth-proxy seems unable to cope with the load associated with its own liveness probes, leading to the model mesh pods being restarted every so often once a certain number of routes (inference services) are created.

Raise the CPU limit from 100m to 2.
Raise the CPU request from 100m to 0.5.

Related-to: opendatahub-io#62
Related-to: opendatahub-io#16
Signed-off-by: François Cami <[email protected]>
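For reference, the change amounts to editing the resources stanza of the oauth-proxy container. A minimal sketch of the resulting values, assuming a standard Kubernetes container spec (the exact manifest location within modelmesh-serving may differ):

```yaml
# Sketch only: oauth-proxy container resources after the change.
# Only the CPU fields are shown; memory settings are left untouched.
containers:
  - name: oauth-proxy
    resources:
      requests:
        cpu: 500m   # raised from 100m (0.5 CPU)
      limits:
        cpu: "2"    # raised from 100m
```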
heyselbi added this to Internal tracking, ODH Feature Tracking and ODH Model Serving Planning (Oct 4, 2023).
@fcami is this still an issue? What's the reason for a higher limit for oauth-proxy?

The reason is explained in the original post.

Maybe we can increase the limit a little bit, to about 500m.
Describe the bug
oauth-proxy's CPU limit is far too low for it to answer its liveness probes in a timely fashion once a certain number of routes are created.
To Reproduce
Steps to reproduce the behavior:
Deploy about 80-120 inference models with routes.
A config that ALWAYS reproduces the problem: 6 namespaces, each with 2 model mesh pods, and 800 inference models per namespace (a hedged reproduction sketch follows the config below):
MINIO_MODEL_COUNT=800
NS_COUNT=6
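A hypothetical sketch of the reproduction loop, assuming the models have already been uploaded to MinIO and a working storage-config secret exists in each namespace. The namespace naming scheme, model format, and storage URI here are illustrative, not the exact scripts used for this report:

```bash
#!/usr/bin/env bash
# Sketch: create NS_COUNT namespaces' worth of InferenceServices,
# MINIO_MODEL_COUNT per namespace, all routed through oauth-proxy.
NS_COUNT=6
MINIO_MODEL_COUNT=800

for ns in $(seq 1 "$NS_COUNT"); do
  for m in $(seq 1 "$MINIO_MODEL_COUNT"); do
    cat <<EOF | kubectl apply -n "modelmesh-ns-${ns}" -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-${m}
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn            # illustrative format
      storageUri: s3://models/sklearn/model-${m}   # illustrative path
EOF
  done
done
```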
Expected behavior
All model mesh pods deployed, all inference services routes created, everything ready to service inference requests.
Actual behavior
oauth-proxy liveness probes are missed, which leads to the oauth-proxy container being killed and restarted, and hence to the model mesh pods restarting every so often. In fact, all model mesh instances (pods) are unstable, due to the oauth-proxy container failing its liveness probes.
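The probe failures and restarts should be observable with standard kubectl commands; the namespace and pod names below are placeholders:

```bash
# Watch restart counts climb on the modelmesh pods (placeholder namespace).
kubectl get pods -n modelmesh-ns-1 -w

# Liveness probe failures surface as Unhealthy events on the affected pod.
kubectl describe pod <modelmesh-pod> -n modelmesh-ns-1 | grep -A2 Unhealthy
kubectl get events -n modelmesh-ns-1 --field-selector reason=Unhealthy
```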
Environment (please complete the following information):
ODH
Additional context
The CPU limit (100m) is probably set far too low for an SSL-terminated endpoint.
Since flooding the endpoint might be enough to cause system instability and therefore a DoS, setting a much higher limit (or no CPU limit at all) might be better; a sketch of the no-limit option follows.
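One way to express the no-limit option while keeping scheduling predictable is to set a CPU request but omit the CPU limit. A sketch, with an illustrative request value:

```yaml
# Sketch only: request guarantees scheduling capacity; with no CPU limit,
# the container may burst to whatever is free on the node under load spikes.
containers:
  - name: oauth-proxy
    resources:
      requests:
        cpu: 500m   # illustrative value
      # no cpu limit set
```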