HPA recipe for AI inference server using custom metrics #570
base: master
Conversation
/assign @janetkuo
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: seans3. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for adding this practical example for autoscaling!
ai/vllm-deployment/hpa/README.md (outdated)

> ## Prerequisites
>
> This guide assumes you have a running Kubernetes cluster and `kubectl` installed. The vLLM server will be deployed in the `default` namespace, and the Prometheus and HPA resources will be in the `monitoring` namespace.
Just noticed that the `default` namespace is used for vLLM. Kubernetes best practice is to avoid deploying applications in the `default` namespace; using it for actual workloads can lead to significant operational and security challenges as cluster usage grows.
This PR is already pretty big; should I change the vLLM deployment in a separate PR to use a namespace (then return to this)? Or should I fix it in this PR?
I have updated the parent directory vLLM deployment to install to a non-default namespace. I have updated all the configuration and instructions in this PR to reflect that change. Please have a look.
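For context, the change boils down to giving the workload a dedicated namespace and pointing every manifest at it. A minimal sketch, assuming a hypothetical namespace name `vllm` (the PR may use a different one):

```yaml
# Hypothetical sketch: a dedicated namespace for the vLLM workload.
# The name "vllm" is illustrative only, not necessarily what the PR uses.
apiVersion: v1
kind: Namespace
metadata:
  name: vllm
```

Each workload manifest (Deployment, Service, HPA, etc.) then sets `metadata.namespace: vllm` explicitly instead of falling into `default`.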
ai/vllm-deployment/hpa/README.md (outdated)

            ▼
    ┌────────────────┐
    │ PrometheusRule │
    └────────────────┘
Some suggestions for the diagram to make it clearer:
- Use numbered steps and arrow directions to guide the user through the precise data flow (scrape, evaluate, record, query, scale) from start to finish.
- The flow hides the crucial transformation step where a raw metric is converted into a processed metric. Recommend clearly labeling the initial scrape with the raw DCGM metric name and the adapter's query with the new, processed metric name (see the sketch after this list).
- The PrometheusRule is shown as a final step in the "GPU Path Only" flow. However, the PrometheusRule is not a destination for data; it's a configuration that tells the Prometheus Server how to perform an internal calculation.
- Include the Kubernetes API Server between the adapter and the HPA.
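To make the raw-to-processed transformation step concrete, the PrometheusRule would contain a recording rule of roughly this shape. This is a sketch only: the recorded metric name, labels, and aggregation are assumptions rather than the PR's exact rule; `DCGM_FI_DEV_GPU_UTIL` is the DCGM exporter's standard per-GPU utilization gauge.

```yaml
# Sketch of a recording rule: Prometheus evaluates the expression and stores the
# result under a new, HPA-friendly metric name. Names below are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-rules
  namespace: monitoring
spec:
  groups:
  - name: gpu.rules
    rules:
    # Raw scrape: DCGM_FI_DEV_GPU_UTIL, the exporter's per-GPU utilization gauge.
    # Recorded series: averaged per pod so the adapter can attach it to pods.
    - record: dcgm_gpu_utilization
      expr: avg by (namespace, pod) (DCGM_FI_DEV_GPU_UTIL)
```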
Just noticed that GitHub supports mermaid diagrams in markdown: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams#creating-mermaid-diagrams
Might be easier to edit than ASCII diagrams.
Great call. I've added a mermaid diagram, and I believe it now addresses the issues you raised. Please let me know what you think.
ai/vllm-deployment/hpa/README.md (outdated)

> ## II. HPA for vLLM AI Inference Server using NVidia GPU metrics
>
> [vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)
We could discuss the trade-offs between these 2 metrics options here, and how to combine multiple metrics for robustness (e.g., scale up if either the number of running requests exceeds a certain threshold or GPU utilization spikes).
I've added significant documentation now addressing the metric trade-offs as well as the combination of multiple metrics. Please let me know what you think.
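As a rough illustration of the combined approach (the HPA controller computes a desired replica count for each metric independently and scales to the largest), a two-metric spec could look like the following. Metric names and thresholds here are assumptions for illustration, not necessarily what the PR uses.

```yaml
# Sketch: one HPA driven by two custom metrics. Scale-up triggers if EITHER the
# per-pod running-request count or the per-pod GPU utilization exceeds its target.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_running   # illustrative adapter-exposed name for vLLM's running-requests gauge
      target:
        type: AverageValue
        averageValue: "20"
  - type: Pods
    pods:
      metric:
        name: dcgm_gpu_utilization        # illustrative processed GPU utilization metric
      target:
        type: AverageValue
        averageValue: "80"
```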
    averageValue: 20
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 0
We can discuss the trade-offs here, e.g. the risk of over-scaling vs. highly volatile workloads where immediate scale-up is critical to maintain performance and responsiveness.
Added comments in the YAML describing the trade-offs for the scale-up and scale-down behavior (also covering the scale-down behavior for the vLLM HPA). Please let me know what you think.
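For readers of the thread, annotated behavior along these lines captures the trade-off (a sketch; the exact windows and policies in the PR may differ):

```yaml
# Sketch of commented scaling behavior illustrating the trade-offs discussed above.
behavior:
  scaleUp:
    # 0 = react immediately to spikes. Good for volatile inference traffic, but it
    # risks over-scaling on short bursts that would have subsided on their own.
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 2
      periodSeconds: 60     # cap growth at 2 pods per minute to bound over-scaling
  scaleDown:
    # A long window avoids thrashing (and costly GPU pod churn / model reloads),
    # at the price of holding extra capacity a few minutes after load drops.
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60     # remove at most half the pods per minute
```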
    # the labels on the 'gke-managed-dcgm-exporter' Service.
    selector:
      matchLabels:
        app.kubernetes.io/name: gke-managed-dcgm-exporter
Does the label value need to be GKE specific? Can this be more generic?
GKE gives the user this DCGM exporter for free, since it's always present on NVIDIA GPU nodes, but on the other two major cloud providers the user has to install it. Trying to install it manually on GKE, however, causes conflicts with the exporter already on the nodes. I've added comments about how other cloud providers need to install it, and I've called out the GKE-specific namespace/labels. Please let me know what you think.
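For non-GKE clusters where the DCGM exporter is installed manually, the ServiceMonitor just needs to match whatever labels that installation puts on its Service. A sketch under those assumptions (label value, namespace, and port name are all placeholders that depend on how the exporter was installed):

```yaml
# Sketch: ServiceMonitor for a manually installed DCGM exporter (non-GKE clusters).
# Label value, namespace, and port name are illustrative; check your own install.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  namespaceSelector:
    matchNames:
    - gpu-monitoring
  endpoints:
  - port: metrics
    interval: 30s
```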
This PR extends the existing vLLM server example by introducing a complete Horizontal Pod Autoscaling (HPA) solution, contained within a new `hpa/` directory. This provides a production-ready pattern for automatically scaling the AI inference server based on real-time demand. Two distinct autoscaling methods are provided: one driven by vLLM server metrics and one driven by NVIDIA GPU metrics.
How It Works
The solution uses a standard Prometheus-based monitoring pipeline. Prometheus (managed by the Prometheus Operator) scrapes metrics from either the vLLM server or the NVIDIA DCGM exporter. For GPU metrics, a PrometheusRule is used to relabel the raw data, making it compatible with the HPA. The Prometheus Adapter then exposes these metrics through the Kubernetes Custom Metrics API, which the HPA controller consumes to drive scaling decisions.
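The adapter step in that pipeline is driven by a rules config that maps Prometheus series onto Custom Metrics API names. A minimal sketch: `vllm:num_requests_running` is the server's standard running-requests gauge, while the exported name and query below are assumptions, not necessarily the PR's exact rule.

```yaml
# Sketch of a prometheus-adapter rule: expose the vLLM running-requests gauge to
# the Custom Metrics API under a name the HPA can reference. Illustrative only.
rules:
- seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^vllm:num_requests_running$"
    as: "vllm_num_requests_running"
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```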
What's New
- New `hpa/` directory: contains all new manifests and documentation.
- Support for the GPU example.
How to Test
Detailed instructions and verification steps are available in the new guides.