# Proposal for Multi-Cluster Inference Gateways #1374
/cc @robscott |
/ok-to-test |
@bexxmodd can you please remove the .DS_Store files? |
## Summary
Inference Gateways aim to provide efficient routing to LLM workloads running in Kubernetes. In practice, an Inference Gateway is a Gateway that conforms to the [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/). This Gateway supports a new type of backend - InferencePool. When routing to an [InferencePool](https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool/), the Gateway calls out to an “Endpoint Picker” referenced by the InferencePool to get instructions on which specific endpoint within the pool it should reference.
Ref to the conformance section instead of the overall docs site: https://gateway-api-inference-extension.sigs.k8s.io/concepts/conformance/
s/ it should reference./ it should route the request to.
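To make the routing model above concrete, here is a minimal sketch of an HTTPRoute whose backend is an InferencePool. The route, Gateway, and pool names are hypothetical, and the API versions assume current Gateway API and Inference Extension releases:

```yaml
# Hypothetical HTTPRoute backed by an InferencePool. For requests matching
# this rule, the Gateway consults the pool's Endpoint Picker to choose a
# specific model server endpoint, rather than load-balancing blindly.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: vllm-pool
```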
### Goals
- Enable Inference Gateways to route to backends in multiple clusters.
- Follow a pattern that is familiar to users of Multi-Cluster Services (MCS) and/or Gateways.
Add a link to MCS
### Non-Goals
- Be overly prescriptive about implementation details - this should focus on the resulting UX and leave significant flexibility in how it is achieved.
Each should be a bullet and end with a period.
## Proposal
The multi-cluster Inference Gateway model will largely follow the multi-cluster services model, with a few key differences. We will omit DNS and ClusterIP resolution, and avoid a separate resource, e.g. ServiceExport, by inlining the concept within InferencePool. Additionally, we will add support for having separate Endpoint Pickers in each cluster.
> we will add support for having separate Endpoint Pickers in each cluster.

Can you provide additional context here? Separate as in separate from the EPP ref’d by an InferencePool with an export annotation?
#### InferencePool
A new `inference.networking.k8s.io/export` annotation is added to InferencePool (a replacement for the ServiceExport resource in MCS). In the future this may become a field, but we’ll start with an annotation to allow for faster iteration. [We’ll avoid using a bool here, to align with k8s API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#primitive-types). The supported values to start will be `Local` and `ClusterSet`. In the future, we may allow for some intermediate values such as `Regional` or domain-prefixed values.
We should only start with `ClusterSet`, until a use case for `Local` or any other supported value is accepted.
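For concreteness, here is a minimal sketch of an exported InferencePool under this proposal. The pool name, selector, and EPP reference are hypothetical and the spec fields incidental; the only proposed addition is the annotation:

```yaml
# Hypothetical InferencePool opted in to ClusterSet-wide export via the
# proposed annotation; everything else is a normal single-cluster pool.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-pool
  annotations:
    inference.networking.k8s.io/export: ClusterSet
spec:
  selector:
    matchLabels:
      app: vllm
  targetPorts:
    - number: 8000
  endpointPickerRef:
    name: vllm-pool-epp
```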
1. Endpoint Pickers
1. Model Server Endpoints
- We should also consider routing to a remote EPP through a remote cluster Gateway (see the original design doc appendix).
- Why does InferencePoolImport need to know about the model server endpoints in a remote cluster? I would expect the local Gateway to route to remote EPPs or through a remote Gateway based on one of the following conditions:
  - A local InferencePool exists and no local GPUs are available, e.g. EPP returns a 503/429.
  - A local InferencePool exists and the local Gateway decides the request is better served by an InferencePoolImport.
  - No local InferencePool exists, but an InferencePoolImport exists with a status that indicates available GPU resources.

Note: The EPP protocol spec should be updated when this design is finalized (please create a tracker issue).
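To ground this discussion, here is one hypothetical shape an InferencePoolImport could take, mirroring the MCS ServiceImport pattern. Nothing below is a settled API; the group, version, and every field are illustrative, and the open question above is whether the model server endpoints (item 2) belong in it at all:

```yaml
# Hypothetical InferencePoolImport. A controller would aggregate exported
# InferencePools from other clusters into this resource's status.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePoolImport
metadata:
  name: vllm-pool
status:
  exportingClusters:
    - name: cluster-a
      endpointPicker:              # 1. Endpoint Pickers
        address: epp.cluster-a.internal.example.com
        port: 9002
      endpoints:                   # 2. Model Server Endpoints
        - 10.12.0.7:8000
        - 10.12.0.9:8000
```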
#### Consider FailOpen “Extended” support for multi-cluster
Given the potential complexity of supporting a FailOpen mode for multi-cluster, we could consider this “Extended” or optional support.
+1 on MC failover being "Extended" support.
#### Metrics from model server endpoints
In the case where a Gateway is aware of all model server endpoints, it could theoretically also track metrics for each of these endpoints.
This duplicates the work of the EPP.
#### Metrics from Endpoint Picker
Since Gateways are ultimately deciding which Endpoint Picker to send traffic to, it could make sense for Endpoint Pickers to report load/utilization data back to the Gateway to help inform that decision. (This would reflect the utilization of model server Pods within the local InferencePool managed by each EPP.)
EPP already exposes InferencePool metrics. It will be up to the implementation how to use these metrics to make a routing decision.
## Alternative 1: MCS API for EPP
If we lean into the idea that the only thing a Gateway needs to know is the Endpoint Picker endpoints and what cluster(s) they're associated with, we could build this on top of the MCS API. With this approach, the Endpoint Picker is exposed with a Multi-Cluster Service:
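As a sketch, assuming the MCS API's `multicluster.x-k8s.io/v1alpha1` group and a hypothetical `epp` Service fronting the Endpoint Picker:

```yaml
# Exporting the Endpoint Picker's Service to the ClusterSet; per the MCS
# API, a ServiceExport is just a marker that references the Service by name.
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: epp
  namespace: inference-system
```

Gateways in other clusters could then reach the EPP at the MCS DNS name `epp.inference-system.svc.clusterset.local`, leaving endpoint aggregation to the MCS implementation.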
InferencePool is meant to be a replacement for a Service, so it may seem counterintuitive for a user to create a Service to achieve multi-cluster inference.
***Draft***
## Summary
I've read the proposal. Overall it looks very nice, and at a very high level it could work (not getting into the details).

I think the main part that is still missing here is the motivation. The only motivation mentioned in this doc is that the cluster may run out of resources. I can share that this idea was proposed multiple times internally at IBM (well before GIE), but the answer was always the same: the cluster can be scaled with more resources, and the complexity of spreading across multiple clusters isn't worth it when looking at the tradeoff.

I would try to focus on this point: I think you need to find at least one use case or problem that cannot be solved by scaling the single cluster with more resources. There is no doubt that this proposal adds complexity to GIE, and there should be a real requirement or real use case for us to do that.
Initial design doc: https://docs.google.com/document/d/1QGvG9ToaJ72vlCBdJe--hmrmLtgOV_ptJi9D58QMD2w/edit?usp=sharing