
Proposal for Multi-Cluster Inference Gateways #1374

Open · bexxmodd wants to merge 6 commits into main

Conversation

@bexxmodd (Contributor) commented Aug 14, 2025

netlify bot commented Aug 14, 2025

Deploy Preview for gateway-api-inference-extension ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | d2e274f |
| 🔍 Latest deploy log | https://app.netlify.com/projects/gateway-api-inference-extension/deploys/689e23a345f77200080fb0ff |
| 😎 Deploy Preview | https://deploy-preview-1374--gateway-api-inference-extension.netlify.app |

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bexxmodd
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on Aug 14, 2025
@k8s-ci-robot (Contributor)

Welcome @bexxmodd!

It looks like this is your first PR to kubernetes-sigs/gateway-api-inference-extension 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gateway-api-inference-extension has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the `needs-ok-to-test` label (indicates a PR that requires an org member to verify it is safe to test) on Aug 14, 2025
@k8s-ci-robot (Contributor)

Hi @bexxmodd. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the `size/L` label (denotes a PR that changes 100-499 lines, ignoring generated files) on Aug 14, 2025
@bexxmodd (Contributor, Author)

/cc @robscott

@k8s-ci-robot requested a review from @robscott on August 14, 2025 00:50
@robscott (Member)

/ok-to-test

@k8s-ci-robot added the `ok-to-test` label (indicates a non-member PR verified by an org member that is safe to test) and removed the `needs-ok-to-test` label on Aug 14, 2025
@nirrozenbaum (Contributor)

@bexxmodd can you please remove the .DS_Store files?

@bexxmodd (Contributor, Author) commented Aug 14, 2025

> @bexxmodd can you please remove the .DS_Store files?

Removed.

Also created a PR to gitignore macOS-generated files: #1378


## Summary

Inference Gateways aim to provide efficient routing to LLM workloads running in Kubernetes. In practice, an Inference Gateway is a Gateway that conforms to the [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/). This Gateway supports a new type of backend - InferencePool. When routing to an [InferencePool](https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool/), the Gateway calls out to an “Endpoint Picker” referenced by the InferencePool to get instructions on which specific endpoint within the pool it should reference.
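
To make the routing model concrete, here is a minimal illustrative sketch of an HTTPRoute that sends traffic to an InferencePool backend; the gateway and pool names are hypothetical, and the group/kind follow the Inference Extension API:

```yaml
# Hypothetical example: route traffic on an Inference Gateway to an
# InferencePool backend. The Endpoint Picker referenced by the pool then
# selects the specific model server endpoint for each request.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: vllm-llama3-pool
```
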
Review comment (Contributor):

Ref to the conformance section instead of the overall docs site: https://gateway-api-inference-extension.sigs.k8s.io/concepts/conformance/

Review comment (Contributor):

s/ it should reference./ it should route the request to.

### Goals

- Enable Inference Gateways to route to backends in multiple clusters.
- Follow a pattern that is familiar to users of Multi-Cluster Services (MCS) and/or Gateways.
Review comment (Contributor):

Add a link to MCS


### Non-Goals

- Be overly prescriptive about implementation details - this should focus on the resulting UX and leave significant flexibility in how it is achieved.
Review comment (Contributor):

Each should be a bullet and end with a period.


## Proposal

The multi-cluster Inference Gateway model will largely follow the multi-cluster services model, with a few key differences. We will omit DNS and ClusterIP resolution, and avoid a separate resource, e.g. ServiceExport, by inlining the concept within InferencePool. Additionally, we will add support for having separate Endpoint Pickers in each cluster.
Review comment (Contributor):

> we will add support for having separate Endpoint Pickers in each cluster.

Can you provide additional context here? Separate as in separate from the EPP ref'd by an InferencePool with an export annotation?

#### InferencePool

A new `inference.networking.k8s.io/export` annotation is added to InferencePool (a replacement for the ServiceExport resource in MCS). In the future this may become a field, but we'll start with an annotation to allow for faster iteration. [We'll avoid using a bool here to align with k8s API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#primitive-types). The supported values to start will be `Local` and `ClusterSet`. In the future, we may allow for intermediate values such as `Regional` or domain-prefixed values.
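
For illustration, a minimal sketch of an exported InferencePool. The annotation name and the `ClusterSet` value come from this proposal; the pool name and the spec fields are illustrative and may differ from the published API:

```yaml
# Sketch only: an InferencePool opted in to ClusterSet-wide export via the
# proposed annotation. Spec fields below are illustrative, not normative.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-llama3-pool
  annotations:
    inference.networking.k8s.io/export: ClusterSet
spec:
  selector:
    matchLabels:
      app: vllm-llama3
  targetPorts:
    - number: 8000
  endpointPickerRef:
    name: vllm-llama3-epp
```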

Review comment (Contributor):

We should only start with ClusterSet until a use case for Local or any other supported value is accepted.


1. Endpoint Pickers
1. Model Server Endpoints
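
Purely as a thought experiment (no InferencePoolImport schema is defined in this excerpt), an import surfacing the exported Endpoint Pickers might look something like the following; every name and field here is hypothetical:

```yaml
# Hypothetical only: invented for discussion, not a defined API.
apiVersion: inference.networking.k8s.io/v1alpha1
kind: InferencePoolImport
metadata:
  name: vllm-llama3-pool   # mirrors the exported InferencePool's name
status:
  exportingClusters:
    - name: cluster-a
      endpointPicker:
        address: 10.10.0.15  # reachable EPP address in cluster-a
        port: 9002
    - name: cluster-b
      endpointPicker:
        address: 10.20.0.22
        port: 9002
```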

@danehans (Contributor) commented Aug 14, 2025

- We should also consider routing to a remote EPP through a remote cluster Gateway (see the original design doc appendix).

- Why does InferencePoolImport need to know about the model server endpoints in a remote cluster? I would expect the local Gateway to route to remote EPPs or through a remote Gateway based on one of the following conditions:

  - A local InferencePool exists and no local GPUs are available, e.g. EPP returns a 503/429.
  - A local InferencePool exists and the local Gateway decides the request is better served by an InferencePoolImport.
  - No local InferencePool exists but an InferencePoolImport exists with a status that indicates available GPU resources.

Note: The EPP protocol spec should be updated when this design is finalized (please create a tracker issue).


#### Consider FailOpen “Extended” support for multi-cluster

Given the potential complexity of supporting a FailOpen mode for multi-cluster, we could consider this “Extended” or optional support.
Review comment (Contributor):

+1 on MC failover being "Extended" support.


#### Metrics from model server endpoints

In the case where a Gateway is aware of all model server endpoints, it could theoretically also track metrics for each of these endpoints.
Review comment (Contributor):

This duplicates the work of the EPP.


#### Metrics from Endpoint Picker

Since Gateways are ultimately deciding which Endpoint Picker to send traffic to, it could make sense for Endpoint Pickers to report back load/utilization data to the Gateway to help inform that decision. (This would reflect the utilization of model server Pods within the local InferencePool managed by each EPP).
Review comment (Contributor):

EPP already exposes InferencePool metrics. It will be up to the implementation on how to use these metrics to make a routing decision.


## Alternative 1: MCS API for EPP

If we lean into the idea that the only thing a Gateway needs to know is the Endpoint Picker endpoints and what cluster(s) they're associated with, we could build this on top of the MCS API. With this approach, the Endpoint Picker is exposed with a Multi-Cluster Service:
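
A minimal sketch of the idea, using the standard MCS ServiceExport resource; the Service name, namespace, and port are hypothetical:

```yaml
# Sketch of Alternative 1: expose the Endpoint Picker across the ClusterSet
# by pairing its Service with an MCS ServiceExport of the same name.
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-epp
  namespace: inference
spec:
  selector:
    app: vllm-llama3-epp
  ports:
    - port: 9002        # ext-proc gRPC port the Gateway calls; illustrative
      protocol: TCP
---
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: vllm-llama3-epp  # must match the Service name to export it
  namespace: inference
```
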
Review comment (Contributor):

InferencePool is meant to be a replacement for a Service so it may seem counterintuitive for a user to create a Service to achieve multi-cluster inference.


***Draft***

## Summary
@nirrozenbaum (Contributor) commented Aug 17, 2025

I've read the proposal. Overall it looks very nice, and at a very high level it could work (not getting into the details).
I think the main part that is still missing here is the motivation.
The only motivation mentioned in this doc is that the cluster may run out of resources.

I can share that this idea was proposed multiple times internally at IBM (well before GIE), but the answer was always the same - the cluster can be scaled with more resources, and the complexity of spreading across multiple clusters isn't worth the tradeoff.

I would try to focus on this point - I think you need to find at least one use case or problem that cannot be solved by scaling a single cluster with more resources.

There is no doubt that this proposal adds complexity to GIE, and there should be a real requirement or real use case before we do that.
