
feat: initial implementation of Inference Extension #493

Merged
merged 67 commits into main from poc on Apr 2, 2025
Conversation

@mathetake (Member) commented Mar 14, 2025:

Commit Message

This commit scaffolds the foundation for the Inference Extension API [1]. The design documentation was merged in #492. The controller must be started with --enableInferenceExtension=true so that existing controller deployments without the Inference Extension CRDs installed are not broken.

This commit doesn't implement the actual "metrics-aware" load balancing; instead it just does random routing over the given (resolved) endpoints. Follow-up implementations will add more advanced algorithms while expanding the metrics interface, which currently only provides setter APIs.

The summary of the implementation is:

  • Added a kind field to AIGatewayRouteRuleBackendRef so that it can reference InferencePool.
  • InferencePool.Spec.Selector is allowed to select multiple AIServiceBackends.
  • When building up the extproc config via filterapi.Config, the controller reads the referenced InferencePool and its bound InferenceModels, and groups them together into a single filterapi.DynamicLoadBalancing configuration.
  • When the extproc loads a configuration containing DynamicLoadBalancing, it resolves all the IP addresses for hostnames belonging to the DynamicLoadBalancing. The presence of DynamicLoadBalancing in the config forces the config watcher to reload and refresh the config regardless of updates. That way, the list of IP addresses is always kept up to date (it is eventually consistent anyway) outside the hot path.
  • On the request path, the ChatCompletionProcessor checks whether a DynamicLoadBalancing config exists for the backend selected by the router. If one exists, it further performs ip:port-level endpoint selection (a sketch follows after the footnote below).
  • The selected ip:port is set in a special header that is routed to an ORIGINAL_DST cluster.
  • The ORIGINAL_DST cluster is added by the EG extension server implementation. The extension server also modifies some routes to properly route to that cluster.

1: https://github.com/kubernetes-sigs/gateway-api-inference-extension
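
For illustration, a minimal sketch of the request-path behavior described in the last three bullets: the processor randomly picks one resolved endpoint and writes it into a header consumed by an ORIGINAL_DST cluster. The type, function, and header names here are assumptions for the sketch, not the identifiers used in this PR.

package main

import (
	"fmt"
	"math/rand"
)

// endpoint is an illustrative stand-in for a resolved ip:port pair.
type endpoint struct {
	ip   string
	port int32
}

// selectEndpoint randomly picks one of the resolved endpoints, mirroring the
// plain random routing this commit implements before metrics-aware balancing.
func selectEndpoint(endpoints []endpoint) (endpoint, error) {
	if len(endpoints) == 0 {
		return endpoint{}, fmt.Errorf("no resolved endpoints available")
	}
	return endpoints[rand.Intn(len(endpoints))], nil
}

func main() {
	eps := []endpoint{{ip: "10.0.0.1", port: 8080}, {ip: "10.0.0.2", port: 8080}}
	ep, err := selectEndpoint(eps)
	if err != nil {
		panic(err)
	}
	// The chosen ip:port is written into the special header that the
	// ORIGINAL_DST cluster reads; the header name below is a placeholder.
	fmt.Printf("x-example-original-dst: %s:%d\n", ep.ip, ep.port)
}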

Related Issues/PRs (if applicable)

Built on #492
Contributes to #423

Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

cc @yuzisun

Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

Obviously not working yet, but I think this shows how this can be implemented here with minimal changes.

@mathetake (Member Author) commented Mar 14, 2025:

One big missing piece is generating a special Envoy cluster with a new LB policy attached; that could be an EnvoyPatchPolicy resource used at a global level. (A sketch of such a cluster follows.)
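
(For context: per the description above, the merged implementation ultimately adds this cluster via the EG extension server rather than an EnvoyPatchPolicy. Either way, a sketch of such a cluster built with go-control-plane types might look like the following; the cluster name and header name are placeholders, not the actual values used.)

package main

import (
	"time"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	"google.golang.org/protobuf/types/known/durationpb"
)

// buildOriginalDstCluster sketches an ORIGINAL_DST cluster whose upstream
// destination is taken from a request header set by the extproc.
func buildOriginalDstCluster() *clusterv3.Cluster {
	return &clusterv3.Cluster{
		Name:                 "example_original_dst", // placeholder name
		ConnectTimeout:       durationpb.New(10 * time.Second),
		ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_ORIGINAL_DST},
		// ORIGINAL_DST clusters must use the CLUSTER_PROVIDED LB policy.
		LbPolicy: clusterv3.Cluster_CLUSTER_PROVIDED,
		LbConfig: &clusterv3.Cluster_OriginalDstLbConfig_{
			OriginalDstLbConfig: &clusterv3.Cluster_OriginalDstLbConfig{
				// Read the destination from a header rather than the
				// connection's original destination address.
				UseHttpHeader:  true,
				HttpHeaderName: "x-example-original-dst", // placeholder header
			},
		},
	}
}

func main() {
	_ = buildOriginalDstCluster()
}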

@mathetake (Member Author):

OK, getting back to this...

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake changed the title from "do not merge very minimum poc" to "feat: initial implementation of Inference Extension" on Mar 21, 2025
Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

Instead of allowing AIServiceBackend.BackendRef to reference InferencePool, it seems semantically more correct to allow AIGatewayRouteRuleBackendRef to reference it, so I will tweak the API.

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

OK, the control plane is almost done.

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

On Monday I will continue backfilling the unit tests for the control plane, and then the extproc part.

@mathetake (Member Author):

almost there

@mathetake (Member Author):

Sorry, the PR got larger than I expected, so please refer to the PR description for the overview before diving in.

Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author) commented Mar 26, 2025:

One thing to consider: I think it also makes sense to do the DNS resolution during reconciliation and return Requeue: true (like the OIDC refresh) when a DNS refresh is necessary, rather than letting the extproc do it. We can refactor it that way later, as it's an implementation detail anyway (see the sketch below).
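
A rough sketch of that alternative, assuming a controller-runtime reconciler; the function, interval, and wiring are all hypothetical, not code from this PR:

package main

import (
	"context"
	"net"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// dnsRefreshInterval is an arbitrary illustrative refresh period.
const dnsRefreshInterval = 30 * time.Second

// reconcileDNS resolves hostnames during reconciliation and requeues so the
// addresses are refreshed periodically, OIDC-refresh style.
func reconcileDNS(ctx context.Context, hostnames []string) (ctrl.Result, error) {
	for _, h := range hostnames {
		addrs, err := net.DefaultResolver.LookupHost(ctx, h)
		if err != nil {
			// Returning the error lets controller-runtime requeue with backoff.
			return ctrl.Result{}, err
		}
		_ = addrs // a real reconciler would write these into the extproc config
	}
	// Requeue unconditionally so the resolved IPs stay (eventually) fresh.
	return ctrl.Result{RequeueAfter: dnsRefreshInterval}, nil
}

func main() {
	_, _ = reconcileDNS(context.Background(), []string{"example.com"})
}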

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
@missBerg (Contributor) left a comment:

Thanks for all the hard work @mathetake 🙌

@yuzisun (Contributor) commented Mar 26, 2025:

Thanks @mathetake ! I am going to take a look tonight.

@mathetake (Member Author):

No rush!

@kfswain commented Mar 26, 2025:

I should also be able to do a review in the next 24h. Thanks for this!

	AIGatewayRouteRuleBackendRefAIServiceBackend AIGatewayRouteRuleBackendRefKind = "AIServiceBackend"
	// AIGatewayRouteRuleBackendRefInferencePool is the kind of the InferencePool in the Gateway API Inference Extension.
	// https://github.com/kubernetes-sigs/gateway-api-inference-extension
	AIGatewayRouteRuleBackendRefInferencePool AIGatewayRouteRuleBackendRefKind = "InferencePool"
Contributor:

Currently InferencePool is a list of pods, If I want to create a pool of AIServiceBackends, would that be another CR for AIServiceBackendPool?

@mathetake (Member Author):

> Currently InferencePool is a list of pods,

As I commented in the description as well as in the API's comment, InferencePool.Spec.Selector will be used to select AIServiceBackends, not pods. That is allowed in the API spec of InfExt, and I think that works as you intended in your comment?

Contributor:

Ah, I saw that in your examples now; that's a pretty neat way to unify the ingress and egress use cases!

Contributor:

If dynamic load balancing becomes a core feature of the Envoy AI Gateway, then this InferencePool API will be part of the core AI Gateway API.

@mathetake (Member Author) commented Mar 27, 2025:

Yeah, that's a good point and my concern as well. I think the question is where we enforce the cluster-level load balancing functionality (!= endpoint-level, since the endpoint level cannot do the transformation, auth, etc.). If that won't be part of the InfExt API's scope, then we are good to go: we provide cluster-level dynamic load balancing in our API, and InferencePool only does the endpoint-level stuff.

Comment:

Is there a concern with relying on the InfPool API? Our intent is to keep it simple and flexible. (As it was used here: we use it for pods in the reference implementation, but as you have done here, any inference endpoint can work so long as it's being selected on.)

@mathetake (Member Author):

No concern at the moment, but we will see.

Signed-off-by: Takeshi Yoneda <[email protected]>
) {
	m, ok := dlb.models[model]
	if !ok {
		err = fmt.Errorf("model %s is not found in the dynamic load balancer", model)
Contributor:

This may not necessarily be an error, as models is optional in dlb; it could be a list of backend hosts.

@mathetake (Member Author) commented Mar 27, 2025:

Yeah, this line corresponds to this line in the reference implementation:

https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/2fed6caf50229ad0ee9ded24ae1c7e8fdb622bcb/pkg/epp/handlers/request.go#L67-L70

So !ok == true here means that we couldn't find the model name in the list of InferenceModel(s) that belong to the InferencePool. I think I should make "models" not omitempty and skip the construction of dlb where there are no models belonging to the InferencePool.

@kfswain left a comment:

Overall LGTM, I left a few comments. But this is awesome!

I'll run through the guide to set this up myself to try it out.

}

// DynamicLoadBalancing corresponds to InferencePool and InferenceModels belonging to the same pool.
type DynamicLoadBalancing struct {
Comment:

OOC, why the indirection of InferenceModel?

Was Backend preexisting and this just the faster method of adoption?

I think it's awesome; I'm actually really excited to see that you all implemented your own extension. Just hoping to learn where our API may have some rough edges.

@mathetake (Member Author):

So this filterapi intentionally decouples the extproc component from k8s/gateway concepts; that's why you see the "indirection" here. The rationale is ease of testing (i.e. running the e2e tests against the extproc) as well as driving adoption/use of the core AI features outside EG (in fact, we are using it in our internal Istio environment).

@mathetake (Member Author):

To be clear, in other words: this is not the configuration/control-plane API, but more of an implementation detail for almost everyone except advanced users who use only the core extproc without the control plane here. A sketch of that standalone use follows below.

good question anyway!
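
To make the decoupling concrete, a standalone consumer could load a filterapi-style config straight from a mounted file with no Kubernetes machinery at all. The struct fields and file path below are invented for illustration and do not match the real filterapi.Config:

package main

import (
	"fmt"
	"os"

	"sigs.k8s.io/yaml"
)

// config mimics the shape of a filterapi-style document; the fields here are
// invented for illustration and do not match the real filterapi.Config.
type config struct {
	Schema   string   `json:"schema"`
	Backends []string `json:"backends"`
}

func main() {
	// The extproc only needs a file (e.g. a mounted ConfigMap) to operate,
	// which is what makes it usable outside Envoy Gateway or Kubernetes.
	raw, err := os.ReadFile("/etc/ai-gateway/extproc.yaml") // illustrative path
	if err != nil {
		panic(err)
	}
	var c config
	if err := yaml.Unmarshal(raw, &c); err != nil {
		panic(err)
	}
	fmt.Printf("loaded %d backends (schema %s)\n", len(c.Backends), c.Schema)
}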

		}
		return err
	}
	if err := c.syncInferencePoolFn(ctx, &inferencePool); err != nil {
Comment:

So these syncs cascade up to syncAIGatewayRoute and eventually call reconcileExtProcConfigMap to reconcile on the InferenceModels.

We expect InferenceModels to potentially have high churn if a user has a large number of LoRA adapters and rolls out new versions regularly. I'm not sure how expensive the complete reconciliation func is, but I'm just calling that out as a potential issue.

@mathetake (Member Author):

Yeah, I am aware of that potential overload of the syncAIGatewayRoute function - it should be possible to refactor it later (I am not an expert on k8s controller implementations, to be clear...).

type DynamicLoadBalancing struct {
	// Models that can be served by this backend. If not matched, the 404 is returned to the client.
	//
	// If multiple models are provided, the request is routed to the backend based on the weights, criticality, etc.
Contributor:

We also want fallback between multiple models; for example, if a provisioned throughput model does not have extra capacity, we fall back to an on-demand model. In this case routing is based on the model name. A concrete example is to route to bedrock-runtime.us-east-1.aws.com and fall back using two different model names, model1-pt and model1-od.

Contributor:

We may need to extend the InferenceModel API to add a fallbackModels field (strawman sketch below).
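
Purely as a strawman for discussion, such a field might look like the following on the InferenceModel spec. fallbackModels does not exist in the Inference Extension API today; every name here is hypothetical.

package api

// InferenceModelSpec sketches a hypothetical extension of the InferenceModel
// spec; the FallbackModels field does not exist in the real API.
type InferenceModelSpec struct {
	// ModelName is the model name that clients put in the request body.
	ModelName string `json:"modelName"`
	// FallbackModels would be tried in order when the primary model has no
	// spare capacity, e.g. ["model1-pt", "model1-od"] for the Bedrock
	// example described above.
	FallbackModels []string `json:"fallbackModels,omitempty"`
}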

@wengyao04 (Contributor) commented Apr 1, 2025:

Is Models and Backends a 1-1 mapping, like

model[i] -> backend[i]?

or

[model0, model1, ..., model_N-1] -> backend[0]
...
[model0, model1, ..., model_N-1] -> backend[N-1]

@mathetake (Member Author):

In this case, N-to-M is the answer: an InferencePool can reference multiple AIServiceBackends via Selector, and one pool can have multiple models.

@mathetake (Member Author):

So the model-name fallback is interesting... I think it's not within the realm of InferencePool at the moment; currently it only has (optional) weight-based random picking. cc @kfswain https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/a13a1239330ffaaafa4d0f948c00cafd106086aa/pkg/epp/handlers/request.go#L71-L72

@mathetake requested review from @yuzisun and @wengyao04 on April 2, 2025 13:42
@mathetake (Member Author):

Cool, I think this should be almost ready!

@mathetake merged commit ebdf974 into main on Apr 2, 2025
17 checks passed
@mathetake deleted the poc branch on April 2, 2025 16:32
@mathetake (Member Author):

thank you folks!!!!!!
