
feat: initial implementation of Inference Extension #493

Merged
merged 67 commits into main from poc on Apr 2, 2025
Conversation

@mathetake (Member) commented Mar 14, 2025:

Commit Message

This commit scaffolds the foundation for the Inference Extension API [1]. The design documentation was merged in #492. The controller must be started with --enableInferenceExtension=true so that existing controller deployments without the Inference Extension CRDs installed are not broken.

This commit doesn't implement the actual "metrics-aware" load balancing; instead it just does random routing over the given (resolved) endpoints. Follow-up implementations will add more advanced algorithms while expanding the metrics interface, which currently only provides setter APIs.

The summary of the implementation is:

  • Added a kind field to AIGatewayRouteRuleBackendRef so that it can reference InferencePool.
  • InferencePool.Spec.Selector is allowed to select multiple AIServiceBackends.
  • When building up the extproc config via filterapi.Config, the controller reads the referenced InferencePool and its bound InferenceModels, and groups them together into a single filterapi.DynamicLoadBalancing configuration.
  • When the extproc loads a configuration containing DynamicLoadBalancing, it resolves all the IP addresses for hostnames belonging to the DynamicLoadBalancing. The presence of DynamicLoadBalancing in the config forces the config watcher to reload and refresh the config regardless of updates. That way, the list of IP addresses is always kept up to date (it is eventually consistent anyway) outside the hot path.
  • On the request path, the ChatCompletionProcessor checks whether a DynamicLoadBalancing config exists for the backend selected by the router. If one exists, it further performs ip:port-level endpoint selection (a sketch follows after the footnote below).
  • The selected ip:port is set in a special header that is routed to an ORIGINAL_DST cluster.
  • The ORIGINAL_DST cluster is added by the EG extension server implementation. The extension server also modifies some routes to properly route to that cluster.

1: https://github.com/kubernetes-sigs/gateway-api-inference-extension
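
For illustration, a minimal sketch of the request-path behavior described in the last three bullets: the processor randomly picks one resolved endpoint and writes it into a header consumed by an ORIGINAL_DST cluster. The type, function, and header names here are assumptions for the sketch, not the identifiers used in this PR.

package main

import (
	"fmt"
	"math/rand"
)

// endpoint is an illustrative stand-in for a resolved ip:port pair.
type endpoint struct {
	ip   string
	port int32
}

// selectEndpoint randomly picks one of the resolved endpoints, mirroring the
// plain random routing this commit implements before metrics-aware balancing.
func selectEndpoint(endpoints []endpoint) (endpoint, error) {
	if len(endpoints) == 0 {
		return endpoint{}, fmt.Errorf("no resolved endpoints available")
	}
	return endpoints[rand.Intn(len(endpoints))], nil
}

func main() {
	eps := []endpoint{{ip: "10.0.0.1", port: 8080}, {ip: "10.0.0.2", port: 8080}}
	ep, err := selectEndpoint(eps)
	if err != nil {
		panic(err)
	}
	// The chosen ip:port is written into the special header that the
	// ORIGINAL_DST cluster reads; the header name below is a placeholder.
	fmt.Printf("x-example-original-dst: %s:%d\n", ep.ip, ep.port)
}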

Related Issues/PRs (if applicable)

Built on #492
Contributes to #423

Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

cc @yuzisun

Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

Obviously not working yet, but I think this shows how this can be implemented here with minimal changes.

@mathetake (Member Author) commented Mar 14, 2025:

One big missing piece is generating a special Envoy cluster with a new LB policy attached; that could be an EnvoyPatchPolicy resource used at a global level. (A sketch of such a cluster follows.)
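
(For context: per the description above, the merged implementation ultimately adds this cluster via the EG extension server rather than an EnvoyPatchPolicy. Either way, a sketch of such a cluster built with go-control-plane types might look like the following; the cluster name and header name are placeholders, not the actual values used.)

package main

import (
	"time"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	"google.golang.org/protobuf/types/known/durationpb"
)

// buildOriginalDstCluster sketches an ORIGINAL_DST cluster whose upstream
// destination is taken from a request header set by the extproc.
func buildOriginalDstCluster() *clusterv3.Cluster {
	return &clusterv3.Cluster{
		Name:                 "example_original_dst", // placeholder name
		ConnectTimeout:       durationpb.New(10 * time.Second),
		ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_ORIGINAL_DST},
		// ORIGINAL_DST clusters must use the CLUSTER_PROVIDED LB policy.
		LbPolicy: clusterv3.Cluster_CLUSTER_PROVIDED,
		LbConfig: &clusterv3.Cluster_OriginalDstLbConfig_{
			OriginalDstLbConfig: &clusterv3.Cluster_OriginalDstLbConfig{
				// Read the destination from a header rather than the
				// connection's original destination address.
				UseHttpHeader:  true,
				HttpHeaderName: "x-example-original-dst", // placeholder header
			},
		},
	}
}

func main() {
	_ = buildOriginalDstCluster()
}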

@mathetake (Member Author):

OK, getting back to this...

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake changed the title from "do not merge very minimum poc" to "feat: initial implementation of Inference Extension" on Mar 21, 2025
Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

Instead of allowing AIServiceBackend.BackendRef to reference InferencePool, it seems semantically more correct to allow AIGatewayRouteRuleBackendRef to reference it, so I will tweak the API.

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

OK, the control plane is almost done.

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author):

On Monday I will continue backfilling the unit tests for the control plane, and then the extproc part.

@mathetake (Member Author):

almost there

@mathetake (Member Author):

Sorry, the PR got larger than I expected, so please refer to the PR description for the overview before diving in.

Signed-off-by: Takeshi Yoneda <[email protected]>
@mathetake (Member Author) commented Mar 26, 2025:

One thing to consider: I think it also makes sense to do the DNS resolution during reconciliation and return Requeue: true (like the OIDC refresh) when a DNS refresh is necessary, rather than letting the extproc do it. We can refactor it that way later, as it's an implementation detail anyway (see the sketch below).
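
A rough sketch of that alternative, assuming a controller-runtime reconciler; the function, interval, and wiring are all hypothetical, not code from this PR:

package main

import (
	"context"
	"net"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// dnsRefreshInterval is an arbitrary illustrative refresh period.
const dnsRefreshInterval = 30 * time.Second

// reconcileDNS resolves hostnames during reconciliation and requeues so the
// addresses are refreshed periodically, OIDC-refresh style.
func reconcileDNS(ctx context.Context, hostnames []string) (ctrl.Result, error) {
	for _, h := range hostnames {
		addrs, err := net.DefaultResolver.LookupHost(ctx, h)
		if err != nil {
			// Returning the error lets controller-runtime requeue with backoff.
			return ctrl.Result{}, err
		}
		_ = addrs // a real reconciler would write these into the extproc config
	}
	// Requeue unconditionally so the resolved IPs stay (eventually) fresh.
	return ctrl.Result{RequeueAfter: dnsRefreshInterval}, nil
}

func main() {
	_, _ = reconcileDNS(context.Background(), []string{"example.com"})
}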

Signed-off-by: Takeshi Yoneda <[email protected]>
Signed-off-by: Takeshi Yoneda <[email protected]>
@missBerg (Contributor) left a comment:

Thanks for all the hard work @mathetake 🙌

@yuzisun (Contributor) commented Mar 26, 2025:

Thanks @mathetake ! I am going to take a look tonight.

@mathetake (Member Author):

No rush!

@kfswain commented Mar 26, 2025:

I should also be able to do a review in the next 24h. Thanks for this!

	AIGatewayRouteRuleBackendRefAIServiceBackend AIGatewayRouteRuleBackendRefKind = "AIServiceBackend"
	// AIGatewayRouteRuleBackendRefInferencePool is the kind of the InferencePool in the Gateway API Inference Extension.
	// https://github.com/kubernetes-sigs/gateway-api-inference-extension
	AIGatewayRouteRuleBackendRefInferencePool AIGatewayRouteRuleBackendRefKind = "InferencePool"
Contributor:

Currently InferencePool is a list of pods, If I want to create a pool of AIServiceBackends, would that be another CR for AIServiceBackendPool?

@mathetake (Member Author):

> Currently InferencePool is a list of pods,

As I commented in the description as well as in the API's comment, InferencePool.Spec.Selector will be used to select AIServiceBackends, not pods. That is allowed in the API spec of InfExt, and I think that works as you intended in your comment?

Contributor:

Ah, I saw that in your examples now; that's a pretty neat way to unify the ingress and egress use cases!

Contributor:

If dynamic load balancing becomes a core feature of the Envoy AI Gateway, then this InferencePool API will be part of the core AI Gateway API.

@mathetake (Member Author) commented Mar 27, 2025:

Yeah, that's a good point and my concern as well. I think the question is where we enforce the cluster-level load balancing functionality (!= endpoint-level, since the endpoint level cannot do the transformation, auth, etc.). If that won't be part of the InfExt API's scope, then we are good to go: we provide cluster-level dynamic load balancing in our API, and InferencePool only does the endpoint-level stuff.

Comment:

Is there a concern with relying on the InfPool API? Our intent is to keep it simple and flexible. (As it was used here: we use it for pods in the reference implementation, but as you have done here, any inference endpoint can work so long as it's being selected on.)

@mathetake (Member Author):

No concern at the moment, but we will see.

Signed-off-by: Takeshi Yoneda <[email protected]>
) {
	m, ok := dlb.models[model]
	if !ok {
		err = fmt.Errorf("model %s is not found in the dynamic load balancer", model)
Contributor:

This may not necessarily be an error, as models is optional in dlb; it could be a list of backend hosts.

@mathetake (Member Author) commented Mar 27, 2025:

Yeah, this line corresponds to this line in the reference implementation:

https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/2fed6caf50229ad0ee9ded24ae1c7e8fdb622bcb/pkg/epp/handlers/request.go#L67-L70

So !ok == true here means that we couldn't find the model name in the list of InferenceModel(s) that belong to the InferencePool. I think I should make "models" not omitempty and skip the construction of dlb where there are no models belonging to the InferencePool.

@kfswain left a comment:

Overall LGTM, I left a few comments. But this is awesome!

I'll run through the guide to set this up myself to try it out.

}

// DynamicLoadBalancing corresponds to InferencePool and InferenceModels belonging to the same pool.
type DynamicLoadBalancing struct {
Comment:

OOC, why the indirection of InferenceModel?

Was Backend preexisting and this just the faster method of adoption?

I think it's awesome; I'm actually really excited to see that you all implemented your own extension. Just hoping to learn where our API may have some rough edges.

@mathetake (Member Author):

So this filterapi intentionally decouples the extproc component from k8s/gateway concepts; that's why you see the "indirection" here. The rationale is ease of testing (i.e. running the e2e tests against the extproc) as well as driving adoption/use of the core AI features outside EG (in fact, we are using it in our internal Istio environment).

@mathetake (Member Author):

To be clear, in other words: this is not the configuration/control-plane API, but more of an implementation detail for almost everyone except advanced users who use only the core extproc without the control plane here. A sketch of that standalone use follows below.

good question anyway!
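
To make the decoupling concrete, a standalone consumer could load a filterapi-style config straight from a mounted file with no Kubernetes machinery at all. The struct fields and file path below are invented for illustration and do not match the real filterapi.Config:

package main

import (
	"fmt"
	"os"

	"sigs.k8s.io/yaml"
)

// config mimics the shape of a filterapi-style document; the fields here are
// invented for illustration and do not match the real filterapi.Config.
type config struct {
	Schema   string   `json:"schema"`
	Backends []string `json:"backends"`
}

func main() {
	// The extproc only needs a file (e.g. a mounted ConfigMap) to operate,
	// which is what makes it usable outside Envoy Gateway or Kubernetes.
	raw, err := os.ReadFile("/etc/ai-gateway/extproc.yaml") // illustrative path
	if err != nil {
		panic(err)
	}
	var c config
	if err := yaml.Unmarshal(raw, &c); err != nil {
		panic(err)
	}
	fmt.Printf("loaded %d backends (schema %s)\n", len(c.Backends), c.Schema)
}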

		}
		return err
	}
	if err := c.syncInferencePoolFn(ctx, &inferencePool); err != nil {
Comment:

So these syncs cascade up to syncAIGatewayRoute and eventually call reconcileExtProcConfigMap to reconcile on the InferenceModels.

We expect InferenceModels to potentially have high churn if a user has a large number of LoRA adapters and rolls out new versions regularly. I'm not sure how expensive the complete reconciliation func is, but I'm just calling that out as a potential issue.

@mathetake (Member Author):

Yeah, I am aware of that potential overload of the syncAIGatewayRoute function - it should be possible to refactor it later (I am not an expert on k8s controller implementations, to be clear...).

type DynamicLoadBalancing struct {
	// Models that can be served by this backend. If not matched, the 404 is returned to the client.
	//
	// If multiple models are provided, the request is routed to the backend based on the weights, criticality, etc.
Contributor:

We also want fallback between multiple models; for example, if a provisioned throughput model does not have extra capacity, we fall back to an on-demand model. In this case routing is based on the model name. A concrete example is to route to bedrock-runtime.us-east-1.aws.com and fall back using two different model names, model1-pt and model1-od.

Contributor:

We may need to extend the InferenceModel API to add a fallbackModels field (strawman sketch below).
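
Purely as a strawman for discussion, such a field might look like the following on the InferenceModel spec. fallbackModels does not exist in the Inference Extension API today; every name here is hypothetical.

package api

// InferenceModelSpec sketches a hypothetical extension of the InferenceModel
// spec; the FallbackModels field does not exist in the real API.
type InferenceModelSpec struct {
	// ModelName is the model name that clients put in the request body.
	ModelName string `json:"modelName"`
	// FallbackModels would be tried in order when the primary model has no
	// spare capacity, e.g. ["model1-pt", "model1-od"] for the Bedrock
	// example described above.
	FallbackModels []string `json:"fallbackModels,omitempty"`
}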

@wengyao04 (Contributor) commented Apr 1, 2025:

Is Models and Backends a 1-1 mapping, like

model[i] -> backend[i]?

or

[model0, model1, ..., model_N-1] -> backend[0]
...
[model0, model1, ..., model_N-1] -> backend[N-1]

@mathetake (Member Author):

In this case, N-to-M is the answer: an InferencePool can reference multiple AIServiceBackends via Selector, and one pool can have multiple models.

@mathetake (Member Author):

So the model-name fallback is interesting... I think it's not within the realm of InferencePool at the moment; currently it only has (optional) weight-based random picking. cc @kfswain https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/a13a1239330ffaaafa4d0f948c00cafd106086aa/pkg/epp/handlers/request.go#L71-L72

@mathetake requested review from @yuzisun and @wengyao04 on April 2, 2025 13:42
@mathetake (Member Author):

Cool, I think this should be almost ready!

@mathetake merged commit ebdf974 into main on Apr 2, 2025
17 checks passed
@mathetake deleted the poc branch on April 2, 2025 16:32
@mathetake (Member Author):

thank you folks!!!!!!
