feat: initial implementation of Inference Extension #493
Conversation
cc @yuzisun
Obviously not working yet, but I think this shows how this can be implemented here with minimal changes.
One big missing piece is to generate a special Envoy cluster that has a new LB policy attached; that can be an EnvoyPatchPolicy resource applied at a global level.
OK, I'll get back to this...
Instead of allowing AIServiceBackend.BackendRef to reference InferencePool, it seems semantically correct to allow
OK, the control plane is almost done.
On Monday I will continue backfilling the unit tests for the control plane, and then the extproc part.
Almost there.
Sorry that the PR got larger than I expected; please refer to the PR description for an overview before diving in.
One thing to consider: I think it also makes sense to do the DNS resolution during the reconciliation and return
Thanks for all the hard work @mathetake 🙌
Thanks @mathetake! I am going to take a look tonight.
No rush!
I should also be able to do a review in the next 24h. Thanks for this!
```go
	AIGatewayRouteRuleBackendRefAIServiceBackend AIGatewayRouteRuleBackendRefKind = "AIServiceBackend"
	// AIGatewayRouteRuleBackendRefInferencePool is the kind of the InferencePool in the Gateway API Inference Extension.
	// https://github.com/kubernetes-sigs/gateway-api-inference-extension
	AIGatewayRouteRuleBackendRefInferencePool AIGatewayRouteRuleBackendRefKind = "InferencePool"
```
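For illustration, a minimal sketch of how a controller could branch on the new kind field. The types and function below are local stand-ins, not the real AI Gateway API; only the two kind string values mirror the constants above.

```go
package sketch

// backendRefKind and backendRef are stand-ins for the real API types.
type backendRefKind string

const (
	kindAIServiceBackend backendRefKind = "AIServiceBackend"
	kindInferencePool    backendRefKind = "InferencePool"
)

type backendRef struct {
	Name string
	Kind backendRefKind
}

// resolveRef shows the branching a controller might do on the new kind field:
// AIServiceBackend handling stays as before, and the InferencePool case
// triggers the Inference Extension path.
func resolveRef(ref backendRef) string {
	switch ref.Kind {
	case kindInferencePool:
		return "resolve InferencePool " + ref.Name + " and its InferenceModels"
	case kindAIServiceBackend:
		return "resolve AIServiceBackend " + ref.Name
	default:
		return "unknown backendRef kind: " + string(ref.Kind)
	}
}
```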
Currently InferencePool is a list of pods. If I want to create a pool of AIServiceBackends, would that be another CR like AIServiceBackendPool?
> Currently InferencePool is a list of pods,

As I commented in the description as well as in the API's comment, InferencePool.Spec.Selector will be used to select AIServiceBackends, not pods. That is allowed in the API spec of InfExt, and I think that works as you intended in the comment?
Ah, I saw that in your examples now; that's a pretty neat way to unify the ingress and egress use cases!
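As a rough sketch of the selector-based selection described above, the same label-matching logic that picks Pods can pick AIServiceBackends. The backend type here is a stand-in rather than the real AIServiceBackend CRD, and the selector is assumed to be a plain label map shaped like InferencePool.Spec.Selector.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

// backend is a stand-in for any labelled object, e.g. an AIServiceBackend.
type backend struct {
	Name   string
	Labels map[string]string
}

// selectBackends returns the backends whose labels match the pool's selector.
func selectBackends(selector map[string]string, all []backend) []backend {
	sel := labels.SelectorFromSet(labels.Set(selector))
	var out []backend
	for _, b := range all {
		if sel.Matches(labels.Set(b.Labels)) {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	backends := []backend{
		{Name: "openai-backend", Labels: map[string]string{"pool": "my-pool"}},
		{Name: "aws-backend", Labels: map[string]string{"pool": "other"}},
	}
	// A selector shaped like InferencePool.Spec.Selector (assumed here to be a label map).
	fmt.Println(selectBackends(map[string]string{"pool": "my-pool"}, backends))
}
```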
If dynamic load balancing becomes a core feature of the Envoy AI Gateway, then this InferencePool API will be part of the core AI Gateway API.
Yeah, that's a good point and my concern as well. I think the question is where we enforce the cluster-level load balancing functionality (as opposed to the endpoint-level one, which cannot do the transformation, auth, etc.). If that is not part of the InfExt API's scope, then we are good to go: we provide cluster-level dynamic load balancing in our API, and InferencePool only does the endpoint-level part.
Is there a concern with relying on the InfPool API? Our intent is to keep it simple and flexible. (We use it for pods in the reference implementation, but, as you have done here, any inference endpoint can work as long as it is being selected on.)
No concern at the moment, but we will see.
```go
) {
	m, ok := dlb.models[model]
	if !ok {
		err = fmt.Errorf("model %s is not found in the dynamic load balancer", model)
```
This may not necessarily be an error, as models is optional in dlb; it could be a list of backend hosts.
Yeah, this line corresponds to the reference implementation of this line: so !ok == true here means that we couldn't find the model name in the list of InferenceModel(s) that belong to the InferencePool. I think I should make "models" not omitempty and skip the construction of dlb when there are no models belonging to the InferencePool.
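A minimal sketch of that guard, using hypothetical names (modelConfig, newDynamicLoadBalancer) rather than the actual extproc types: skip building the dlb when the pool has no InferenceModels, so a model-name miss at request time is a genuine error.

```go
package dlbsketch

// modelConfig and dynamicLoadBalancer are hypothetical stand-ins for the
// real extproc types; only the guard logic is the point here.
type modelConfig struct{ Weight int }

type dynamicLoadBalancer struct {
	models map[string]modelConfig
}

// newDynamicLoadBalancer returns (nil, false) when no InferenceModels belong
// to the pool, so callers never build a dlb that cannot match any model.
func newDynamicLoadBalancer(models map[string]modelConfig) (*dynamicLoadBalancer, bool) {
	if len(models) == 0 {
		return nil, false
	}
	return &dynamicLoadBalancer{models: models}, true
}
```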
Overall LGTM, I left a few comments. But this is awesome!
I'll run through the guide to set this up myself to try it out.
```go
}

// DynamicLoadBalancing corresponds to InferencePool and InferenceModels belonging to the same pool.
type DynamicLoadBalancing struct {
```
OOC, why the indirection of InferenceModel? Was Backend preexisting, and is this just the faster method of adoption? I think it's awesome; I'm actually really excited to see that you all implemented your own extension, just hoping to learn where our API may have some rough edges.
So this filterapi is intentionally trying to decouple the extproc component from the k8s/Gateway concepts, which is why you see the "indirection" here. The rationale is ease of testing (i.e. running the e2e tests with the extproc) as well as driving the adoption/use of the core AI features outside EG (in fact, we are using it in our internal Istio environment).
To put it another way, this is not the configuration/control-plane API; it is more of an implementation detail for almost everyone, except for advanced users running only the core extproc without the control plane here. Good question anyway!
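A small sketch of the decoupling being described: because the filter configuration is plain Go data with no k8s or Gateway API dependencies, a standalone extproc or a unit test can load it directly. The struct and field names below are illustrative stand-ins, not the real filterapi types.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// dynamicLoadBalancing and filterConfig are stand-ins shaped loosely like a
// filter configuration; the real filterapi types differ.
type dynamicLoadBalancing struct {
	Models   []string `yaml:"models"`
	Backends []string `yaml:"backends"`
}

type filterConfig struct {
	DynamicLoadBalancing *dynamicLoadBalancing `yaml:"dynamicLoadBalancing,omitempty"`
}

func main() {
	raw := []byte(`
dynamicLoadBalancing:
  models: [llama-3-8b]
  backends: [backend-a, backend-b]
`)
	var cfg filterConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	// No controller, CRDs, or cluster needed to exercise the config path.
	fmt.Printf("%+v\n", *cfg.DynamicLoadBalancing)
}
```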
```go
	}
	return err
}
if err := c.syncInferencePoolFn(ctx, &inferencePool); err != nil {
```
So these syncs cascade up to syncAIGatewayRoute and eventually call reconcileExtProcConfigMap to reconcile on the InferenceModels. We expect InferenceModels to potentially have high churn if a user has a large number of LoRA adapters and rolls out new versions regularly. I'm not sure how expensive the complete reconciliation func is, but I'm just calling that out as a potential issue.
Yeah, I am aware of that potential overload of the syncAIGatewayRoute function; it should be possible to refactor it later (I am not an expert on k8s controller implementations, to be clear...).
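One possible mitigation for the InferenceModel churn concern, sketched with plain stdlib primitives rather than the real controller wiring (the names debouncer and syncFn are hypothetical): coalesce bursts of updates so the expensive sync runs at most once per quiet interval.

```go
package reconcile

import (
	"sync"
	"time"
)

// debouncer coalesces bursts of events into a single call to syncFn.
type debouncer struct {
	mu     sync.Mutex
	timer  *time.Timer
	delay  time.Duration
	syncFn func()
}

func newDebouncer(delay time.Duration, syncFn func()) *debouncer {
	return &debouncer{delay: delay, syncFn: syncFn}
}

// Trigger schedules syncFn after the quiet period, resetting the timer if
// another event (e.g. an InferenceModel update) arrives in the meantime.
func (d *debouncer) Trigger() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.delay, d.syncFn)
}
```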
```go
type DynamicLoadBalancing struct {
	// Models that can be served by this backend. If not matched, the 404 is returned to the client.
	//
	// If multiple models are provided, the request is routed to the backend based on the weights, criticality, etc.
```
We also want fallback between multiple models; for example, if the provisioned throughput model does not have extra capacity, we fall back to the on-demand model. In this case routing is based on the model name. A concrete example is to route to bedrock-runtime.us-east-1.aws.com and fall back between two different model names, model1-pt and model1-od.
We may need to extend the InferenceModel API to add a fallbackModels field.
Are Models and Backends a 1-1 mapping, like model[i] -> backend[i]? Or is it [model0, model1, ..., model_N-1] -> backend[0], ..., [model0, model1, ..., model_N-1] -> backend[N-1]?
In this case, N-to-M is the answer: an InferencePool can reference multiple AIServiceBackends via Selector, and one pool can have multiple models.
So the model-name fallback is interesting... I think it's not within the realm of InferencePool at the moment; currently it only has (optional) weight-based random picking. cc @kfswain https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/a13a1239330ffaaafa4d0f948c00cafd106086aa/pkg/epp/handlers/request.go#L71-L72
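Since the fallback discussed in this thread is not part of the InferencePool API today, the following is only a sketch of the idea at the request-handling level, with hypothetical names (invoke, errCapacity; the model names model1-pt and model1-od come from the example above): try the provisioned-throughput model name first and retry with the on-demand name on a capacity failure.

```go
package fallback

import "errors"

// errCapacity is a hypothetical sentinel for "no spare capacity" responses.
var errCapacity = errors.New("no provisioned capacity")

// callWithFallback tries the provisioned-throughput model first and falls
// back to the on-demand model name only on capacity errors.
func callWithFallback(invoke func(model string) error) error {
	for _, model := range []string{"model1-pt", "model1-od"} {
		err := invoke(model)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errCapacity) {
			return err // non-capacity errors are not retried
		}
	}
	return errCapacity
}
```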
Cool, I think this should be almost ready!
Thank you folks!!
Commit Message
This commit scaffolds the foundation for the Inference Extension API [1]. The design documentation was merged in #492. The controller needs to be started with --enableInferenceExtension=true so as not to break existing controller deployments where the Inference Extension CRDs are not installed.

This commit doesn't implement the actual "metrics-aware" load balancing; instead it just does random routing over the given (resolved) endpoints. Follow-up implementations will add more advanced algorithms while expanding the metrics interface, which currently only provides the setter APIs.
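A minimal sketch of that interim behaviour, with an endpoint simplified to a string (the real endpoint type is an assumption here): pick one of the resolved endpoints uniformly at random until the metrics-aware algorithm lands.

```go
package pick

import "math/rand"

// pickEndpoint returns one of the resolved endpoints uniformly at random,
// or false when no endpoints were resolved.
func pickEndpoint(endpoints []string) (string, bool) {
	if len(endpoints) == 0 {
		return "", false
	}
	return endpoints[rand.Intn(len(endpoints))], true
}
```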
The summary of the implementation is:
- A kind field is added to AIGatewayRouteRuleBackendRef so that it can reference InferencePool.

1: https://github.com/kubernetes-sigs/gateway-api-inference-extension
Related Issues/PRs (if applicable)
Built on #492
Contributes to #423