
Inference Extension: Support model traffic splitting #3840

@sjberman

Description


When a client sends a request to an AI workload, the desired model name (e.g. gpt-4o or llama) is included in the request body.

By default, the EPP gets the model name from the request body and then picks the proper endpoint for that model. However, the model name can also be provided via a header (X-Gateway-Model-Name). For example, a user could configure a traffic split, in which case NGINX would need to change the model name, depending on the weighted traffic split decision, by setting this header.
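For illustration, here is a minimal nginx.conf sketch of how split_clients could produce the weighted model-name decision and set the header on the request to the EPP. The upstream address, percentages, and model names are hypothetical placeholders, not taken from the design doc.

```nginx
events {}

http {
    # split_clients hashes the key (here $request_id) and assigns each
    # request to a bucket by weight: 90% of requests get one model
    # name, the rest get the other. Model names are placeholders.
    split_clients "${request_id}" $split_model_name {
        90%     "food-review-1";
        *       "food-review-2";
    }

    # Hypothetical EPP address.
    upstream epp_backend {
        server 10.0.0.1:9002;
    }

    server {
        listen 8080;

        location / {
            # Override the model name the EPP sees with the weighted
            # split decision (header name is from this issue).
            proxy_set_header X-Gateway-Model-Name $split_model_name;
            proxy_pass http://epp_backend;
        }
    }
}
```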

Using the NJS module that extracts the model name from the request body, we should set the X-Gateway-Model-Name header appropriately when querying the EPP, so that the EPP returns the proper endpoint for the model the user requested.
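As a rough sketch of the extraction step, the following njs function (hypothetical file and function names) pulls the model name out of an OpenAI-style JSON body, assuming the body has already been buffered so that it is available to njs as r.requestText:

```javascript
// model.js -- hypothetical njs module.
function extractModelName(r) {
    try {
        // OpenAI-compatible requests carry the model name in the
        // top-level "model" field of the JSON body.
        return JSON.parse(r.requestText).model || '';
    } catch (e) {
        // Missing or non-JSON body: return an empty name.
        return '';
    }
}

export default { extractModelName };
```

This could be wired into the config with the standard njs directives, e.g. `js_import main from conf.d/model.js;` and `js_set $model_from_body main.extractModelName;` (the file, module, and variable names are placeholders).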

Acceptance Criteria:

  • If the HTTPRoute specifies an Exact match condition on the X-Gateway-Model-Name header AND specifies a RequestHeaderModifier for this header, then NGINX extracts the model name from the client request body.
  • If the user is performing a traffic split (see YAML example 2 in the design doc), and the model name extracted from the request body MATCHES the condition specified in the HTTPRoute, then NGINX should set the new header value in the request to the EPP, based on the RequestHeaderModifier filter and the weighted decision (using split_clients). A hypothetical HTTPRoute sketch follows this list.
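
Below is a hypothetical HTTPRoute sketch of what the two criteria describe together: an Exact match on X-Gateway-Model-Name combined with per-backend RequestHeaderModifier filters on weighted backendRefs. All resource names, weights, and the InferencePool group/kind are placeholders; see YAML example 2 in the design doc for the authoritative version.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-split   # hypothetical name
spec:
  parentRefs:
    - name: inference-gateway   # hypothetical Gateway
  rules:
    - matches:
        - headers:
            - type: Exact
              name: X-Gateway-Model-Name
              value: food-review   # model name extracted from the body
      backendRefs:
        # 90/10 split: each backendRef rewrites the header to the
        # model variant the EPP should resolve an endpoint for.
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-pool
          weight: 90
          filters:
            - type: RequestHeaderModifier
              requestHeaderModifier:
                set:
                  - name: X-Gateway-Model-Name
                    value: food-review-1
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-pool
          weight: 10
          filters:
            - type: RequestHeaderModifier
              requestHeaderModifier:
                set:
                  - name: X-Gateway-Model-Name
                    value: food-review-2
```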

Design doc: https://github.com/nginx/nginx-gateway-fabric/blob/main/docs/proposals/gateway-inference-extension.md
