
Inference Extension: Support model traffic splitting #3840

@sjberman

Description


When a client sends a request to an AI workload, the desired model name (e.g. gpt-4o or llama) is included in the request body.

By default, the EPP gets the model name from the request body and then picks the proper endpoint for that model. However, the model name can also be provided via a header (X-Gateway-Model-Name). For example, a user could configure a traffic split, in which case NGINX would need to change the model name, depending on the weighted traffic split decision, by setting this header.
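For illustration, here is a minimal nginx.conf sketch of how split_clients could produce the weighted model-name decision and set the header on the request to the EPP. The upstream address, percentages, and model names are hypothetical placeholders, not taken from the design doc.

```nginx
events {}

http {
    # split_clients hashes the key (here $request_id) and assigns each
    # request to a bucket by weight: 90% of requests get one model
    # name, the rest get the other. Model names are placeholders.
    split_clients "${request_id}" $split_model_name {
        90%     "food-review-1";
        *       "food-review-2";
    }

    # Hypothetical EPP address.
    upstream epp_backend {
        server 10.0.0.1:9002;
    }

    server {
        listen 8080;

        location / {
            # Override the model name the EPP sees with the weighted
            # split decision (header name is from this issue).
            proxy_set_header X-Gateway-Model-Name $split_model_name;
            proxy_pass http://epp_backend;
        }
    }
}
```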

Using the NJS module that extracts the model name from the request body, we should set the X-Gateway-Model-Name header appropriately when querying the EPP, so that the EPP returns the proper endpoint for the model the user requested.
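As a rough sketch of the extraction step, the following njs function (hypothetical file and function names) pulls the model name out of an OpenAI-style JSON body, assuming the body has already been buffered so that it is available to njs as r.requestText:

```javascript
// model.js -- hypothetical njs module.
function extractModelName(r) {
    try {
        // OpenAI-compatible requests carry the model name in the
        // top-level "model" field of the JSON body.
        return JSON.parse(r.requestText).model || '';
    } catch (e) {
        // Missing or non-JSON body: return an empty name.
        return '';
    }
}

export default { extractModelName };
```

This could be wired into the config with the standard njs directives, e.g. `js_import main from conf.d/model.js;` and `js_set $model_from_body main.extractModelName;` (the file, module, and variable names are placeholders).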

Acceptance Criteria:

  • If the HTTPRoute specifies an Exact match condition on the X-Gateway-Model-Name header AND specifies a RequestHeaderModifier for this header, then NGINX extracts the model name from the client request body.
  • If the user is performing a traffic split (see YAML example 2 in the design doc), and the model name extracted from the request body MATCHES the condition specified in the HTTPRoute, then NGINX should set the new header value in the request to the EPP, based on the RequestHeaderModifier filter and the weighted decision (using split_clients). A hypothetical HTTPRoute sketch follows this list.
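
Below is a hypothetical HTTPRoute sketch of what the two criteria describe together: an Exact match on X-Gateway-Model-Name combined with per-backend RequestHeaderModifier filters on weighted backendRefs. All resource names, weights, and the InferencePool group/kind are placeholders; see YAML example 2 in the design doc for the authoritative version.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-split   # hypothetical name
spec:
  parentRefs:
    - name: inference-gateway   # hypothetical Gateway
  rules:
    - matches:
        - headers:
            - type: Exact
              name: X-Gateway-Model-Name
              value: food-review   # model name extracted from the body
      backendRefs:
        # 90/10 split: each backendRef rewrites the header to the
        # model variant the EPP should resolve an endpoint for.
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-pool
          weight: 90
          filters:
            - type: RequestHeaderModifier
              requestHeaderModifier:
                set:
                  - name: X-Gateway-Model-Name
                    value: food-review-1
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-pool
          weight: 10
          filters:
            - type: RequestHeaderModifier
              requestHeaderModifier:
                set:
                  - name: X-Gateway-Model-Name
                    value: food-review-2
```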

Design doc: https://github.com/nginx/nginx-gateway-fabric/blob/main/docs/proposals/gateway-inference-extension.md
