Why does my Triton service response time keep increasing at high QPM? #8317

<img width="524" height="260" alt="Image" src="https://github.com/user-attachments/assets/9ef3de38-0681-4f57-839b-7d690b4d7552" />

The vertical axis is response time in ms, measured at QPM=2000. The model served is xlm-roberta-large.onnx, exported with dynamic batch_size and sequence_length.

Here is the config.pbtxt.

```
name: "xlm-roberta-large-onnx"
backend: "onnxruntime"
dynamic_batching {
}
max_batch_size: 16

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1]
  }
]

output [
  {
    name: "embeds"
    data_type: TYPE_FP32
    dims: [512]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

parameters {
  key: "execution_mode"
  value: { string_value: "1" }
}
```
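
One variant of this config that might be worth testing is sketched below. It assumes the rising response time comes from requests queuing faster than the single GPU instance can drain them, which has not been confirmed here. The queue delay, preferred batch sizes, and instance count are untested guesses, not tuned values. Only the two changed sections are shown; everything else stays as above.

```
# Sketch only: the values below are untested assumptions, not tuned settings.
dynamic_batching {
  # Wait up to 500 us for more requests before dispatching a batch (assumed value).
  max_queue_delay_microseconds: 500
  # Batch sizes the scheduler should try to form (assumed; must not exceed max_batch_size: 16).
  preferred_batch_size: [ 8, 16 ]
}

instance_group [
  {
    # Two instances so one batch can execute while the next is being formed
    # (assumed; requires enough GPU memory for two copies of the model).
    count: 2
    kind: KIND_GPU
  }
]
```

If response time still grows steadily under the same load, that would suggest the arrival rate (2000 QPM is roughly 33 requests per second) exceeds what the GPU can sustain for this model, rather than a batching misconfiguration.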

