# How to Configure Xtuner Concurrency to Improve Rollout Efficiency
During the Rollout phase of Xtuner, properly configuring concurrency-related parameters ensures that the inference engine maintains high load, fully utilizes hardware resources, and improves overall inference efficiency. This document introduces the main concurrency-related configuration options in Xtuner, explains their relationships, and provides best practice recommendations.
## Main Concurrency-Related Configuration Parameters
1. **RolloutConfig.rollout_max_batch_size_per_instance**
   - Controls the maximum batch size that a single inference instance (such as a model process) can handle at one time.
   - Larger batch sizes can improve GPU utilization, but excessively large values may cause out-of-memory errors or increased latency.
   - It is recommended to adjust this parameter based on the model's `context_length` and the available GPU memory.
   - Xtuner will provide recommended configurations for common models and context lengths in future releases.
2. **RolloutConfig.allow_over_concurrency_ratio**
   - Controls the over-concurrency ratio for HTTP requests to ensure the inference engine is fully loaded.
3. **DataflowConfig.max_concurrent**
   - Controls the maximum number of concurrent tasks in Dataflow. Dataflow acts as a single controller, distributing data to all rollout workers.
   - Dataflow sends a batch of data each time; the actual number of data items sent at the same time is `max_concurrent * prompt_repeat_k`.
   - It is recommended to set this slightly higher than the actual processing capability of the inference engine to ensure the inference queue always has tasks.
4. **RAY_MAX_CONCURRENCY**
   - The maximum concurrency for the Ray backend, configured via an environment variable. The default is 1024.
5. **httpx max connections**
   - Controls the maximum number of concurrent connections that the HTTP client (such as the RolloutWorker) can open to the inference service.
   - It is recommended to set this equal to or slightly higher than `rollout_max_batch_size_per_instance`.
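These parameters live in different places: Xtuner config objects, a Ray environment variable, and the HTTP client. As a rough illustration of the last two, here is a minimal sketch. The environment variable name, its default of 1024, and the sizing rule for the client pool come from this document; the standalone `httpx` client and the use of `httpx.Limits` are illustrative assumptions, not Xtuner's actual internals.

```python
import os

import httpx

# Ray backend concurrency is taken from the environment (default 1024 per this doc);
# export it before the Ray workers are started.
os.environ.setdefault("RAY_MAX_CONCURRENCY", "1024")

# Item 5: size the HTTP client's connection pool at or slightly above the
# per-instance batch size, here scaled by the over-concurrency ratio (item 2).
rollout_max_batch_size_per_instance = 512
allow_over_concurrency_ratio = 1.2
max_connections = int(rollout_max_batch_size_per_instance * allow_over_concurrency_ratio)

# Illustrative client only; Xtuner's RolloutWorker constructs its own client internally.
client = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=max_connections),
    timeout=3600.0,  # matches the rollout_timeout default shown in the diff below
)
```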
## Configuration Relationships and Recommendations
- **Recommended Configuration Process** (a worked sketch of this arithmetic appears at the end of this section):
1. Determine a reasonable value for `rollout_max_batch_size_per_instance` based on the model and hardware resources (e.g., 128, 256, 512, 1024). This parameter is optional; if not provided, Xtuner will use preset values based on `context_length`: concurrency is 1024 for `context_length` ≤ 4K, 512 for ≤ 16K, and 128 for ≤ 32K.
2. Set `DataflowConfig.max_concurrent`. It is recommended to use `rollout_max_batch_size_per_instance * num_of_infer_instance / prompt_repeat_k * allow_over_concurrency_ratio`, where `num_of_infer_instance` is the number of inference engine instances started (usually number of nodes / `tensor_parallel_size`).
3. Set the `RAY_MAX_CONCURRENCY` environment variable. It is recommended to set this equal to or slightly higher than `rollout_max_batch_size_per_instance * num_of_infer_instance`.
4. The default httpx max connections is `rollout_max_batch_size_per_instance * allow_over_concurrency_ratio`.
- **Dynamic Adjustment**: You can dynamically adjust these parameters by monitoring the inference queue length, GPU utilization, and response latency to find the optimal concurrency configuration.
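As a worked example of the process above, the sketch below computes the recommended values in plain Python. The preset table in step 1 and the formulas in steps 2 to 4 come from this document; `prompt_repeat_k = 8` and `num_of_infer_instance = 2` are illustrative assumptions, not Xtuner defaults.

```python
def preset_batch_size(context_length: int) -> int:
    """Preset used when rollout_max_batch_size_per_instance is not given (step 1)."""
    if context_length <= 4 * 1024:
        return 1024
    if context_length <= 16 * 1024:
        return 512
    if context_length <= 32 * 1024:
        return 128
    raise ValueError("pick rollout_max_batch_size_per_instance manually for longer contexts")


rollout_max_batch_size_per_instance = preset_batch_size(context_length=16 * 1024)  # 512
allow_over_concurrency_ratio = 1.2   # RolloutConfig default (see the diff below)
prompt_repeat_k = 8                  # illustrative repeat factor
num_of_infer_instance = 2            # number of inference engine instances launched

# Step 2: DataflowConfig.max_concurrent
max_concurrent = int(
    rollout_max_batch_size_per_instance * num_of_infer_instance
    / prompt_repeat_k
    * allow_over_concurrency_ratio
)  # int(512 * 2 / 8 * 1.2) = 153

# Step 3: RAY_MAX_CONCURRENCY, equal to or slightly above total per-instance capacity
ray_max_concurrency = rollout_max_batch_size_per_instance * num_of_infer_instance  # 1024

# Step 4: httpx max connections per rollout worker
httpx_max_connections = int(
    rollout_max_batch_size_per_instance * allow_over_concurrency_ratio
)  # int(512 * 1.2) = 614

print(max_concurrent, ray_max_concurrency, httpx_max_connections)
```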
The corresponding change to `xtuner/v1/ray/config/worker.py` (25 additions, 20 deletions):

```diff
@@ -34,23 +34,22 @@ class RolloutConfig(BaseModel):
         model_path (str | Path): Path to the inference model.
         model_name (str): Model name for the backend engine.
         tokenizer_path (str): Path to the model tokenizer. Defaults to "".
-        api_key (Optional[Union[List[str], str]]): API keys for rollout service.
-            Supports single key or list of keys. Defaults to None.
-
+        api_key (Optional[Union[List[str], str]]): API keys for rollout service. Supports single key or list of keys. Defaults to None.
+        api_port (Optional[int]): Port number for the rollout API server. If not set, it will find an available port starting from 8000. Defaults to 8000.
         gpus_per_node (int): Number of GPUs per node. Defaults to 8.
         dtype (str): Model data type ('bfloat16', 'float16', 'int8'). Defaults to "bfloat16".
         gpu_memory_utilization (float): GPU memory utilization ratio. Defaults to 0.85.
         random_seed (int): Random seed for reproducible generation. Defaults to 1024.
-
         rollout_cross_node_comm (bool): Enable cross-node communication. Defaults to False.
+        rollout_max_batch_size_per_instance (int): Maximum batch size for the rollout worker. If not set, it will be determined automatically based on `context_length`. Defaults to 512.
+        allow_over_concurrency_ratio (float): Factor to allow over-concurrency in HTTP requests for the rollout worker to improve GPU utilization. Defaults to 1.2.
         tensor_parallel_size (int): GPUs per inference engine (tensor parallelism). Defaults to 1.
         expert_parallel_size (int): Experts per inference engine (expert parallelism). Defaults to 1.
-
         enable_chunked_prefill (bool): Enable chunked prefill for memory efficiency. Defaults to False.
         chunked_prefill_size (int): Chunk size for prefill operations. Defaults to 128.
         skip_load_weights (bool): Skip weight loading for rollout worker. Defaults to False.
         rollout_timeout (float): Timeout duration in seconds for rollout requests. Defaults to 3600.0.
-
+        context_length (int): Context length for the rollout worker.
         launch_server_method (Literal["ray", "multiprocessing"]): Server launch method. Defaults to "ray".
         system_prompt (Optional[str]): System prompt to guide generation behavior. Defaults to None.
         extra_rollout_config (Optional[dict]): Backend-specific configurations using engine prefixes
@@ -114,20 +113,20 @@ class RolloutConfig(BaseModel):
             help="Whether to enable cross-node communication for the rollout worker.",
         ),
     ] = False
-    rollout_max_batch_size: Annotated[
+    rollout_max_batch_size_per_instance: Annotated[
         int,
         Parameter(
             group=infer_group,
             help="Maximum batch size for the rollout worker. If not set, it will be determined automatically based on the model and GPU memory.",
         ),
     ] = 512
-    prompt_repeat_k: Annotated[
-        int,
+    allow_over_concurrency_ratio: Annotated[
+        float,
         Parameter(
             group=infer_group,
-            help="Number of times to repeat the prompt for each request in the rollout worker.",
+            help="Factor to allow over concurrency in the http request for rollout worker to improve GPU utilization.",
```