docs/parameters.md
|```load_choices```|`str`| By default, the "default" loader is used for weight loading. To load Torch weights or enable weight acceleration, "default_v1" must be used.|
|```max_encoder_cache```|`int`| Maximum number of tokens in the encoder cache (use 0 to disable). |
|```max_processor_cache```|`int`| Maximum size of the processor cache in GiB (use 0 to disable). |
|```api_key```|`dict[str]`| API keys used to validate the service request headers; multiple keys can be configured. |
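
As a rough illustration of the entries above (assuming, as with the other parameters on this page, that they can also be passed as keyword arguments to the offline `LLM` entry point), a configuration might look like the following sketch:

```python
from fastdeploy import LLM

# Sketch only: the model path is a placeholder and the keyword names simply
# mirror the parameter names in the table above.
llm = LLM(
    model="./your-model-path",
    load_choices="default_v1",  # v1 loader: needed for Torch weights / accelerated loading
    max_encoder_cache=8192,     # encoder cache capacity in tokens (0 disables it)
    max_processor_cache=4,      # processor cache capacity in GiB (0 disables it)
)
```
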
## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?
When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the total number of tokens in a prefill batch (it now limits the token count of a single prefill), so the `max_num_partial_prefills` parameter is introduced specifically to limit the number of concurrently processed partial prefills.
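
As a minimal sketch (assuming these options can be passed as keyword arguments to the offline `LLM` entry point, like the other parameters on this page), enabling chunked prefill might look like:

```python
from fastdeploy import LLM

# Sketch only (assumption): keyword names mirror the parameters documented on
# this page; the model path is a placeholder.
llm = LLM(
    model="./your-model-path",
    enable_chunked_prefill=True,   # process long inputs via dynamic chunking
    max_num_partial_prefills=4,    # cap on concurrently processed partial prefills
)
```
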
To optimize scheduling priority for short requests, a new parameter pair, `max_long_partial_prefills` and `long_prefill_token_threshold`, is added. The former limits the number of long requests in a single prefill batch, while the latter defines the token threshold above which a request counts as long. The system prioritizes batch space for short requests, reducing short-request latency in mixed workload scenarios while maintaining stable throughput; the admission rule these parameters describe is sketched below.
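
The following toy function is illustrative only, not FastDeploy's actual scheduler: it shows how a prefill batch could be assembled so that short requests are admitted first and the number of long requests is capped.

```python
# Illustrative sketch only; not FastDeploy's implementation.
def build_prefill_batch(requests, max_num_partial_prefills,
                        max_long_partial_prefills, long_prefill_token_threshold):
    """requests: list of dicts like {"id": ..., "prompt_tokens": int}."""
    batch, long_count = [], 0
    # Admit short requests first so they are not starved by long ones.
    for req in sorted(requests, key=lambda r: r["prompt_tokens"]):
        if len(batch) >= max_num_partial_prefills:
            break
        is_long = req["prompt_tokens"] >= long_prefill_token_threshold
        if is_long and long_count >= max_long_partial_prefills:
            continue  # defer this long request to a later batch
        batch.append(req)
        long_count += int(is_long)
    return batch
```

For example, with `long_prefill_token_threshold=1024` and `max_long_partial_prefills=1`, at most one request with 1024 or more prompt tokens is admitted per batch, and shorter requests fill the remaining `max_num_partial_prefills` slots.
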
## 4. ```api_key``` parameter description
Multiple API keys can be configured at service startup; keys configured at startup take precedence over the environment variable configuration.
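
On the client side, a minimal sketch, assuming the server follows the common OpenAI-compatible convention of validating an `Authorization: Bearer <key>` request header (the URL, port, and endpoint path below are placeholders):

```python
import requests

# Sketch only: URL, port, and header convention are assumptions, not taken from this page.
response = requests.post(
    "http://localhost:8188/v1/chat/completions",
    headers={"Authorization": "Bearer your-api-key"},  # must match one of the configured keys
    json={"messages": [{"role": "user", "content": "Hello"}]},
)
print(response.status_code)  # a rejected key should return an auth error status
```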