
Commit 90b0936

LiqinruiG and liqinrui authored
[Docs] add api-key usage instructions (#4902)
* [Docs] add api-key usage instructions

Co-authored-by: liqinrui <[email protected]>
1 parent 41c0bef commit 90b0936

File tree

2 files changed: 76 insertions(+), 0 deletions(-)


docs/parameters.md

Lines changed: 39 additions & 0 deletions
@@ -58,6 +58,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight-loading acceleration, "default_v1" must be used. |
| ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable). |
| ```max_processor_cache``` | `int` | Maximum size of the processor cache in GiB (use 0 to disable). |
| ```api_key``` | `dict[str]` | API keys used to validate the service request headers; multiple keys may be supplied. Takes precedence over the `FD_API_KEY` environment variable. |

## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?

@@ -82,3 +83,41 @@ In actual inference, it's difficult for users to know how to properly configure
When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the number of prefill tokens per batch (it limited the token count of a single prefill), so the `max_num_partial_prefills` parameter is introduced to limit the number of partial prefill batches processed concurrently.

To optimize scheduling priority for short requests, a new parameter pair, `max_long_partial_prefills` and `long_prefill_token_threshold`, is added: the former limits the number of long requests in a single prefill batch, while the latter defines the token threshold above which a request counts as long. The system prioritizes batch space for short requests, reducing short-request latency in mixed workloads while maintaining stable throughput.
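The interaction of these parameters can be summarized in a short sketch. This is an illustrative model of the scheduling rule described above, not FastDeploy's internal implementation; the request object and function name are hypothetical:

```python
# Illustrative sketch only; not FastDeploy's internal scheduler code.
def admit_prefills(waiting, max_num_partial_prefills,
                   max_long_partial_prefills, long_prefill_token_threshold):
    """Pick requests for the next prefill batch, capping long requests."""
    batch, long_count = [], 0
    # Consider short requests first so they are not starved by long ones.
    for req in sorted(waiting, key=lambda r: r.num_tokens):
        is_long = req.num_tokens > long_prefill_token_threshold
        if is_long and long_count >= max_long_partial_prefills:
            continue  # the batch already holds its quota of long requests
        batch.append(req)
        long_count += is_long
        if len(batch) >= max_num_partial_prefills:
            break  # cap on concurrently processed partial prefills
    return batch
```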
## 4. ```api_key``` parameter description

Multiple API keys can be passed as startup arguments; these take precedence over the environment variable configuration:
```bash
--api-key "key1"
--api-key "key2"
```
Multiple keys can also be set through the environment variable, separated by commas:
```bash
export FD_API_KEY="key1,key2"
```
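Both forms yield the same set of keys. As a rough illustration of how the comma-separated value can be interpreted (a sketch only; the actual parsing happens inside FastDeploy):

```python
import os

# Sketch only: split the comma-separated FD_API_KEY value into a set of keys.
raw = os.environ.get("FD_API_KEY", "")
api_keys = {key.strip() for key in raw.split(",") if key.strip()}
print(api_keys)  # with the export above: {'key1', 'key2'}
```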
When sending requests with curl, add the authorization header; the request passes validation if the supplied key matches any configured `api_key`:
```bash
curl -X POST "http://0.0.0.0:8265/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer key1" \
  -d '{
    "messages": [
      {"role": "user", "content": "你好"}
    ],
    "stream": false,
    "return_token_ids": true,
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```
The server parses the `Authorization: Bearer` header and validates the extracted key (here, `key1`) against the configured keys.
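Conceptually, the check behaves like the sketch below (illustrative only; this is not the server's actual code):

```python
# Sketch of bearer-token validation; not FastDeploy's actual server code.
def is_authorized(headers: dict, api_keys: set) -> bool:
    auth = headers.get("Authorization", "")
    prefix = "Bearer "
    if not auth.startswith(prefix):
        return False
    # The request passes if the supplied key matches any configured key.
    return auth[len(prefix):].strip() in api_keys

assert is_authorized({"Authorization": "Bearer key1"}, {"key1", "key2"})
```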
When using the OpenAI SDK, pass the `api_key` parameter:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="http://localhost:8000/v1",
)
```
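A request can then be issued as usual through the SDK; the model name below is a placeholder for whatever model your deployment serves:

```python
# "default" is a placeholder model name; substitute your deployed model.
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello"}],
    stream=False,
)
print(response.choices[0].message.content)
```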

docs/zh/parameters.md

Lines changed: 37 additions & 0 deletions
@@ -56,6 +56,7 @@
| ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight-loading acceleration, "default_v1" must be used. |
| ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable). |
| ```max_processor_cache``` | `int` | Maximum size of the processor cache in GiB (use 0 to disable). |
| ```api_key``` | `dict[str]` | API keys used to validate the service request headers; multiple keys may be supplied. Behaves the same as the `FD_API_KEY` environment variable and takes precedence over it. |

## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?

@@ -79,3 +80,39 @@ During FastDeploy inference, GPU memory is occupied by the ```model weights```, the ```pre-allocated KVCache

When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the number of prefill tokens per batch (it limited the token count of a single prefill), so the `max_num_partial_prefills` parameter is introduced to limit the number of partial prefill batches processed concurrently.

To optimize scheduling priority for short requests, a new parameter pair, `max_long_partial_prefills` and `long_prefill_token_threshold`, is added: the former limits the number of long requests in a single prefill batch, while the latter defines the token threshold above which a request counts as long. The system prioritizes batch space for short requests, reducing short-request latency in mixed workloads while maintaining stable throughput.
## 4. ```api_key``` parameter usage

Multiple API keys can be passed as startup arguments; these take precedence over the environment variable configuration:
```bash
--api-key "key1"
--api-key "key2"
```
Multiple keys can also be set through the environment variable, separated by commas:
```bash
export FD_API_KEY="key1,key2"
```
When sending requests with curl, add the authorization header for request validation; matching any configured ```api_key``` is sufficient:
```bash
curl -X POST "http://0.0.0.0:8265/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer key1" \
  -d '{
    "messages": [
      {"role": "user", "content": "你好"}
    ],
    "stream": false,
    "return_token_ids": true,
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```
The server parses the `Authorization: Bearer` header and validates the extracted key (`key1`) against the configured keys.
When using the OpenAI SDK, pass the `api_key` parameter:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="http://localhost:8000/v1",
)
```
