
Commit 2a93b61

Merge branch 'develop' into new_add_prompt_logprobs

2 parents fbd6840 + 90b0936

File tree

24 files changed: +1072 −26 lines changed

custom_ops/setup_ops.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -618,6 +618,7 @@ def find_end_files(directory, end_str):
     "gpu_ops/get_position_ids_and_mask_encoder_batch.cu",
     "gpu_ops/limit_thinking_content_length_v1.cu",
     "gpu_ops/limit_thinking_content_length_v2.cu",
+    "gpu_ops/update_attn_mask_offsets.cu",
     "gpu_ops/append_attn/mla_cache_kernel.cu",
     "gpu_ops/append_attn/get_block_shape_and_split_kv_block.cu",
     "gpu_ops/moe/tritonmoe_preprocess.cu",
```

docs/features/tool_calling.md

Lines changed: 222 additions & 0 deletions
# Tool Calling

This document describes how to configure the FastDeploy server to use a tool parser, and how to invoke tools from the client.

---
## Quickstart

### Starting FastDeploy with Tool Calling Enabled

Launch the server with tool calling enabled. This example uses ERNIE-4.5-21B-A3B and leverages the `ernie-x1` reasoning parser and the `ernie-x1` tool-call parser from the fastdeploy directory to extract the model's reasoning content, response content, and tool-calling information:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /models/ERNIE-4.5-21B-A3B \
    --port 8000 \
    --reasoning-parser ernie-x1 \
    --tool-call-parser ernie-x1
```
### Example of triggering tool calling

Send a request that includes the tool definitions to prompt the model to use an available tool:
```bash
curl -X POST http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Beijing?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather in a given location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, for example: Beijing"
              },
              "unit": {
                "type": "string",
                "enum": ["c", "f"],
                "description": "Temperature units: c = Celsius, f = Fahrenheit"
              }
            },
            "required": ["location", "unit"],
            "additionalProperties": false
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```
The example output is shown below. The model's reasoning (`reasoning_content`) and tool-call information (`tool_calls`) were successfully parsed, the response `content` is empty, and `finish_reason` is `tool_calls`:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "multimodal_content": null,
        "reasoning_content": "User wants to ... ",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-bc90641c67e44dbfb981a79bc986fbe5",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"北京\", \"unit\": \"c\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```
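The same request can also be issued with the OpenAI Python SDK. Below is a minimal sketch, assuming the server started above is reachable on port 8000; the `"EMPTY"` placeholder key and the model name are illustrative:

```python
# Minimal sketch using the OpenAI Python SDK against the server started above.
# The api_key placeholder and model name are illustrative; adjust to your deployment.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8000/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name, for example: Beijing"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["location", "unit"],
        },
    },
}]

response = client.chat.completions.create(
    model="ERNIE-4.5-21B-A3B",
    messages=[{"role": "user", "content": "What is the weather in Beijing?"}],
    tools=tools,
    stream=False,
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    for call in choice.message.tool_calls:
        # `arguments` is a JSON-encoded string, as in the output above.
        print(call.function.name, call.function.arguments)
```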
## Parallel Tool Calls

If the model can generate parallel tool calls, FastDeploy will return a list:
```python
tool_calls=[
    {"id": "...", "function": {...}},
    {"id": "...", "function": {...}}
]
```
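Each returned entry can then be dispatched independently on the client side. A minimal sketch, where `get_weather` is a hypothetical local implementation of the tool declared above and `tool_calls` mirrors the parsed response:

```python
import json

# Hypothetical local implementation of the get_weather tool declared above.
def get_weather(location: str, unit: str) -> str:
    return json.dumps({"location": location, "temperature": "23", "unit": unit})

available_tools = {"get_weather": get_weather}

# tool_calls as parsed from a non-streaming response (see the example output above);
# `arguments` arrives as a JSON-encoded string.
tool_calls = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_weather",
                  "arguments": "{\"location\": \"Beijing\", \"unit\": \"c\"}"}},
]

for call in tool_calls:
    fn = available_tools[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])
    print(call["id"], fn(**args))
```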
## Requests containing tools in the conversation history

If tool-call information exists in previous turns, you can construct the request as follows:
```bash
curl -X POST "http://0.0.0.0:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello, what is the weather in Beijing?"
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": {
                "location": "Beijing",
                "unit": "c"
              }
            }
          }
        ],
        "thoughts": "User needs to check today'"'"'s weather in Beijing."
      },
      {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": {
          "type": "text",
          "text": "{\"location\": \"北京\",\"temperature\": \"23\",\"weather\": \"sunny\",\"unit\": \"c\"}"
        }
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Determine weather in my location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": ["c", "f"]
              }
            },
            "additionalProperties": false,
            "required": ["location", "unit"]
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```
The parsed model output is as follows, containing the reasoning (`reasoning_content`) and the response content (`content`), with `finish_reason` set to `stop`:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Today's weather in Beijing is sunny with a temperature of 23 degrees Celsius.",
        "reasoning_content": "User wants to ...",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ]
}
```
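Putting both requests together, a typical client loop executes the returned tool call and feeds the result back as a `tool` message before asking the model again. A minimal sketch with the OpenAI SDK, under the same assumptions as the earlier SDK example (`get_weather` is a stand-in for a real lookup):

```python
import json
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8000/v1")

def get_weather(location: str, unit: str) -> str:
    # Stand-in for a real weather lookup.
    return json.dumps({"location": location, "temperature": "23",
                       "weather": "sunny", "unit": unit})

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Determine weather in my location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["location", "unit"],
        },
    },
}]

messages = [{"role": "user", "content": "Hello, what is the weather in Beijing?"}]
response = client.chat.completions.create(
    model="ERNIE-4.5-21B-A3B", messages=messages, tools=tools)
choice = response.choices[0]

if choice.finish_reason == "tool_calls":
    # Keep the assistant turn (including its tool_calls) in the history.
    messages.append(choice.message.model_dump(exclude_none=True))
    for call in choice.message.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": get_weather(**args),
        })
    # Second round trip: the model now answers from the tool result.
    response = client.chat.completions.create(
        model="ERNIE-4.5-21B-A3B", messages=messages, tools=tools)

print(response.choices[0].message.content)
```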
## Writing a Custom Tool Parser

FastDeploy supports custom tool parser plugins. You can refer to the existing parsers under `fastdeploy/entrypoints/openai/tool_parser` when creating your own.

A custom parser should implement:
```python
# import the required packages
# register the tool parser with ToolParserManager
@ToolParserManager.register_module("my-parser")
class MyToolParser(ToolParser):  # subclass the base class from the tool_parser directory
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    # implement tool-call parsing for non-streaming calls
    def extract_tool_calls(self, model_output: str, request: ChatCompletionRequest) -> ExtractedToolCallInformation:
        return ExtractedToolCallInformation(tools_called=False, tool_calls=[], content=model_output)

    # implement tool-call parsing for streaming calls
    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
        request: ChatCompletionRequest,
    ) -> DeltaMessage | None:
        # return a DeltaMessage built from delta_text (or None if nothing to emit)
        return delta
```
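As a concrete illustration, here is how the non-streaming hook might look for a hypothetical model that wraps each call in `<tool_call>...</tool_call>` tags. The tag format is invented for this example, and the `ToolCall`/`FunctionCall` result types are assumptions about the parser module's schema, not a confirmed FastDeploy API:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(self, model_output, request):
    matches = TOOL_CALL_RE.findall(model_output)
    if not matches:
        # No tool call found: return the text unchanged as normal content.
        return ExtractedToolCallInformation(
            tools_called=False, tool_calls=[], content=model_output)

    tool_calls = []
    for raw in matches:
        payload = json.loads(raw)  # e.g. {"name": "get_weather", "arguments": {...}}
        tool_calls.append(ToolCall(  # assumed schema type, see lead-in
            type="function",
            function=FunctionCall(
                name=payload["name"],
                arguments=json.dumps(payload.get("arguments", {})),
            ),
        ))

    # Anything outside the tags is surfaced as regular content.
    content = TOOL_CALL_RE.sub("", model_output).strip() or None
    return ExtractedToolCallInformation(
        tools_called=True, tool_calls=tool_calls, content=content)
```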
Enable it via:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model path> \
    --tool-parser-plugin <absolute path of the plugin file> \
    --tool-call-parser my-parser
```
---

docs/parameters.md

Lines changed: 39 additions & 0 deletions
@@ -58,6 +58,7 @@ When using FastDeploy to deploy models (including offline inference and service

| ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight acceleration, "default_v1" must be used. |
| ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable). |
| ```max_processor_cache``` | `int` | Maximum size of the processor cache in GiB (use 0 to disable). |
| ```api_key``` | `list[str]` | API keys validated against the service request headers; multiple keys are supported. |

## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?
@@ -82,3 +83,41 @@ In actual inference, it's difficult for users to know how to properly configure

When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the prefill-phase batch token count (limiting the single-prefill token count), so the `max_num_partial_prefills` parameter is introduced specifically to limit the number of concurrently processed partial batches.

To optimize scheduling priority for short requests, the new `max_long_partial_prefills` and `long_prefill_token_threshold` parameter combination is added. The former limits the number of long requests in a single prefill batch, the latter defines the token threshold for long requests. The system prioritizes batch space for short requests, thereby reducing short-request latency in mixed workload scenarios while maintaining stable throughput.
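For example, a launch that allows up to 4 concurrent partial prefills, at most 1 of them long, and treats requests over 8192 tokens as long might look as follows (flag spellings are assumed from the parameter names above; values are illustrative):

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model path> \
    --enable-chunked-prefill \
    --max-num-partial-prefills 4 \
    --max-long-partial-prefills 1 \
    --long-prefill-token-threshold 8192
```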
## 4. ```api_key``` parameter description

Multiple keys can be configured at startup by repeating the flag; values set here take precedence over the environment variable:
```bash
--api-key "key1"
--api-key "key2"
```
Multiple keys can also be configured through an environment variable, separated by commas:
```bash
export FD_API_KEY="key1,key2"
```

When making requests with curl, add the authorization header. A request is accepted if its key matches any configured `api_key`.
```bash
curl -X POST "http://0.0.0.0:8265/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer key1" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "stream": false,
    "return_token_ids": true,
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```
The server parses the `Authorization: Bearer` header and validates `key1` against the configured keys.
When using the openai SDK, pass the key through the `api_key` parameter:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="http://localhost:8000/v1"
)
```
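The SDK then attaches the key as a `Bearer` token on every request, for example (the model name is a placeholder):

```python
# The SDK sends "Authorization: Bearer your-api-key-here" automatically.
response = client.chat.completions.create(
    model="ERNIE-4.5-21B-A3B",  # placeholder; use your deployed model
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```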
