
Commit 2a93b61

Merge branch 'develop' into new_add_prompt_logprobs

2 parents fbd6840 + 90b0936

File tree

24 files changed: +1072 −26 lines changed

custom_ops/setup_ops.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -618,6 +618,7 @@ def find_end_files(directory, end_str):
     "gpu_ops/get_position_ids_and_mask_encoder_batch.cu",
     "gpu_ops/limit_thinking_content_length_v1.cu",
     "gpu_ops/limit_thinking_content_length_v2.cu",
+    "gpu_ops/update_attn_mask_offsets.cu",
     "gpu_ops/append_attn/mla_cache_kernel.cu",
     "gpu_ops/append_attn/get_block_shape_and_split_kv_block.cu",
     "gpu_ops/moe/tritonmoe_preprocess.cu",
```

docs/features/tool_calling.md

Lines changed: 222 additions & 0 deletions
# Tool Calling

This document describes how to configure the FastDeploy server to use a tool parser, and how to invoke tools from the client.

---
## Quickstart

### Starting FastDeploy with Tool Calling Enabled

Launch the server with tool calling enabled. This example uses ERNIE-4.5-21B-A3B and leverages the `ernie-x1` reasoning parser and the `ernie-x1` tool-call parser from the fastdeploy directory to extract the model's reasoning content, response content, and tool-calling information:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /models/ERNIE-4.5-21B-A3B \
    --port 8000 \
    --reasoning-parser ernie-x1 \
    --tool-call-parser ernie-x1
```
### Example of triggering tool calling

Send a request that includes the tool definitions to prompt the model to use an available tool:
```bash
curl -X POST http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Beijing?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather in a given location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, for example: Beijing"
              },
              "unit": {
                "type": "string",
                "enum": ["c", "f"],
                "description": "Temperature units: c = Celsius, f = Fahrenheit"
              }
            },
            "required": ["location", "unit"],
            "additionalProperties": false
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```
The example output is shown below. The model's reasoning (`reasoning_content`) and tool-call information (`tool_calls`) were successfully parsed, the response `content` is empty, and `finish_reason` is `tool_calls`:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "multimodal_content": null,
        "reasoning_content": "User wants to ... ",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-bc90641c67e44dbfb981a79bc986fbe5",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"北京\", \"unit\": \"c\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```
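The same request can also be issued with the OpenAI Python SDK. Below is a minimal sketch, assuming the server started above is reachable on port 8000; the `"EMPTY"` placeholder key and the model name are illustrative:

```python
# Minimal sketch using the OpenAI Python SDK against the server started above.
# The api_key placeholder and model name are illustrative; adjust to your deployment.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8000/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name, for example: Beijing"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["location", "unit"],
        },
    },
}]

response = client.chat.completions.create(
    model="ERNIE-4.5-21B-A3B",
    messages=[{"role": "user", "content": "What is the weather in Beijing?"}],
    tools=tools,
    stream=False,
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    for call in choice.message.tool_calls:
        # `arguments` is a JSON-encoded string, as in the output above.
        print(call.function.name, call.function.arguments)
```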
## Parallel Tool Calls

If the model can generate parallel tool calls, FastDeploy will return a list:
```python
tool_calls=[
    {"id": "...", "function": {...}},
    {"id": "...", "function": {...}}
]
```
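Each returned entry can then be dispatched independently on the client side. A minimal sketch, where `get_weather` is a hypothetical local implementation of the tool declared above and `tool_calls` mirrors the parsed response:

```python
import json

# Hypothetical local implementation of the get_weather tool declared above.
def get_weather(location: str, unit: str) -> str:
    return json.dumps({"location": location, "temperature": "23", "unit": unit})

available_tools = {"get_weather": get_weather}

# tool_calls as parsed from a non-streaming response (see the example output above);
# `arguments` arrives as a JSON-encoded string.
tool_calls = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_weather",
                  "arguments": "{\"location\": \"Beijing\", \"unit\": \"c\"}"}},
]

for call in tool_calls:
    fn = available_tools[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])
    print(call["id"], fn(**args))
```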
## Requests containing tools in the conversation history

If tool-call information exists in previous turns, you can construct the request as follows:
```bash
curl -X POST "http://0.0.0.0:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello, what is the weather in Beijing?"
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": {
                "location": "Beijing",
                "unit": "c"
              }
            }
          }
        ],
        "thoughts": "User needs to check today'"'"'s weather in Beijing."
      },
      {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": {
          "type": "text",
          "text": "{\"location\": \"北京\",\"temperature\": \"23\",\"weather\": \"sunny\",\"unit\": \"c\"}"
        }
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Determine weather in my location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": ["c", "f"]
              }
            },
            "additionalProperties": false,
            "required": ["location", "unit"]
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```
The parsed model output is as follows, containing the reasoning (`reasoning_content`) and the response content (`content`), with `finish_reason` set to `stop`:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Today's weather in Beijing is sunny with a temperature of 23 degrees Celsius.",
        "reasoning_content": "User wants to ...",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ]
}
```
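Putting both requests together, a typical client loop executes the returned tool call and feeds the result back as a `tool` message before asking the model again. A minimal sketch with the OpenAI SDK, under the same assumptions as the earlier SDK example (`get_weather` is a stand-in for a real lookup):

```python
import json
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8000/v1")

def get_weather(location: str, unit: str) -> str:
    # Stand-in for a real weather lookup.
    return json.dumps({"location": location, "temperature": "23",
                       "weather": "sunny", "unit": unit})

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Determine weather in my location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["location", "unit"],
        },
    },
}]

messages = [{"role": "user", "content": "Hello, what is the weather in Beijing?"}]
response = client.chat.completions.create(
    model="ERNIE-4.5-21B-A3B", messages=messages, tools=tools)
choice = response.choices[0]

if choice.finish_reason == "tool_calls":
    # Keep the assistant turn (including its tool_calls) in the history.
    messages.append(choice.message.model_dump(exclude_none=True))
    for call in choice.message.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": get_weather(**args),
        })
    # Second round trip: the model now answers from the tool result.
    response = client.chat.completions.create(
        model="ERNIE-4.5-21B-A3B", messages=messages, tools=tools)

print(response.choices[0].message.content)
```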
## Writing a Custom Tool Parser

FastDeploy supports custom tool parser plugins. You can refer to the existing parsers under `fastdeploy/entrypoints/openai/tool_parser` when creating your own.

A custom parser should implement:
```python
# import the required packages
# register the tool parser with ToolParserManager
@ToolParserManager.register_module("my-parser")
class MyToolParser(ToolParser):  # subclass the base class from the tool_parser directory
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    # implement tool-call parsing for non-streaming calls
    def extract_tool_calls(self, model_output: str, request: ChatCompletionRequest) -> ExtractedToolCallInformation:
        return ExtractedToolCallInformation(tools_called=False, tool_calls=[], content=model_output)

    # implement tool-call parsing for streaming calls
    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
        request: ChatCompletionRequest,
    ) -> DeltaMessage | None:
        # return a DeltaMessage built from delta_text (or None if nothing to emit)
        return delta
```
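As a concrete illustration, here is how the non-streaming hook might look for a hypothetical model that wraps each call in `<tool_call>...</tool_call>` tags. The tag format is invented for this example, and the `ToolCall`/`FunctionCall` result types are assumptions about the parser module's schema, not a confirmed FastDeploy API:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(self, model_output, request):
    matches = TOOL_CALL_RE.findall(model_output)
    if not matches:
        # No tool call found: return the text unchanged as normal content.
        return ExtractedToolCallInformation(
            tools_called=False, tool_calls=[], content=model_output)

    tool_calls = []
    for raw in matches:
        payload = json.loads(raw)  # e.g. {"name": "get_weather", "arguments": {...}}
        tool_calls.append(ToolCall(  # assumed schema type, see lead-in
            type="function",
            function=FunctionCall(
                name=payload["name"],
                arguments=json.dumps(payload.get("arguments", {})),
            ),
        ))

    # Anything outside the tags is surfaced as regular content.
    content = TOOL_CALL_RE.sub("", model_output).strip() or None
    return ExtractedToolCallInformation(
        tools_called=True, tool_calls=tool_calls, content=content)
```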
Enable it via:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model path> \
    --tool-parser-plugin <absolute path of the plugin file> \
    --tool-call-parser my-parser
```
---

docs/parameters.md

Lines changed: 39 additions & 0 deletions
@@ -58,6 +58,7 @@ When using FastDeploy to deploy models (including offline inference and service

| ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight acceleration, "default_v1" must be used. |
| ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable). |
| ```max_processor_cache``` | `int` | Maximum size of the processor cache in GiB (use 0 to disable). |
| ```api_key``` | `list[str]` | API keys validated against the service request headers; multiple keys are supported. |

## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?
@@ -82,3 +83,41 @@ In actual inference, it's difficult for users to know how to properly configure

When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the prefill-phase batch token count (limiting the single-prefill token count), so the `max_num_partial_prefills` parameter is introduced specifically to limit the number of concurrently processed partial batches.

To optimize scheduling priority for short requests, the new `max_long_partial_prefills` and `long_prefill_token_threshold` parameter combination is added. The former limits the number of long requests in a single prefill batch, the latter defines the token threshold for long requests. The system prioritizes batch space for short requests, thereby reducing short-request latency in mixed workload scenarios while maintaining stable throughput.
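For example, a launch that allows up to 4 concurrent partial prefills, at most 1 of them long, and treats requests over 8192 tokens as long might look as follows (flag spellings are assumed from the parameter names above; values are illustrative):

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model path> \
    --enable-chunked-prefill \
    --max-num-partial-prefills 4 \
    --max-long-partial-prefills 1 \
    --long-prefill-token-threshold 8192
```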
## 4. ```api_key``` parameter description

Multiple keys can be configured at startup by repeating the flag; values set here take precedence over the environment variable:
```bash
--api-key "key1"
--api-key "key2"
```
Multiple keys can also be configured through an environment variable, separated by commas:
```bash
export FD_API_KEY="key1,key2"
```

When making requests with curl, add the authorization header. A request is accepted if its key matches any configured `api_key`.
```bash
curl -X POST "http://0.0.0.0:8265/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer key1" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "stream": false,
    "return_token_ids": true,
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```
The server parses the `Authorization: Bearer` header and validates `key1` against the configured keys.
When using the openai SDK, pass the key through the `api_key` parameter:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="http://localhost:8000/v1"
)
```
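The SDK then attaches the key as a `Bearer` token on every request, for example (the model name is a placeholder):

```python
# The SDK sends "Authorization: Bearer your-api-key-here" automatically.
response = client.chat.completions.create(
    model="ERNIE-4.5-21B-A3B",  # placeholder; use your deployed model
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```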
