docs/en/inference/turbomind_config.md
+12 -3
@@ -2,11 +2,11 @@
TurboMind is one of the inference engines of LMDeploy. To run inference with it, you need to convert the input model into a TurboMind model. Besides the model weight files, the TurboMind model folder also contains some other files, the most important of which is the configuration file `triton_models/weights/config.ini`, which is closely related to inference performance.
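If you want to inspect these settings programmatically, `config.ini` is a plain INI file and can be read with Python's standard `configparser`. The sketch below is only an illustration; the `workspace` path is an assumption about where the converted model lives, not something mandated by LMDeploy.

```python
# Illustrative only: read the TurboMind config of a converted model.
# The "workspace" path is an assumption; use the folder you converted the model into.
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("workspace/triton_models/weights/config.ini")
print(dict(cfg["llama"]))  # the inference-related keys live in the [llama] section
```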
-If you are using LMDeploy version 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section to learn the relevant content in the configuration. Otherwise, please read [turbomind 2.0 config](#turbomind-20-config) to familiarize yourself with the configuration details.
+If you are using LMDeploy version 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section to learn the relevant content in the configuration. Otherwise, please read [turbomind 2.x config](#turbomind-2x-config) to familiarize yourself with the configuration details.
## TurboMind 2.x config
-Take the `llama-2-7b-chat` model as an example. In TurboMind 2.0, its config.ini content is as follows:
+Take the `llama-2-7b-chat` model as an example. In TurboMind 2.x, its config.ini content is as follows:
```toml
[llama]
@@ -33,6 +33,7 @@ step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
+enable_prefix_caching = False
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
@@ -74,7 +75,7 @@ The maximum batch size is still set through `max_batch_size`. But its default va
k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.
-TurboMind 2.0 has implemented Paged Attention, managing the k/v cache in blocks.
+TurboMind 2.x has implemented Paged Attention, managing the k/v cache in blocks.
`cache_block_seq_len` represents the length of the token sequence in a k/v block, with a default value of 128. TurboMind calculates the memory size of a k/v block according to the following formula:
@@ -96,6 +97,14 @@ The `cache_chunk_size` indicates the size of the k/v cache chunk to be allocated
- When the value is -1, `cache_max_entry_count` number of k/v cache blocks are allocated.
- When the value is 0, `sqrt(cache_max_entry_count)` number of k/v cache blocks are allocated.
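The allocation rules above can be summarized in a short sketch. This is not LMDeploy source code, just an illustration of the stated rules; the branch for positive values assumes the value is used directly as the chunk size.

```python
import math

def blocks_allocated_per_chunk(cache_chunk_size, cache_max_entry_count):
    # Illustration of the cache_chunk_size rules described above (not LMDeploy code).
    if cache_chunk_size == -1:
        return cache_max_entry_count                  # allocate all blocks up front
    if cache_chunk_size == 0:
        return int(math.sqrt(cache_max_entry_count))  # allocate sqrt(cache_max_entry_count) blocks
    return cache_chunk_size                           # assumption: a positive value is used as-is
```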
+### prefix caching switch
+
+The prefix caching feature is controlled by the `enable_prefix_caching` parameter: `True` enables it and `False` disables it. The default value is `False`.
+
+Prefix caching mainly benefits scenarios in which multiple requests share the same prompt prefix (such as a system prompt). The k/v blocks of the shared prefix are cached and reused across requests, avoiding redundant computation and improving inference performance. The longer the shared prefix, the greater the improvement.
+
+Since a k/v block is the smallest unit of reuse in prefix caching, a shared prefix shorter than one block (prefix length \< `cache_block_seq_len`) yields no performance improvement.
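Because reuse happens at block granularity, the effect can be estimated with a tiny sketch (illustrative only, not LMDeploy code): only the whole k/v blocks fully covered by the shared prefix are reused.

```python
def reusable_prefix_blocks(shared_prefix_len, cache_block_seq_len=128):
    # Only complete k/v blocks covered by the shared prefix can be served from the cache.
    return shared_prefix_len // cache_block_seq_len

print(reusable_prefix_blocks(100))  # 0 -> prefix shorter than one block, no benefit
print(reusable_prefix_blocks(300))  # 2 -> the first 256 prefix tokens are reused
```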
### kv quantization and inference switch
- `quant_policy=4` means 4bit k/v quantization and inference