
Commit b15ec5f

Update doc for prefix caching (InternLM#1597)
* update turbomind config
* introduce prefix cache
* rollback index
1 parent fbffc31 commit b15ec5f

2 files changed: +27 -9 lines changed


docs/en/inference/turbomind_config.md (+12 -3)
@@ -2,11 +2,11 @@

TurboMind is one of the inference engines of LMDeploy. When using it to do model inference, you need to convert the input model into a TurboMind model. In the TurboMind model folder, besides model weight files, the TurboMind model also includes some other files, among which the most important is the configuration file `triton_models/weights/config.ini` that is closely related to inference performance.

-If you are using LMDeploy version 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section to learn the relevant content in the configuration. Otherwise, please read [turbomind 2.0 config](#turbomind-20-config) to familiarize yourself with the configuration details.
+If you are using LMDeploy version 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section to learn the relevant content in the configuration. Otherwise, please read [turbomind 2.0 config](#turbomind-2x-config) to familiarize yourself with the configuration details.

## TurboMind 2.x config

-Take the `llama-2-7b-chat` model as an example. In TurboMind 2.0, its config.ini content is as follows:
+Take the `llama-2-7b-chat` model as an example. In TurboMind 2.x, its config.ini content is as follows:

```toml
[llama]
@@ -33,6 +33,7 @@ step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
+enable_prefix_caching = False
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
@@ -74,7 +75,7 @@ The maximum batch size is still set through `max_batch_size`. But its default va

k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.

-TurboMind 2.0 has implemented Paged Attention, managing the k/v cache in blocks.
+TurboMind 2.x has implemented Paged Attention, managing the k/v cache in blocks.

`cache_block_seq_len` represents the length of the token sequence in a k/v block with a default value 128. TurboMind calculates the memory size of the k/v block according to the following formula:
@@ -96,6 +97,14 @@ The `cache_chunk_size` indicates the size of the k/v cache chunk to be allocated
- When the value is -1, `cache_max_entry_count` number of k/v cache blocks are allocated.
- When the value is 0, `sqrt(cache_max_entry_count)` number of k/v cache blocks are allocated.

+### prefix caching switch
+
+The prefix caching feature is controlled by the `enable_prefix_caching` parameter: `True` enables it and `False` disables it. The default value is `False`.
+
+Prefix caching is mainly useful when multiple requests share the same prompt prefix (such as a system prompt). The k/v blocks of that shared prefix are cached and reused across requests, saving redundant computation and improving inference performance. The longer the shared prefix, the greater the performance improvement.
+
+Since a k/v block is the smallest granularity of reuse in prefix caching, a shared prompt prefix shorter than one block (prefix length \< `cache_block_seq_len`) yields no performance improvement.
+
### kv quantization and inference switch

- `quant_policy=4` means 4bit k/v quantization and inference
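
One way to turn the new switch on in an already converted model is to edit `config.ini` directly; below is a minimal sketch using Python's standard `configparser`, assuming the `triton_models/weights/config.ini` layout shown in the doc above (the path is illustrative and should point at your own TurboMind model folder).

```python
import configparser

# Assumed path inside a converted TurboMind model folder; adjust to your own layout.
cfg_path = "triton_models/weights/config.ini"

parser = configparser.ConfigParser()
parser.read(cfg_path)

# The [llama] section and the key match the sample config.ini in the diff above.
parser["llama"]["enable_prefix_caching"] = "True"

# Note: configparser rewrites the whole file with its own formatting.
with open(cfg_path, "w") as f:
    parser.write(f)
```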

docs/zh_cn/inference/turbomind_config.md (+15 -6)
@@ -2,11 +2,11 @@

TurboMind is LMDeploy's inference engine. When using it to run inference on an LLM, the input model needs to be converted into a TurboMind model. Besides the model weights, the TurboMind model folder contains some other files, the most important of which is the configuration file `triton_models/weights/config.ini`, which is closely tied to inference performance.

-If you are using LMDeploy 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-配置) section to learn about the relevant configuration. If you are using LMDeploy 0.1.x, please read [turbomind 2.0 config](#turbomind-20-配置) for the configuration details.
+If you are using LMDeploy 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-配置) section to learn about the relevant configuration. If you are using LMDeploy 0.1.x, please read [turbomind 2.x config](#turbomind-2x-配置) for the configuration details.

-## TurboMind 2.0 config
+## TurboMind 2.x config

-Take the `llama-2-7b-chat` model as an example. In TurboMind 2.0, its `config.ini` content is as follows:
+Take the `llama-2-7b-chat` model as an example. In TurboMind 2.x, its `config.ini` content is as follows:

```toml
[llama]
@@ -33,6 +33,7 @@ step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
+enable_prefix_caching = False
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
@@ -57,7 +58,7 @@ rope_theta = 10000.0
size_per_head = 128
```

-Compared with the TurboMind 1.0 config, the model attribute section of the TurboMind 2.0 config is the same as in 1.0, but the inference parameters have changed.
+Compared with the TurboMind 1.0 config, the model attribute section of the TurboMind 2.x config is the same as in 1.0, but the inference parameters have changed.

In the following sections, we focus on the inference parameters.
@@ -70,13 +71,13 @@ size_per_head = 128
### batch size

The maximum batch size is still set through `max_batch_size`. The default value has been changed from 32 to 64.
-In TurboMind 2.0, `max_batch_size` is unrelated to `cache_max_entry_count`.
+In TurboMind 2.x, `max_batch_size` is unrelated to `cache_max_entry_count`.

### k/v cache size

`cache_block_seq_len` and `cache_max_entry_count` are used to tune the memory size of the k/v cache.

-TurboMind 2.0 implements Paged Attention and manages the k/v cache in blocks.
+TurboMind 2.x implements Paged Attention and manages the k/v cache in blocks.

`cache_block_seq_len` is the length of the token sequence that one k/v block can hold, with a default of 128. TurboMind computes the memory size of a k/v block with the following formula:
@@ -98,6 +99,14 @@ cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type
- When the value is -1, `cache_max_entry_count` k/v cache blocks are allocated.
- When the value is 0, `sqrt(cache_max_entry_count)` k/v cache blocks are allocated.

+### prefix caching switch
+
+`enable_prefix_caching` is the switch for the prefix caching feature: `True` enables it, `False` disables it, and the default is `False`.
+
+Prefix caching is mainly useful when multiple requests share the same prompt prefix (such as a system prompt). The k/v blocks of that shared prefix are cached and reused across requests, saving redundant computation and improving inference performance. The longer the shared prompt prefix, the greater the performance improvement.
+
+Since a k/v block is the smallest granularity of k/v reuse in prefix caching, a shared prompt prefix shorter than one block (prefix length \< `cache_block_seq_len`) yields no performance improvement.
+
### kv quantization and inference switch

`quant_policy` is the switch for kv quantization and inference.
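
To make the numbers concrete, here is a minimal sketch that plugs `llama-2-7b-chat`-style values into the k/v block-size formula quoted in the hunk header above, and applies the block-granularity rule for prefix caching; all numeric values are assumptions for illustration, not read from a real `config.ini`.

```python
# Assumed values in the style of the llama-2-7b-chat config.ini shown above.
cache_block_seq_len = 128   # tokens per k/v block
num_layer = 32
kv_head_num = 32
size_per_head = 128
kv_data_bytes = 2           # fp16/bf16 k/v; kv quantization would shrink this

# cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
block_bytes = (cache_block_seq_len * num_layer * kv_head_num
               * size_per_head * 2 * kv_data_bytes)
print(f"one k/v block: {block_bytes / 2**20:.0f} MiB")  # 64 MiB with these numbers

# Prefix caching reuses whole blocks, so a shared prefix shorter than one
# block (prefix length < cache_block_seq_len) brings no speedup.
shared_prefix_len = 100  # tokens of an assumed shared system prompt
reusable_blocks = shared_prefix_len // cache_block_seq_len
print(f"blocks reusable for a {shared_prefix_len}-token prefix: {reusable_blocks}")
```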
