Commit 89c2aaf
[Doc]: Update docs for internlm2.5 (InternLM#1887)
* update docs to internlm2.5
* update long context
* update
* update
* update kv cache docs
* fix

Co-authored-by: lvhan028 <[email protected]>
1 parent 1f5dd4e commit 89c2aaf

22 files changed: +156 -110 lines

README.md

+1

@@ -108,6 +108,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>Llama3 (8B, 70B)</li>
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
+<li>InternLM2.5 (7B)</li>
 <li>QWen (1.8B - 72B)</li>
 <li>QWen1.5 (0.5B - 110B)</li>
 <li>QWen1.5 - MoE (0.5B - 72B)</li>

README_zh-CN.md

+1

@@ -109,6 +109,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力，在各种规模的模型
 <li>Llama3 (8B, 70B)</li>
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
+<li>InternLM2.5 (7B)</li>
 <li>QWen (1.8B - 72B)</li>
 <li>QWen1.5 (0.5B - 110B)</li>
 <li>QWen1.5 - MoE (0.5B - 72B)</li>

docs/en/advance/chat_template.md

+3 -3

@@ -36,14 +36,14 @@ LMDeploy supports two methods of adding chat templates:
 When using the CLI tool, you can pass in a custom chat template with `--chat-template`, for example.
 
 ```shell
-lmdeploy serve api_server internlm/internlm2-chat-7b --chat-template ${JSON_FILE}
+lmdeploy serve api_server internlm/internlm2_5-7b-chat --chat-template ${JSON_FILE}
 ```
 
 You can also pass it in through the interface function, for example.
 
 ```python
 from lmdeploy import ChatTemplateConfig, serve
-serve('internlm/internlm2-chat-7b',
+serve('internlm/internlm2_5-7b-chat',
       chat_template_config=ChatTemplateConfig.from_json('${JSON_FILE}'))
 ```
 
@@ -81,7 +81,7 @@ LMDeploy supports two methods of adding chat templates:
 from lmdeploy import ChatTemplateConfig, pipeline
 
 messages = [{'role': 'user', 'content': 'who are you?'}]
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 chat_template_config=ChatTemplateConfig('customized_model'))
 for response in pipe.stream_infer(messages):
     print(response.text, end='')

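For readers following the `--chat-template ${JSON_FILE}` examples above, here is a minimal sketch (not part of the commit) of generating such a file from Python. The field names are assumptions for illustration; the authoritative key list is documented earlier in `chat_template.md`.

```python
# Illustrative sketch: write a minimal custom chat-template JSON for
# `--chat-template` / `ChatTemplateConfig.from_json`. The keys below are
# assumed for illustration; consult chat_template.md for the real schema.
import json

template = {
    'model_name': 'customized_model',   # name later referenced by ChatTemplateConfig('customized_model')
    'system': '<|im_start|>system\n',
    'user': '<|im_start|>user\n',
    'assistant': '<|im_start|>assistant\n',
    'stop_words': ['<|im_end|>'],
}

with open('chat_template.json', 'w') as f:
    json.dump(template, f, indent=2, ensure_ascii=False)
print('wrote chat_template.json')
```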
docs/en/advance/long_context.md

+33 -13

@@ -6,13 +6,18 @@ Long text extrapolation refers to the ability of LLM to handle data longer than
 
 You can enable the context length extrapolation ability by modifying the TurbomindEngineConfig. Edit the `session_len` to the expected length and change `rope_scaling_factor` to a number no less than 1.0.
 
-Here is an example:
+Take `internlm2_5-7b-chat-1m` as an example, which supports a context length of up to **1 million tokens**:
 
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=160000)
-pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=1000000,
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4)
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
@@ -34,19 +39,26 @@ You can try the following code to test how many times LMDeploy can retrieval the
 import numpy as np
 from lmdeploy import pipeline
 from lmdeploy import TurbomindEngineConfig
+import time
 
-session_len = 160000
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=session_len)
-pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
+session_len = 1000000
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=session_len,
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4)
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 
 
-def passkey_retrival(session_len, n_round=5):
+def passkey_retrieval(session_len, n_round=5):
     # create long context input
     tok = pipe.tokenizer
     task_description = 'There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.'
     garbage = 'The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.'
 
     for _ in range(n_round):
+        start = time.perf_counter()
         n_times = (session_len - 1000) // len(tok.encode(garbage))
         n_garbage_prefix = np.random.randint(0, n_times)
         n_garbage_suffix = n_times - n_garbage_prefix
@@ -67,11 +79,14 @@ def passkey_retrival(session_len, n_round=5):
         prompt = ' '.join(lines)
         response = pipe([prompt])
         print(pass_key, response)
+        end = time.perf_counter()
+        print(f'duration: {end - start} s')
 
-
-passkey_retrival(session_len, 5)
+passkey_retrieval(session_len, 5)
 ```
 
+This test takes approximately 364 seconds per round when conducted on A100-80G GPUs.
+
 ### Needle In A Haystack
 
 [OpenCompass](https://github.com/open-compass/opencompass) offers very useful tools to perform needle-in-a-haystack evaluation. For specific instructions, please refer to the [guide](https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/needleinahaystack_eval.md).
@@ -86,14 +101,19 @@ from lmdeploy import TurbomindEngineConfig, pipeline
 import numpy as np
 
 # load model and tokenizer
-model_repoid_or_path = 'internlm/internlm2-chat-7b'
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=160000)
+model_repoid_or_path = 'internlm/internlm2_5-7b-chat-1m'
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=1000000,
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4)
 pipe = pipeline(model_repoid_or_path, backend_config=backend_config)
 tokenizer = AutoTokenizer.from_pretrained(model_repoid_or_path, trust_remote_code=True)
 
 # get perplexity
 text = 'Use a long prompt to replace this sentence'
 input_ids = tokenizer.encode(text)
-loss = pipe.get_ppl(input_ids)[0]
-ppl = np.exp(loss)
+ppl = pipe.get_ppl(input_ids)[0]
+print(ppl)
 ```

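As a side note on the passkey test above, the filler budget for a given `session_len` can be estimated without loading the TurboMind engine. A minimal sketch (not part of the commit), assuming only the Hugging Face tokenizer for the model is available:

```python
# Illustrative sketch: estimate how many filler sentences fit into the target
# context before launching the full 1M-token passkey test above.
from transformers import AutoTokenizer

session_len = 1_000_000
tok = AutoTokenizer.from_pretrained('internlm/internlm2_5-7b-chat-1m',
                                    trust_remote_code=True)
garbage = ('The grass is green. The sky is blue. The sun is yellow. '
           'Here we go. There and back again.')
tokens_per_sentence = len(tok.encode(garbage))
n_times = (session_len - 1000) // tokens_per_sentence  # same headroom as the test above
print(f'{tokens_per_sentence} tokens per filler sentence, about {n_times} repetitions fit')
```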
docs/en/faq.md

+3 -3

@@ -55,7 +55,7 @@ from lmdeploy import pipeline, TurbomindEngineConfig
 
 backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
 
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
 print(response)
@@ -65,10 +65,10 @@ If OOM occurs when you run CLI tools, please pass `--cache-max-entry-count` to d
 
 ```shell
 # chat command
-lmdeploy chat internlm/internlm2-chat-7b --cache-max-entry-count 0.2
+lmdeploy chat internlm/internlm2_5-7b-chat --cache-max-entry-count 0.2
 
 # server command
-lmdeploy serve api_server internlm/internlm2-chat-7b --cache-max-entry-count 0.2
+lmdeploy serve api_server internlm/internlm2_5-7b-chat --cache-max-entry-count 0.2
 ```
 
 ## Serve

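If the reduced kv-cache ratio alone is not enough, it can be combined with a shorter context window; a hedged sketch (illustration only, not part of the commit), using only engine parameters that already appear elsewhere in this commit's docs, with arbitrary example values:

```python
# Illustrative sketch: lower the kv-cache ratio and cap the context window
# together to further reduce GPU memory. Both parameters appear elsewhere in
# this commit's docs; the concrete values here are arbitrary examples.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2,  # 20% of free memory for kv cache
                                       session_len=8192)           # shorter max context
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
print(pipe(['Hi, pls intro yourself']))
```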
docs/en/get_started.md

+3 -3

@@ -22,7 +22,7 @@ pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_V
 
 ```python
 import lmdeploy
-pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
+pipe = lmdeploy.pipeline("internlm/internlm2_5-7b-chat")
 response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```
@@ -52,7 +52,7 @@ LMDeploy CLI offers the following utilities, helping users experience LLM featur
 ### Inference with Command line Interface
 
 ```shell
-lmdeploy chat internlm/internlm2-chat-7b
+lmdeploy chat internlm/internlm2_5-7b-chat
 ```
 
 ### Serving with Web UI
@@ -63,7 +63,7 @@ LMDeploy adopts gradio to develop the online demo.
 # install dependencies
 pip install lmdeploy[serve]
 # launch gradio server
-lmdeploy serve gradio internlm/internlm2-chat-7b
+lmdeploy serve gradio internlm/internlm2_5-7b-chat
 ```
 
 ![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

docs/en/inference/pipeline.md

+8 -8

@@ -11,7 +11,7 @@ You can overview the detailed pipeline API in [this](https://lmdeploy.readthedoc
 ```python
 from lmdeploy import pipeline
 
-pipe = pipeline('internlm/internlm2-chat-7b')
+pipe = pipeline('internlm/internlm2_5-7b-chat')
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
 print(response)
 ```
@@ -30,7 +30,7 @@ There have been alterations to the strategy for setting the k/v cache ratio thro
 # decrease the ratio of the k/v cache occupation to 20%
 backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
 
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
 print(response)
@@ -46,7 +46,7 @@ There have been alterations to the strategy for setting the k/v cache ratio thro
 from lmdeploy import pipeline, TurbomindEngineConfig
 
 backend_config = TurbomindEngineConfig(tp=2)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
 print(response)
@@ -62,7 +62,7 @@ gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
                               temperature=0.8,
                               max_new_tokens=1024)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                 gen_config=gen_config)
@@ -79,7 +79,7 @@ gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
                               temperature=0.8,
                               max_new_tokens=1024)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 prompts = [[{
     'role': 'user',
@@ -103,7 +103,7 @@ gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
                               temperature=0.8,
                               max_new_tokens=1024)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 prompts = [[{
     'role': 'user',
@@ -121,7 +121,7 @@ for item in pipe.stream_infer(prompts, gen_config=gen_config):
 ```python
 from transformers import AutoTokenizer
 from lmdeploy import pipeline
-model_repoid_or_path='internlm/internlm2-chat-7b'
+model_repoid_or_path='internlm/internlm2_5-7b-chat'
 pipe = pipeline(model_repoid_or_path)
 tokenizer = AutoTokenizer.from_pretrained(model_repoid_or_path, trust_remote_code=True)
 
@@ -150,7 +150,7 @@ gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
                               temperature=0.8,
                               max_new_tokens=1024)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 prompts = [[{
     'role': 'user',

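The `prompts = [[{ ...` snippets above are cut off by the diff context; a complete version of that pattern, streamed with the renamed 2.5 model, might look like the following sketch (illustration only, not part of the commit):

```python
# Illustrative sketch: the OpenAI-style message format used by the truncated
# `prompts = [[{...` snippets above, streamed with pipe.stream_infer.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
prompts = [[{'role': 'user', 'content': 'Hi, pls intro yourself'}],
           [{'role': 'user', 'content': 'Shanghai is'}]]
for item in pipe.stream_infer(prompts, gen_config=gen_config):
    print(item.text, end='')
```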
docs/en/quantization/kv_quant.md

+11 -11

@@ -38,30 +38,30 @@ Applying kv quantization and inference via LMDeploy is quite straightforward. Si
 ```python
 from lmdeploy import pipeline, TurbomindEngineConfig
 engine_config = TurbomindEngineConfig(quant_policy=8)
-pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
+pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)
 response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```
 
 ### Serving
 
 ```shell
-lmdeploy serve api_server internlm/internlm2-chat-7b --quant-policy 8
+lmdeploy serve api_server internlm/internlm2_5-7b-chat --quant-policy 8
 ```
 
 ## Evaluation
 
 We apply kv quantization of LMDeploy to several LLM models and utilize OpenCompass to evaluate the inference accuracy. The results are shown in the table below:
 
-| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | qwen1.5-7b-chat | - | - |
-| ----------- | ------- | ------------- | -------------- | ------- | ------- | ----------------- | ------- | ------- | --------------- | ------- | ------- |
-| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | fp16 | kv int8 | kv int4 |
-| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
-| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
-| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
-| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
-| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
-| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |
+| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | internlm2.5-chat-7b | - | - | qwen1.5-7b-chat | - | - |
+| ----------- | ------- | ------------- | -------------- | ------- | ------- | ----------------- | ------- | ------- | ------------------- | ------- | ------- | --------------- | ------- | ------- |
+| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | fp16 | kv int8 | kv int4 |
+| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 78.06 | 77.87 | 77.05 | 70.56 | 70.49 | 68.62 |
+| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 72.30 | 72.27 | 71.17 | 61.48 | 61.56 | 60.65 |
+| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 65.09 | 64.87 | 63.28 | 44.62 | 44.77 | 44.04 |
+| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 85.67 | 85.44 | 83.78 | 54.97 | 56.41 | 54.74 |
+| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 92.76 | 92.83 | 92.55 | 87.33 | 87.26 | 86.28 |
+| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 90.51 | 90.42 | 90.42 | 82.53 | 82.59 | 82.02 |
 
 For detailed evaluation methods, please refer to [this](../benchmark/evaluate_with_opencompass.md) guide. Remember to pass `quant_policy` to the inference engine in the config file.
 

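For the kv int4 columns in the new table, the same API applies with `quant_policy=4` instead of `8`; a minimal sketch (illustration only, not part of the commit), mirroring the int8 example in the diff above:

```python
# Illustrative sketch: run the newly added internlm2.5 model with int4 kv cache,
# mirroring the quant_policy=8 example shown in the diff above.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(quant_policy=4)  # 4 -> kv int4, 8 -> kv int8
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```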
docs/en/quantization/w4a16.md

+8 -8

@@ -33,8 +33,8 @@ This article comprises the following sections:
 A single command execution is all it takes to quantize the model. The resulting quantized weights are then stored in the $WORK_DIR directory.
 
 ```shell
-export HF_MODEL=internlm/internlm2-chat-7b
-export WORK_DIR=internlm/internlm2-chat-7b-4bit
+export HF_MODEL=internlm/internlm2_5-7b-chat
+export WORK_DIR=internlm/internlm2_5-7b-chat-4bit
 
 lmdeploy lite auto_awq \
   $HF_MODEL \
@@ -48,10 +48,10 @@ lmdeploy lite auto_awq \
   --work-dir $WORK_DIR
 ```
 
-Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the [internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) model, the command can be condensed as:
+Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the [internlm/internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) model, the command can be condensed as:
 
 ```shell
-lmdeploy lite auto_awq internlm/internlm2-chat-7b --work-dir internlm2-chat-7b-4bit
+lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir internlm2_5-7b-chat-4bit
 ```
 
 **Note:**
@@ -63,13 +63,13 @@ Upon completing quantization, you can engage with the model efficiently using a
 For example, you can initiate a conversation with it via the command line:
 
 ```shell
-lmdeploy chat ./internlm2-chat-7b-4bit --model-format awq
+lmdeploy chat ./internlm2_5-7b-chat-4bit --model-format awq
 ```
 
 Alternatively, you can start the gradio server and interact with the model through the web at `http://{ip_addr}:{port}`
 
 ```shell
-lmdeploy serve gradio ./internlm2-chat-7b-4bit --server_name {ip_addr} --server_port {port} --model-format awq
+lmdeploy serve gradio ./internlm2_5-7b-chat-4bit --server_name {ip_addr} --server_port {port} --model-format awq
 ```
 
 ## Evaluation
@@ -83,7 +83,7 @@ Trying the following codes, you can perform the batched offline inference with t
 ```python
 from lmdeploy import pipeline, TurbomindEngineConfig
 engine_config = TurbomindEngineConfig(model_format='awq')
-pipe = pipeline("./internlm2-chat-7b-4bit", backend_config=engine_config)
+pipe = pipeline("./internlm2_5-7b-chat-4bit", backend_config=engine_config)
 response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```
@@ -115,7 +115,7 @@ print(response)
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
 
 ```shell
-lmdeploy serve api_server ./internlm2-chat-7b-4bit --backend turbomind --model-format awq
+lmdeploy serve api_server ./internlm2_5-7b-chat-4bit --backend turbomind --model-format awq
 ```
 
 The default port of `api_server` is `23333`. After the server is launched, you can communicate with the server in the terminal through `api_client`:

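Since the doc notes that the api_server's RESTful APIs are OpenAI-compatible, a client-side smoke test can be written with the `openai` SDK. A hedged sketch (not part of the commit); the served model id is assumed here to equal the path passed to `api_server` and should be confirmed via `GET /v1/models`:

```python
# Illustrative client for the server launched above. Assumes the default port
# 23333 and that the served model id matches the launch path (check /v1/models).
from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
resp = client.chat.completions.create(
    model='./internlm2_5-7b-chat-4bit',
    messages=[{'role': 'user', 'content': 'Hi, pls intro yourself'}],
    temperature=0.8,
    top_p=0.8)
print(resp.choices[0].message.content)
```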
docs/en/serving/api_server.md

+4 -4

@@ -11,12 +11,12 @@ Finally, we showcase how to integrate the service into a WebUI, providing you wi
 
 ## Launch Service
 
-Take the [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) model hosted on huggingface hub as an example, you can choose one the following methods to start the service.
+Take the [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) model hosted on huggingface hub as an example. You can choose one of the following methods to start the service.
 
 ### Option 1: Launching with lmdeploy CLI
 
 ```shell
-lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333
+lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333
 ```
 
 The arguments of `api_server` can be viewed through the command `lmdeploy serve api_server -h`, for instance, `--tp` to set tensor parallelism, `--session-len` to specify the max length of the context window, `--cache-max-entry-count` to adjust the GPU mem ratio for k/v cache etc.
@@ -32,14 +32,14 @@ docker run --runtime nvidia --gpus all \
     -p 23333:23333 \
     --ipc=host \
     openmmlab/lmdeploy:latest \
-    lmdeploy serve api_server internlm/internlm2-chat-7b
+    lmdeploy serve api_server internlm/internlm2_5-7b-chat
 ```
 
 The parameters of `api_server` are the same as those mentioned in the "[option 1](#option-1-launching-with-lmdeploy-cli)" section.
 
 ### Option 3: Deploying to Kubernetes cluster
 
-Connect to a running Kubernetes cluster and deploy the internlm2-chat-7b model service with [kubectl](https://kubernetes.io/docs/reference/kubectl/) command-line tool (replace `<your token>` with your huggingface hub token):
+Connect to a running Kubernetes cluster and deploy the internlm2_5-7b-chat model service with the [kubectl](https://kubernetes.io/docs/reference/kubectl/) command-line tool (replace `<your token>` with your huggingface hub token):
 
 ```shell
 sed 's/{{HUGGING_FACE_HUB_TOKEN}}/<your token>/' k8s/deployment.yaml | kubectl create -f - \

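As a quick sanity check once any of the launch options above is running, the OpenAI-compatible model listing endpoint can be queried; a minimal sketch (illustration only, not part of the commit), assuming the default port 23333 and the standard OpenAI response shape:

```python
# Illustrative sketch: list the models served by the api_server started above
# via the OpenAI-compatible /v1/models endpoint (default port 23333 assumed).
import requests

resp = requests.get('http://0.0.0.0:23333/v1/models', timeout=10)
resp.raise_for_status()
print([model['id'] for model in resp.json()['data']])
```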