**docs/en/advance/long_context.md** (+33-13)
@@ -6,13 +6,18 @@ Long text extrapolation refers to the ability of LLM to handle data longer than
You can enable the context length extrapolation ability by modifying the TurbomindEngineConfig. Set `session_len` to the expected length and change `rope_scaling_factor` to a number no less than 1.0.
-Here is an example:
+Take `internlm2_5-7b-chat-1m` as an example, which supports a context length of up to **1 million tokens**:
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
task_description = 'There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.'
garbage = 'The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.'
```

This test takes approximately 364 seconds per round when conducted on A100-80G GPUs.
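
For context, here is a minimal sketch of how these pieces are typically combined into a pipeline. The engine settings (`session_len`, `rope_scaling_factor`, `tp`) and the generation values below are illustrative assumptions for a 1M-token setup, not values taken from this diff; adjust them to your hardware.

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# Illustrative engine settings (assumed values): extend the context window and
# enable RoPE scaling so the model can extrapolate beyond its training length.
backend_config = TurbomindEngineConfig(
    session_len=1_000_000,    # expected maximum context length in tokens
    rope_scaling_factor=2.5,  # any value >= 1.0 enables extrapolation
    tp=4,                     # tensor parallelism; a 1M-token context needs several GPUs
)

pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)

# Replace this placeholder with a genuinely long prompt, e.g. the passkey-retrieval
# text assembled from `task_description` and repeated `garbage` sentences above.
prompt = 'Use a long prompt to replace this sentence.'
gen_config = GenerationConfig(max_new_tokens=1024)
response = pipe(prompt, gen_config=gen_config)
print(response.text)
```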
### Needle In A Haystack
[OpenCompass](https://github.com/open-compass/opencompass) offers very useful tools to perform needle-in-a-haystack evaluation. For specific instructions, please refer to the [guide](https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/needleinahaystack_eval.md).
@@ -86,14 +101,19 @@ from lmdeploy import TurbomindEngineConfig, pipeline
We apply LMDeploy's KV cache quantization to several LLM models and use OpenCompass to evaluate the inference accuracy. The results are shown in the table below:
For detailed evaluation methods, please refer to [this](../benchmark/evaluate_with_opencompass.md) guide. Remember to pass `quant_policy` to the inference engine in the config file.
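
As a hedged illustration of that last point (this snippet is not part of the diff), the quantization policy is passed through the engine config; the value `8` below assumes a recent LMDeploy release in which `quant_policy=8` selects 8-bit KV cache (and `4` selects 4-bit). Verify the meaning against the version you evaluate.

```python
from lmdeploy import TurbomindEngineConfig, pipeline

# Assumed example: enable 8-bit KV cache quantization via quant_policy.
backend_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
print(pipe('Hi, please introduce yourself').text)
```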
**docs/en/quantization/w4a16.md** (+8-8)
@@ -33,8 +33,8 @@ This article comprises the following sections:
A single command execution is all it takes to quantize the model. The resulting quantized weights are then stored in the $WORK_DIR directory.
```shell
-export HF_MODEL=internlm/internlm2-chat-7b
-export WORK_DIR=internlm/internlm2-chat-7b-4bit
+export HF_MODEL=internlm/internlm2_5-7b-chat
+export WORK_DIR=internlm/internlm2_5-7b-chat-4bit

lmdeploy lite auto_awq \
    $HF_MODEL \
@@ -48,10 +48,10 @@ lmdeploy lite auto_awq \
    --work-dir $WORK_DIR
```
-Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the [internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) model, the command can be condensed as:
+Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the [internlm/internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) model, the command can be condensed as:
```shell
-lmdeploy lite auto_awq internlm/internlm2-chat-7b --work-dir internlm2-chat-7b-4bit
+lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir internlm2_5-7b-chat-4bit
```
**Note:**
@@ -63,13 +63,13 @@ Upon completing quantization, you can engage with the model efficiently using a
For example, you can initiate a conversation with it via the command line:
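
The actual chat command is truncated in this diff. A plausible invocation, assuming the quantized weights were written to `./internlm2_5-7b-chat-4bit` as in the condensed command above, is:

```shell
# Assumed example: chat with the AWQ-quantized weights produced by auto_awq.
lmdeploy chat ./internlm2_5-7b-chat-4bit --model-format awq
```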
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
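
The startup command itself is likewise omitted from this diff; a hedged sketch, again assuming the `./internlm2_5-7b-chat-4bit` output directory, might be:

```shell
# Assumed example: expose the quantized model through an OpenAI-compatible RESTful API.
lmdeploy serve api_server ./internlm2_5-7b-chat-4bit --model-format awq --server-port 23333
```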
**docs/en/serving/api_server.md** (+4-4)
@@ -11,12 +11,12 @@ Finally, we showcase how to integrate the service into a WebUI, providing you wi
## Launch Service
-Take the [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) model hosted on huggingface hub as an example, you can choose one the following methods to start the service.
+Take the [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) model hosted on huggingface hub as an example; you can choose one of the following methods to start the service.
The arguments of `api_server` can be viewed with the command `lmdeploy serve api_server -h`: for instance, `--tp` sets the degree of tensor parallelism, `--session-len` specifies the maximum length of the context window, and `--cache-max-entry-count` adjusts the ratio of GPU memory reserved for the k/v cache, etc.
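
As a hedged illustration (the concrete command does not appear in this diff; the port and `--tp` values are assumptions), a minimal launch could look like:

```shell
# Assumed example: serve internlm2_5-7b-chat on one GPU, listening on port 23333.
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333 --tp 1
```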
@@ -32,14 +32,14 @@ docker run --runtime nvidia --gpus all \
The parameters of `api_server` are the same as those mentioned in the "[option 1](#option-1-launching-with-lmdeploy-cli)" section.
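
Only the first line of the docker invocation appears in the hunk header above. A sketch of the full command, with the image tag, cache mount, and port mapping as assumptions rather than values from this diff:

```shell
# Assumed example: run api_server inside the LMDeploy container image.
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<your token>" \
    -p 23333:23333 \
    --ipc=host \
    openmmlab/lmdeploy:latest \
    lmdeploy serve api_server internlm/internlm2_5-7b-chat
```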
### Option 3: Deploying to Kubernetes cluster
-Connect to a running Kubernetes cluster and deploy the internlm2-chat-7b model service with [kubectl](https://kubernetes.io/docs/reference/kubectl/) command-line tool (replace `<your token>` with your huggingface hub token):
+Connect to a running Kubernetes cluster and deploy the internlm2_5-7b-chat model service with the [kubectl](https://kubernetes.io/docs/reference/kubectl/) command-line tool (replace `<your token>` with your huggingface hub token):
```shell
sed 's/{{HUGGING_FACE_HUB_TOKEN}}/<your token>/' k8s/deployment.yaml | kubectl create -f - \
```