Commit 89c2aaf
[Doc]: Update docs for internlm2.5 (InternLM#1887)
* update docs to internlm2.5
* update long context
* update
* update
* update kv cache docs
* fix

Co-authored-by: lvhan028 <[email protected]>
1 parent 1f5dd4e commit 89c2aaf

22 files changed: +156 -110 lines

README.md

+1

@@ -108,6 +108,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>Llama3 (8B, 70B)</li>
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
+<li>InternLM2.5 (7B)</li>
 <li>QWen (1.8B - 72B)</li>
 <li>QWen1.5 (0.5B - 110B)</li>
 <li>QWen1.5 - MoE (0.5B - 72B)</li>

README_zh-CN.md

+1

@@ -109,6 +109,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力，在各种规模的模型
 <li>Llama3 (8B, 70B)</li>
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
+<li>InternLM2.5 (7B)</li>
 <li>QWen (1.8B - 72B)</li>
 <li>QWen1.5 (0.5B - 110B)</li>
 <li>QWen1.5 - MoE (0.5B - 72B)</li>

docs/en/advance/chat_template.md

+3 -3

@@ -36,14 +36,14 @@ LMDeploy supports two methods of adding chat templates:
 When using the CLI tool, you can pass in a custom chat template with `--chat-template`, for example.
 
 ```shell
-lmdeploy serve api_server internlm/internlm2-chat-7b --chat-template ${JSON_FILE}
+lmdeploy serve api_server internlm/internlm2_5-7b-chat --chat-template ${JSON_FILE}
 ```
 
 You can also pass it in through the interface function, for example.
 
 ```python
 from lmdeploy import ChatTemplateConfig, serve
-serve('internlm/internlm2-chat-7b',
+serve('internlm/internlm2_5-7b-chat',
       chat_template_config=ChatTemplateConfig.from_json('${JSON_FILE}'))
 ```
 
@@ -81,7 +81,7 @@ LMDeploy supports two methods of adding chat templates:
 from lmdeploy import ChatTemplateConfig, pipeline
 
 messages = [{'role': 'user', 'content': 'who are you?'}]
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 chat_template_config=ChatTemplateConfig('customized_model'))
 for response in pipe.stream_infer(messages):
     print(response.text, end='')

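For readers following the `--chat-template ${JSON_FILE}` examples above, here is a minimal sketch (not part of the commit) of generating such a file from Python. The field names are assumptions for illustration; the authoritative key list is documented earlier in `chat_template.md`.

```python
# Illustrative sketch: write a minimal custom chat-template JSON for
# `--chat-template` / `ChatTemplateConfig.from_json`. The keys below are
# assumed for illustration; consult chat_template.md for the real schema.
import json

template = {
    'model_name': 'customized_model',   # name later referenced by ChatTemplateConfig('customized_model')
    'system': '<|im_start|>system\n',
    'user': '<|im_start|>user\n',
    'assistant': '<|im_start|>assistant\n',
    'stop_words': ['<|im_end|>'],
}

with open('chat_template.json', 'w') as f:
    json.dump(template, f, indent=2, ensure_ascii=False)
print('wrote chat_template.json')
```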
docs/en/advance/long_context.md

+33 -13

@@ -6,13 +6,18 @@ Long text extrapolation refers to the ability of LLM to handle data longer than
 
 You can enable the context length extrapolation ability by modifying the TurbomindEngineConfig. Edit the `session_len` to the expected length and change `rope_scaling_factor` to a number no less than 1.0.
 
-Here is an example:
+Take `internlm2_5-7b-chat-1m` as an example, which supports a context length of up to **1 million tokens**:
 
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=160000)
-pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=1000000,
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4)
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
@@ -34,19 +39,26 @@ You can try the following code to test how many times LMDeploy can retrieval the
 import numpy as np
 from lmdeploy import pipeline
 from lmdeploy import TurbomindEngineConfig
+import time
 
-session_len = 160000
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=session_len)
-pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
+session_len = 1000000
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=session_len,
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4)
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 
 
-def passkey_retrival(session_len, n_round=5):
+def passkey_retrieval(session_len, n_round=5):
     # create long context input
     tok = pipe.tokenizer
     task_description = 'There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.'
     garbage = 'The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.'
 
     for _ in range(n_round):
+        start = time.perf_counter()
         n_times = (session_len - 1000) // len(tok.encode(garbage))
         n_garbage_prefix = np.random.randint(0, n_times)
         n_garbage_suffix = n_times - n_garbage_prefix
@@ -67,11 +79,14 @@ def passkey_retrival(session_len, n_round=5):
         prompt = ' '.join(lines)
         response = pipe([prompt])
         print(pass_key, response)
+        end = time.perf_counter()
+        print(f'duration: {end - start} s')
 
-
-passkey_retrival(session_len, 5)
+passkey_retrieval(session_len, 5)
 ```
 
+This test takes approximately 364 seconds per round when conducted on A100-80G GPUs.
+
 ### Needle In A Haystack
 
 [OpenCompass](https://github.com/open-compass/opencompass) offers very useful tools to perform needle-in-a-haystack evaluation. For specific instructions, please refer to the [guide](https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/needleinahaystack_eval.md).
@@ -86,14 +101,19 @@ from lmdeploy import TurbomindEngineConfig, pipeline
 import numpy as np
 
 # load model and tokenizer
-model_repoid_or_path = 'internlm/internlm2-chat-7b'
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=160000)
+model_repoid_or_path = 'internlm/internlm2_5-7b-chat-1m'
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=1000000,
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4)
 pipe = pipeline(model_repoid_or_path, backend_config=backend_config)
 tokenizer = AutoTokenizer.from_pretrained(model_repoid_or_path, trust_remote_code=True)
 
 # get perplexity
 text = 'Use a long prompt to replace this sentence'
 input_ids = tokenizer.encode(text)
-loss = pipe.get_ppl(input_ids)[0]
-ppl = np.exp(loss)
+ppl = pipe.get_ppl(input_ids)[0]
+print(ppl)
 ```

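As a side note on the passkey test above, the filler budget for a given `session_len` can be estimated without loading the TurboMind engine. A minimal sketch (not part of the commit), assuming only the Hugging Face tokenizer for the model is available:

```python
# Illustrative sketch: estimate how many filler sentences fit into the target
# context before launching the full 1M-token passkey test above.
from transformers import AutoTokenizer

session_len = 1_000_000
tok = AutoTokenizer.from_pretrained('internlm/internlm2_5-7b-chat-1m',
                                    trust_remote_code=True)
garbage = ('The grass is green. The sky is blue. The sun is yellow. '
           'Here we go. There and back again.')
tokens_per_sentence = len(tok.encode(garbage))
n_times = (session_len - 1000) // tokens_per_sentence  # same headroom as the test above
print(f'{tokens_per_sentence} tokens per filler sentence, about {n_times} repetitions fit')
```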
docs/en/faq.md

+3 -3

@@ -55,7 +55,7 @@ from lmdeploy import pipeline, TurbomindEngineConfig
 
 backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
 
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
 print(response)
@@ -65,10 +65,10 @@ If OOM occurs when you run CLI tools, please pass `--cache-max-entry-count` to d
 
 ```shell
 # chat command
-lmdeploy chat internlm/internlm2-chat-7b --cache-max-entry-count 0.2
+lmdeploy chat internlm/internlm2_5-7b-chat --cache-max-entry-count 0.2
 
 # server command
-lmdeploy serve api_server internlm/internlm2-chat-7b --cache-max-entry-count 0.2
+lmdeploy serve api_server internlm/internlm2_5-7b-chat --cache-max-entry-count 0.2
 ```
 
 ## Serve

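If the reduced kv-cache ratio alone is not enough, it can be combined with a shorter context window; a hedged sketch (illustration only, not part of the commit), using only engine parameters that already appear elsewhere in this commit's docs, with arbitrary example values:

```python
# Illustrative sketch: lower the kv-cache ratio and cap the context window
# together to further reduce GPU memory. Both parameters appear elsewhere in
# this commit's docs; the concrete values here are arbitrary examples.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2,  # 20% of free memory for kv cache
                                       session_len=8192)           # shorter max context
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
print(pipe(['Hi, pls intro yourself']))
```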
docs/en/get_started.md

+3 -3

@@ -22,7 +22,7 @@ pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_V
 
 ```python
 import lmdeploy
-pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
+pipe = lmdeploy.pipeline("internlm/internlm2_5-7b-chat")
 response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```
@@ -52,7 +52,7 @@ LMDeploy CLI offers the following utilities, helping users experience LLM featur
 ### Inference with Command line Interface
 
 ```shell
-lmdeploy chat internlm/internlm2-chat-7b
+lmdeploy chat internlm/internlm2_5-7b-chat
 ```
 
 ### Serving with Web UI
@@ -63,7 +63,7 @@ LMDeploy adopts gradio to develop the online demo.
 # install dependencies
 pip install lmdeploy[serve]
 # launch gradio server
-lmdeploy serve gradio internlm/internlm2-chat-7b
+lmdeploy serve gradio internlm/internlm2_5-7b-chat
 ```
 
 ![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

docs/en/inference/pipeline.md

+8 -8

@@ -11,7 +11,7 @@ You can overview the detailed pipeline API in [this](https://lmdeploy.readthedoc
 ```python
 from lmdeploy import pipeline
 
-pipe = pipeline('internlm/internlm2-chat-7b')
+pipe = pipeline('internlm/internlm2_5-7b-chat')
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
 print(response)
 ```
@@ -30,7 +30,7 @@ There have been alterations to the strategy for setting the k/v cache ratio thro
 # decrease the ratio of the k/v cache occupation to 20%
 backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
 
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
 print(response)
@@ -46,7 +46,7 @@ There have been alterations to the strategy for setting the k/v cache ratio thro
 from lmdeploy import pipeline, TurbomindEngineConfig
 
 backend_config = TurbomindEngineConfig(tp=2)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
 print(response)
@@ -62,7 +62,7 @@ gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
                               temperature=0.8,
                               max_new_tokens=1024)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                 gen_config=gen_config)
@@ -79,7 +79,7 @@ gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
                               temperature=0.8,
                               max_new_tokens=1024)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 prompts = [[{
     'role': 'user',
@@ -103,7 +103,7 @@ gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
                               temperature=0.8,
                               max_new_tokens=1024)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 prompts = [[{
     'role': 'user',
@@ -121,7 +121,7 @@ for item in pipe.stream_infer(prompts, gen_config=gen_config):
 ```python
 from transformers import AutoTokenizer
 from lmdeploy import pipeline
-model_repoid_or_path='internlm/internlm2-chat-7b'
+model_repoid_or_path='internlm/internlm2_5-7b-chat'
 pipe = pipeline(model_repoid_or_path)
 tokenizer = AutoTokenizer.from_pretrained(model_repoid_or_path, trust_remote_code=True)
 
@@ -150,7 +150,7 @@ gen_config = GenerationConfig(top_p=0.8,
                               top_k=40,
                               temperature=0.8,
                               max_new_tokens=1024)
-pipe = pipeline('internlm/internlm2-chat-7b',
+pipe = pipeline('internlm/internlm2_5-7b-chat',
                 backend_config=backend_config)
 prompts = [[{
     'role': 'user',

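The `prompts = [[{ ...` snippets above are cut off by the diff context; a complete version of that pattern, streamed with the renamed 2.5 model, might look like the following sketch (illustration only, not part of the commit):

```python
# Illustrative sketch: the OpenAI-style message format used by the truncated
# `prompts = [[{...` snippets above, streamed with pipe.stream_infer.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
prompts = [[{'role': 'user', 'content': 'Hi, pls intro yourself'}],
           [{'role': 'user', 'content': 'Shanghai is'}]]
for item in pipe.stream_infer(prompts, gen_config=gen_config):
    print(item.text, end='')
```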
docs/en/quantization/kv_quant.md

+11 -11

@@ -38,30 +38,30 @@ Applying kv quantization and inference via LMDeploy is quite straightforward. Si
 ```python
 from lmdeploy import pipeline, TurbomindEngineConfig
 engine_config = TurbomindEngineConfig(quant_policy=8)
-pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
+pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)
 response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```
 
 ### Serving
 
 ```shell
-lmdeploy serve api_server internlm/internlm2-chat-7b --quant-policy 8
+lmdeploy serve api_server internlm/internlm2_5-7b-chat --quant-policy 8
 ```
 
 ## Evaluation
 
 We apply kv quantization of LMDeploy to several LLM models and utilize OpenCompass to evaluate the inference accuracy. The results are shown in the table below:
 
-| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | qwen1.5-7b-chat | - | - |
-| ----------- | ------- | ------------- | -------------- | ------- | ------- | ----------------- | ------- | ------- | --------------- | ------- | ------- |
-| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | fp16 | kv int8 | kv int4 |
-| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
-| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
-| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
-| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
-| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
-| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |
+| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | internlm2.5-chat-7b | - | - | qwen1.5-7b-chat | - | - |
+| ----------- | ------- | ------------- | -------------- | ------- | ------- | ----------------- | ------- | ------- | ------------------- | ------- | ------- | --------------- | ------- | ------- |
+| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | fp16 | kv int8 | kv int4 |
+| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 78.06 | 77.87 | 77.05 | 70.56 | 70.49 | 68.62 |
+| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 72.30 | 72.27 | 71.17 | 61.48 | 61.56 | 60.65 |
+| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 65.09 | 64.87 | 63.28 | 44.62 | 44.77 | 44.04 |
+| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 85.67 | 85.44 | 83.78 | 54.97 | 56.41 | 54.74 |
+| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 92.76 | 92.83 | 92.55 | 87.33 | 87.26 | 86.28 |
+| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 90.51 | 90.42 | 90.42 | 82.53 | 82.59 | 82.02 |
 
 For detailed evaluation methods, please refer to [this](../benchmark/evaluate_with_opencompass.md) guide. Remember to pass `quant_policy` to the inference engine in the config file.
 

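For the kv int4 columns in the new table, the same API applies with `quant_policy=4` instead of `8`; a minimal sketch (illustration only, not part of the commit), mirroring the int8 example in the diff above:

```python
# Illustrative sketch: run the newly added internlm2.5 model with int4 kv cache,
# mirroring the quant_policy=8 example shown in the diff above.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(quant_policy=4)  # 4 -> kv int4, 8 -> kv int8
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```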
docs/en/quantization/w4a16.md

+8 -8

@@ -33,8 +33,8 @@ This article comprises the following sections:
 A single command execution is all it takes to quantize the model. The resulting quantized weights are then stored in the $WORK_DIR directory.
 
 ```shell
-export HF_MODEL=internlm/internlm2-chat-7b
-export WORK_DIR=internlm/internlm2-chat-7b-4bit
+export HF_MODEL=internlm/internlm2_5-7b-chat
+export WORK_DIR=internlm/internlm2_5-7b-chat-4bit
 
 lmdeploy lite auto_awq \
   $HF_MODEL \
@@ -48,10 +48,10 @@ lmdeploy lite auto_awq \
   --work-dir $WORK_DIR
 ```
 
-Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the [internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) model, the command can be condensed as:
+Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the [internlm/internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) model, the command can be condensed as:
 
 ```shell
-lmdeploy lite auto_awq internlm/internlm2-chat-7b --work-dir internlm2-chat-7b-4bit
+lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir internlm2_5-7b-chat-4bit
 ```
 
 **Note:**
@@ -63,13 +63,13 @@ Upon completing quantization, you can engage with the model efficiently using a
 For example, you can initiate a conversation with it via the command line:
 
 ```shell
-lmdeploy chat ./internlm2-chat-7b-4bit --model-format awq
+lmdeploy chat ./internlm2_5-7b-chat-4bit --model-format awq
 ```
 
 Alternatively, you can start the gradio server and interact with the model through the web at `http://{ip_addr}:{port}`
 
 ```shell
-lmdeploy serve gradio ./internlm2-chat-7b-4bit --server_name {ip_addr} --server_port {port} --model-format awq
+lmdeploy serve gradio ./internlm2_5-7b-chat-4bit --server_name {ip_addr} --server_port {port} --model-format awq
 ```
 
 ## Evaluation
@@ -83,7 +83,7 @@ Trying the following codes, you can perform the batched offline inference with t
 ```python
 from lmdeploy import pipeline, TurbomindEngineConfig
 engine_config = TurbomindEngineConfig(model_format='awq')
-pipe = pipeline("./internlm2-chat-7b-4bit", backend_config=engine_config)
+pipe = pipeline("./internlm2_5-7b-chat-4bit", backend_config=engine_config)
 response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```
@@ -115,7 +115,7 @@ print(response)
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
 
 ```shell
-lmdeploy serve api_server ./internlm2-chat-7b-4bit --backend turbomind --model-format awq
+lmdeploy serve api_server ./internlm2_5-7b-chat-4bit --backend turbomind --model-format awq
 ```
 
 The default port of `api_server` is `23333`. After the server is launched, you can communicate with the server in the terminal through `api_client`:

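Since the doc notes that the api_server's RESTful APIs are OpenAI-compatible, a client-side smoke test can be written with the `openai` SDK. A hedged sketch (not part of the commit); the served model id is assumed here to equal the path passed to `api_server` and should be confirmed via `GET /v1/models`:

```python
# Illustrative client for the server launched above. Assumes the default port
# 23333 and that the served model id matches the launch path (check /v1/models).
from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
resp = client.chat.completions.create(
    model='./internlm2_5-7b-chat-4bit',
    messages=[{'role': 'user', 'content': 'Hi, pls intro yourself'}],
    temperature=0.8,
    top_p=0.8)
print(resp.choices[0].message.content)
```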
docs/en/serving/api_server.md

+4 -4

@@ -11,12 +11,12 @@ Finally, we showcase how to integrate the service into a WebUI, providing you wi
 
 ## Launch Service
 
-Take the [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) model hosted on huggingface hub as an example, you can choose one the following methods to start the service.
+Take the [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) model hosted on huggingface hub as an example. You can choose one of the following methods to start the service.
 
 ### Option 1: Launching with lmdeploy CLI
 
 ```shell
-lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333
+lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333
 ```
 
 The arguments of `api_server` can be viewed through the command `lmdeploy serve api_server -h`, for instance, `--tp` to set tensor parallelism, `--session-len` to specify the max length of the context window, `--cache-max-entry-count` to adjust the GPU mem ratio for k/v cache etc.
@@ -32,14 +32,14 @@ docker run --runtime nvidia --gpus all \
     -p 23333:23333 \
     --ipc=host \
     openmmlab/lmdeploy:latest \
-    lmdeploy serve api_server internlm/internlm2-chat-7b
+    lmdeploy serve api_server internlm/internlm2_5-7b-chat
 ```
 
 The parameters of `api_server` are the same as those mentioned in the "[option 1](#option-1-launching-with-lmdeploy-cli)" section.
 
 ### Option 3: Deploying to Kubernetes cluster
 
-Connect to a running Kubernetes cluster and deploy the internlm2-chat-7b model service with [kubectl](https://kubernetes.io/docs/reference/kubectl/) command-line tool (replace `<your token>` with your huggingface hub token):
+Connect to a running Kubernetes cluster and deploy the internlm2_5-7b-chat model service with the [kubectl](https://kubernetes.io/docs/reference/kubectl/) command-line tool (replace `<your token>` with your huggingface hub token):
 
 ```shell
 sed 's/{{HUGGING_FACE_HUB_TOKEN}}/<your token>/' k8s/deployment.yaml | kubectl create -f - \

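As a quick sanity check once any of the launch options above is running, the OpenAI-compatible model listing endpoint can be queried; a minimal sketch (illustration only, not part of the commit), assuming the default port 23333 and the standard OpenAI response shape:

```python
# Illustrative sketch: list the models served by the api_server started above
# via the OpenAI-compatible /v1/models endpoint (default port 23333 assumed).
import requests

resp = requests.get('http://0.0.0.0:23333/v1/models', timeout=10)
resp.raise_for_status()
print([model['id'] for model in resp.json()['data']])
```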