README.md (+4 -3)
@@ -57,7 +57,7 @@ You can contact us and communicate with us by adding our group:
## 📝 Introduction
🍲 ms-swift is an official framework provided by the ModelScope community for fine-tuning and deploying large language models and multi-modal large models. It currently supports the training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment of 450+ large models and 150+ multi-modal large models. These large language models (LLMs) include models such as Qwen2.5, InternLM3, GLM4, Llama3.3, Mistral, DeepSeek-R1, Yi1.5, TeleChat2, Baichuan2, and Gemma2. The multi-modal LLMs include models such as Qwen2.5-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, and GOT-OCR2.
-🍔 In addition, ms-swift gathers the latest training technologies, including LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger. ms-swift supports acceleration of inference, evaluation, and deployment modules using vLLM and LMDeploy, and supports the quantization of large models and multi-modal large models using technologies such as GPTQ, AWQ, and BNB. To help researchers and developers fine-tune and apply large models more easily, ms-swift also provides a Gradio-based Web-UI interface and a wealth of best practices.
+🍔 Additionally, ms-swift incorporates the latest training technologies, including lightweight techniques such as LoRA, QLoRA, Llama-Pro, LongLoRA, GaLore, Q-GaLore, LoRA+, LISA, DoRA, FourierFt, ReFT, UnSloth, and Liger, as well as human alignment training methods like DPO, GRPO, RM, PPO, KTO, CPO, SimPO, and ORPO. ms-swift supports acceleration of inference, evaluation, and deployment modules using vLLM and LMDeploy, and it supports model quantization with technologies like GPTQ, AWQ, and BNB. Furthermore, ms-swift offers a Gradio-based Web UI and a wealth of best practices.
**Why choose ms-swift?**
@@ -67,7 +67,7 @@ You can contact us and communicate with us by adding our group:
- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
- **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO, GRPO for both pure text and multi-modal large models.
+- **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
- 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
- **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline.
- **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
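For orientation, the lightweight-training features listed above are driven through the `swift` command-line interface. The sketch below is illustrative only: the model ID and dataset are examples, and the flag spellings (`--train_type`, `--lora_rank`, and so on) follow ms-swift 3.x conventions from memory, so they should be checked against the Command-line-parameters documentation.

```bash
# Minimal LoRA fine-tuning sketch; flag names assumed from ms-swift 3.x and may differ by version.
# Model and dataset IDs are examples only.
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --train_type lora \
    --dataset AI-ModelScope/alpaca-gpt4-data-en \
    --lora_rank 8 \
    --lora_alpha 32 \
    --num_train_epochs 1 \
    --output_dir output
```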
docs/source_en/GetStarted/Quick-start.md (+1 -1)
@@ -8,7 +8,7 @@ ms-swift is a comprehensive training and deployment framework for large language
- 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more.
- Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, and other distributed training technologies.
- Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM, PPO, GRPO for both text-based and multimodal large models.
+- RLHF Training: Supports human alignment training methods like DPO, GRPO, RM, PPO, KTO, CPO, SimPO, ORPO for both text-based and multimodal large models.
- 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding.
- Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models.
- Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc.
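As a companion to the RLHF Training bullet above, a human-alignment run is typically launched through the `swift rlhf` entry point, with `--rlhf_type` selecting the method (DPO, GRPO, KTO, and so on). The command below is a rough sketch under that assumption; the dataset path is a placeholder and exact flag names may vary between versions.

```bash
# DPO alignment sketch; the dataset path is a placeholder and flag names are assumptions.
swift rlhf \
    --rlhf_type dpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --train_type lora \
    --dataset path/to/preference_pairs.jsonl \
    --output_dir output
```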
docs/source_en/Instruction/Command-line-parameters.md (+2 -1)
@@ -381,6 +381,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
- reward_weights: Weights for each reward function. Must match the number of reward functions. If `None`, all rewards are weighted equally with weight `1.0`.
- Note: If `--reward_model` is included in GRPO training, it is added to the end of the reward functions.
- log_completions: Whether to log the model-generated content during training, to be used in conjunction with `--report_to wandb`, default is False.
+- Note: If `--report_to wandb` is not set, a `completions.jsonl` file will be created in the checkpoint directory to store the generated content.
- use_vllm: Whether to use vLLM as the infer_backend for GRPO generation, default is False.
- vllm_device: Set the device for vLLM deployment. For example, if deployed on card 0, use `cuda:0`; default is `auto`, which means using the last available GPU.
- vllm_gpu_memory_utilization: vLLM passthrough parameter, default is 0.9.
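To show how the GRPO-specific parameters above fit together, here is a hedged launch sketch. Only `--use_vllm`, `--vllm_device`, `--vllm_gpu_memory_utilization`, `--log_completions`, and `--report_to` are taken from this section; the `swift rlhf --rlhf_type grpo` entry point, model, and dataset are assumptions or placeholders. `reward_weights` is omitted because, when left unset, all reward functions are weighted equally.

```bash
# GRPO sketch using the vLLM and logging parameters documented above.
# The entry point, model, and dataset are assumptions/placeholders; verify flags for your version.
# Without --report_to wandb, generations are written to completions.jsonl in the checkpoint directory.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset path/to/prompts.jsonl \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.9 \
    --log_completions true \
    --report_to wandb
```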
@@ -456,7 +457,7 @@ Evaluation Arguments inherit from the [deployment arguments](#deployment-argumen
- Note: By default, the evaluation will use datasets from `~/.cache/opencompass`. Specifying this parameter will directly use the data folder in the current directory.
- temperature: Overrides the generation arguments, with a default value of 0.
- verbose: This parameter is passed into DeployArguments when setting up local deployment and evaluation, and defaults to `False`.
-- eval_num_proc: Maximum concurrency for clients during evaluation. The default for text evaluation is 256, while for multimodal it is 16.
+- eval_num_proc: Maximum number of concurrent clients during evaluation, default is 16.
- 🔥eval_url: The evaluation URL, for example, `http://localhost:8000/v1`. Examples can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/eval/eval_url). The default value is None, which means using local deployment for evaluation.
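To illustrate the evaluation arguments above, a client-side run against an already deployed OpenAI-compatible endpoint might look like the sketch below. Only `--eval_url`, `--eval_num_proc`, and `--temperature` come from this section; the `--eval_dataset` flag and the benchmark name are assumptions and should be confirmed in the evaluation documentation.

```bash
# Evaluate against an existing endpoint; the values shown restate the documented defaults.
# --eval_dataset and the benchmark name are assumptions, not taken from this section.
swift eval \
    --eval_url http://localhost:8000/v1 \
    --eval_num_proc 16 \
    --temperature 0 \
    --eval_dataset gsm8k
```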
docs/source_en/Instruction/GRPO.md (+1)
@@ -91,6 +91,7 @@ Hyperparameters
- reward_weights: Weights for each reward function. Must match the number of reward functions. If `None`, all rewards are weighted equally with weight `1.0`.
- Note: If `--reward_model` is included in GRPO training, it is added to the end of the reward functions.
- log_completions: Whether to log the model-generated content during training, to be used in conjunction with `--report_to wandb`, default is False.
+- Note: If `--report_to wandb` is not set, a `completions.jsonl` file will be created in the checkpoint directory to store the generated content.
- use_vllm: Whether to use vLLM as the back-end for sampling generation; default is False. Using it is recommended to speed up training.
- vllm_device: Device for deploying vLLM, default is auto, meaning the first unused GPU. Use cuda:x to specify a particular card.
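As a quick check on the `completions.jsonl` behavior noted above: when `--report_to wandb` is not set, the generated samples land in the checkpoint directory and can be inspected directly. The path below is illustrative; substitute your own output directory and checkpoint step.

```bash
# Peek at the first few logged generations (illustrative path; adjust to your run).
head -n 3 output/checkpoint-100/completions.jsonl
```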