Draft
Changes from all commits · 63 commits
4316425
wip
hjh0119 Aug 29, 2025
5d46eae
init wip
hjh0119 Sep 1, 2025
5828229
args wip
hjh0119 Sep 1, 2025
a82cec4
Merge remote-tracking branch 'origin/main' into mega-grpo
hjh0119 Sep 2, 2025
0689b76
reuse _prepare_rollout_engine
hjh0119 Sep 3, 2025
46593cf
merge main
hjh0119 Sep 11, 2025
3da8756
mega wip
hjh0119 Sep 12, 2025
2ca7ac1
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 17, 2025
d9ec029
wip
hjh0119 Sep 17, 2025
7c56f9f
override train_step wip
hjh0119 Sep 17, 2025
686fc74
remove override train_step to grpo
hjh0119 Sep 18, 2025
095bcbd
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 18, 2025
4d9457b
sync weight wip
hjh0119 Sep 18, 2025
f52d5e1
rollout wip
hjh0119 Sep 19, 2025
155d4fb
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 22, 2025
3c69c39
modify mini_batch_size to generation batch size
hjh0119 Sep 22, 2025
eebdd47
wip
hjh0119 Sep 24, 2025
de6ecfe
loss wip
hjh0119 Sep 28, 2025
4569e54
fix repeat n
hjh0119 Sep 28, 2025
f118935
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 29, 2025
9cb84e3
fix padding to multiple of tp_size
hjh0119 Sep 29, 2025
8627aa3
compute loss
hjh0119 Sep 29, 2025
2292cf8
fix logps
hjh0119 Sep 30, 2025
bbe5f39
logging & patch VL
hjh0119 Sep 30, 2025
6a2940c
fix rollout_group & rollout judgement
hjh0119 Oct 1, 2025
486c3d4
fix step
hjh0119 Oct 6, 2025
7e8e6b0
merge main
hjh0119 Oct 6, 2025
c68d976
move old base trainer to newer
hjh0119 Oct 7, 2025
6b1653c
fix
hjh0119 Oct 8, 2025
d4a9dcc
offload utils
hjh0119 Oct 8, 2025
9dc92a0
offload context
hjh0119 Oct 9, 2025
7bc3d61
Resolve merge conflict in megatron_args.py by removing duplicate fiel…
hjh0119 Oct 9, 2025
91f97ca
fix resolve
hjh0119 Oct 9, 2025
59f436c
fix logps
hjh0119 Oct 9, 2025
8dea6d7
fix old logps
hjh0119 Oct 9, 2025
abac696
reduce redundancy
hjh0119 Oct 9, 2025
3a3ff37
replace token
hjh0119 Oct 10, 2025
2cd89dc
fix offload model
hjh0119 Oct 10, 2025
50d5e6f
offload optimizer & ref
hjh0119 Oct 11, 2025
e1a06c6
support cp
hjh0119 Oct 11, 2025
ff9b667
fix pp+cp
hjh0119 Oct 11, 2025
ba4bfbf
lora wip
hjh0119 Oct 11, 2025
e5a6252
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Oct 13, 2025
e22c790
arguments document
hjh0119 Oct 13, 2025
b3de262
wip lora&cp
hjh0119 Oct 14, 2025
d5bd92c
merge origin
hjh0119 Oct 14, 2025
fe3270f
remove unused patch
hjh0119 Oct 14, 2025
137704e
merge main
hjh0119 Oct 29, 2025
ca9c9bc
wip server
hjh0119 Oct 29, 2025
f258202
wip
hjh0119 Oct 29, 2025
85a035e
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Oct 29, 2025
0a38c0c
server rollout wip
hjh0119 Oct 30, 2025
e0fc2e9
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Nov 4, 2025
5f2f349
move vllm client init out of args
hjh0119 Nov 4, 2025
416feb2
server mode
hjh0119 Nov 4, 2025
85135bb
merge main
hjh0119 Nov 4, 2025
b93c031
remove old func
hjh0119 Nov 4, 2025
2f5d7b5
mcore bridge
hjh0119 Nov 4, 2025
edf3378
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Nov 4, 2025
b3b37ce
merge main & flatten weight sync
hjh0119 Nov 5, 2025
1d930d8
dynamic sample
hjh0119 Nov 5, 2025
5f9e14a
fix dynamic sampling
hjh0119 Nov 5, 2025
b753911
merge main
hjh0119 Nov 6, 2025
2 changes: 1 addition & 1 deletion docs/source/Instruction/GRPO/AdvancedResearch/GSPO.md
@@ -54,7 +54,7 @@ importance_weights = torch.exp(log_importance_weights)
- `importance_sampling_level sequence` (GSPO)
- `importance_sampling_level sequence_token` (GSPO-token)

sequence_token requires ms-swift > 3.7 (installed from source)
sequence_token requires ms-swift >= 3.8
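
A minimal, hedged sketch (illustrative shapes; not the ms-swift source) of how the three levels turn per-token log-ratios into importance weights:

```python
# Illustrative comparison of the three importance_sampling_level options
# (not the ms-swift implementation).
import torch

log_ratio = torch.randn(2, 6)              # log pi_theta - log pi_old, shape (batch, seq_len)
mask = torch.ones(2, 6, dtype=torch.bool)  # valid (non-padding) tokens

# token (GRPO default): raw per-token ratios
w_token = torch.exp(log_ratio)

# sequence (GSPO): one ratio per sequence, from the masked mean of log-ratios
seq_mean = (log_ratio * mask).sum(-1) / mask.sum(-1)
w_sequence = torch.exp(seq_mean).unsqueeze(-1)            # broadcast over tokens

# sequence_token (GSPO-token): sequence-level value, re-attached per token with a
# stop-gradient correction so each token keeps its own gradient path
w_sequence_token = torch.exp(seq_mean.detach().unsqueeze(-1) + log_ratio - log_ratio.detach())
```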

Other hyperparameters from the paper
11 changes: 11 additions & 0 deletions docs/source/Instruction/命令行参数.md
@@ -541,6 +541,15 @@ Reward-model parameters are used in PPO and GRPO.

#### GRPO parameters
- beta: KL regularization coefficient. Default is 0.04; when set to 0 the ref model is not loaded.
- epsilon: clip coefficient, default 0.2.
- epsilon_high: upper clip coefficient, default None. When set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
- delta: upper clipping bound of the two-sided GRPO objective from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Default is None. (A short sketch of the clipping logic follows this list.)
- overlong_filter: skip samples truncated for exceeding the maximum length so they do not contribute to the loss. Default is False.
- dynamic_sample: filter out data from groups whose reward standard deviation is 0 and sample additional new data. Default is False.
- max_resample_times: maximum number of resampling rounds when dynamic_sample is enabled. Default is 3.
- top_entropy_quantile: only tokens whose entropy falls within the specified top quantile contribute to the loss. Default is 1.0, i.e. low-entropy tokens are not filtered. See the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: log the entropy dynamics during training. Default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
- importance_sampling_level: controls how the importance sampling ratio is computed. Options are `token`, `sequence`, and `sequence_token`; default is `token`. See the [GSPO documentation](./GRPO/AdvancedResearch/GSPO.md).
- per_device_train_batch_size: training batch size per device; in GRPO this is the completion-level batch size.
- per_device_eval_batch_size: evaluation batch size per device; in GRPO this is the completion-level batch size.
- generation_batch_size: batch size for sampling completions. Must be a multiple of num_processes * per_device_train_batch_size. Defaults to per_device_train_batch_size * gradient_accumulation_steps * num_processes.
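
The interplay of epsilon, epsilon_high and delta can be summarized with the following hedged sketch; it mirrors the TRL-style GRPO clipped objective and is illustrative rather than the ms-swift implementation:

```python
# Illustrative sketch of the [epsilon, epsilon_high] clipping range and the
# optional two-sided upper bound delta (not the ms-swift source).
import torch

def clipped_token_loss(log_ratio, advantages, epsilon=0.2, epsilon_high=None, delta=None):
    ratio = torch.exp(log_ratio)                     # pi_theta / pi_old, per token
    if delta is not None:                            # two-sided GRPO: cap the unclipped ratio
        ratio = torch.clamp(ratio, max=delta)
    high = epsilon_high if epsilon_high is not None else epsilon
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + high) * advantages
    return -torch.min(unclipped, clipped)            # per-token loss, masked and normalized by loss_type
```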
@@ -600,6 +609,8 @@ Reward-model parameters are used in PPO and GRPO.
- top_entropy_quantile: only tokens whose entropy falls within the specified top quantile contribute to the loss. Default is 1.0, i.e. low-entropy tokens are not filtered. See the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: log the entropy dynamics during training. Default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).

##### Reward function parameters
See the [documentation](./GRPO/DeveloperGuide/奖励函数.md) for the built-in reward functions.

cosine reward parameters
- cosine_min_len_value_wrong: reward assigned at the minimum length when the generated answer is wrong. Default is -0.5.
- cosine_max_len_value_wrong: reward assigned at the maximum length when the generated answer is wrong. Default is 0.0.
31 changes: 30 additions & 1 deletion docs/source/Megatron-SWIFT/命令行参数.md
@@ -244,7 +244,7 @@ LoRA training:


**DPO parameters**:
- ref_load: load path of the ref_model. Must be provided when using the DPO/KTO algorithm with full-parameter training. Default is None, i.e. it is set to `load`.
- ref_load: load path of the ref_model. Must be provided when using the DPO/GRPO/KTO algorithm with full-parameter training. Default is None, i.e. it is set to `load`.
- ref_adapter_load: path from which to load the ref_adapter weights, default None. If you want to run DPO on LoRA weights produced by SFT, use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. To resume training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
- beta: same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig): controls the degree of deviation from the reference model; a higher beta means less deviation. For the IPO loss (loss_type="ipo"), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- 🔥rpo_alpha: parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) controlling the weight of the NLL term (i.e. the SFT loss) in the loss: `loss = dpo_loss + rpo_alpha * sft_loss`. The paper recommends `1.`. Default is `None`, i.e. the SFT loss is not added.
@@ -262,6 +262,35 @@ LoRA training:
- desirable_weight: weights the desirable loss by this coefficient to offset an imbalance between the numbers of desirable and undesirable samples. Default is `1.`.
- undesirable_weight: weights the undesirable loss by this coefficient to offset an imbalance between the numbers of desirable and undesirable samples. Default is `1.`.

**GRPO parameters**
- ref_load: same meaning as in DPO.
- ref_adapter_load: same meaning as in DPO.
- beta: KL regularization coefficient. Default is 0.04; when set to 0 the ref model is not loaded.
- epsilon: clip coefficient, default 0.2.
- epsilon_high: upper clip coefficient, default None. When set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
- overlong_filter: skip samples truncated for exceeding the maximum length so they do not contribute to the loss. Default is False.
- importance_sampling_level: controls how the importance sampling ratio is computed. Options are `token`, `sequence`, and `sequence_token`; default is `token`. See the [GSPO documentation](../Instruction/GRPO/AdvancedResearch/GSPO.md).
- batch-size related parameters (note: all of the following are completion-level; a worked example follows at the end of this section)
  - micro_batch_size: batch size per device, default 1.
  - global_batch_size: total batch size, equivalent to `micro_batch_size * data-parallel size * gradient accumulation steps`. Default is 16. This is the amount of training data used per weight update (the mini_batch_size).
  - generation_batch_size: sampling batch size; must be a multiple of global_batch_size. Defaults to global_batch_size.
  - steps_per_generation: number of optimization steps per generation round, i.e. the ratio of the sampling batch size to global_batch_size. Default is 1.
  - num_generations: number of completions sampled per prompt, the G value in the paper. The sampling batch size must be divisible by num_generations. Default is 8.
- reward_funcs: reward functions for the GRPO algorithm. Options are `accuracy`, `format`, `cosine`, `repetition`, and `soft_overlong`; see swift/plugin/orm.py. You can also define your own reward functions in the plugin. Default is `[]`.
- reward_weights: weight of each reward function. Must match the total number of reward functions and reward models. If None, all rewards are weighted equally with `1.0`.
- loss_type: type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo']; default is 'grpo'. See this [PR](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348) for details.
- vllm_mode parameters
  - vllm_gpu_memory_utilization: pass-through parameter to vLLM, default 0.9.
  - vllm_max_model_len: pass-through parameter to vLLM, default None.
  - vllm_enforce_eager: pass-through parameter to vLLM, default False.
  - vllm_limit_mm_per_prompt: pass-through parameter to vLLM, default None.
  - vllm_enable_prefix_caching: pass-through parameter to vLLM, default True.
  - sleep_level: release vLLM GPU memory during training. Options are [0, 1]; default is 0 (no release).
  - offload_optimizer: whether to offload optimizer states during vLLM inference. Default is False.
  - offload_model: whether to offload the model during vLLM inference. Default is False.

For the built-in reward function parameters, see the [documentation](../Instruction/命令行参数.md#奖励函数参数).
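
As a hedged illustration of the batch-size relations above (hypothetical values, not a recommended configuration):

```python
# Illustrative arithmetic for the Megatron GRPO batch-size parameters
# (hypothetical values; not taken from a real config).
micro_batch_size = 1
data_parallel_size = 4
gradient_accumulation_steps = 4
global_batch_size = micro_batch_size * data_parallel_size * gradient_accumulation_steps  # 16 completions per optimizer step

steps_per_generation = 2
generation_batch_size = global_batch_size * steps_per_generation  # 32 completions sampled per rollout round
num_generations = 8                                               # G completions per prompt

assert generation_batch_size % global_batch_size == 0
assert generation_batch_size % num_generations == 0
prompts_per_round = generation_batch_size // num_generations            # 4 distinct prompts per rollout round
optimizer_steps_per_round = generation_batch_size // global_batch_size  # == steps_per_generation
```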

**RM parameters**:
- center_rewards_coefficient: coefficient used to incentivize the reward model to output rewards with zero mean; see this [paper](https://huggingface.co/papers/2312.09244). Recommended value: 0.01.

11 changes: 11 additions & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -553,6 +553,15 @@ The meanings of the following parameters can be referenced [here](https://huggin

#### GRPO Arguments
- beta: KL regularization coefficient; default 0.04. Setting it to 0 disables the reference model.
- epsilon: epsilon value for clipping. Default is 0.2.
- epsilon_high: Upper clip coefficient, default is None. When set, it forms a clipping range of [epsilon, epsilon_high] together with epsilon.
- delta: Delta value for the upper clipping bound in two-sided GRPO. Recommended to be > 1 + epsilon. This method was introduced in the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291).
- overlong_filter: Skip overlong truncated samples, which will not be included in loss calculation. Default is False.
- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
- max_resample_times: Maximum number of resampling rounds when dynamic_sample is enabled. Default is 3.
- top_entropy_quantile: Only tokens whose entropy ranks within the specified top quantile are included in the loss calculation. The default is 1.0, which means low-entropy tokens are not filtered. For details, refer to the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
- importance_sampling_level: Controls how the importance sampling ratio is computed. Options are `token`, `sequence`, and `sequence_token`. In `token` mode, the raw per-token log-probability ratios are used. In `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged to produce a single ratio per sequence. In `sequence_token` mode (GSPO-token), the sequence-level ratio is re-attached to each token so that token-level advantages can still be applied. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level importance sampling to stabilize training. The default is `token`. See the [GSPO documentation](./GRPO/AdvancedResearch/GSPO.md) for details.
- per_device_train_batch_size: The training batch size per device. In GRPO, this refers to the batch size of completions during training.
- per_device_eval_batch_size: The evaluation batch size per device. In GRPO, this refers to the batch size of completions during evaluation.
- generation_batch_size: Batch size to use for generation. It must be a multiple of `num_processes * per_device_train_batch_size` and defaults to the effective training batch size: `per_device_train_batch_size * num_processes * gradient_accumulation_steps` (see the worked example below).
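
A hedged worked example of this bookkeeping (hypothetical values, not a recommended configuration):

```python
# Illustrative arithmetic for the GRPO generation batch size
# (hypothetical values; not taken from a real config).
per_device_train_batch_size = 4
num_processes = 8                   # data-parallel workers
gradient_accumulation_steps = 2
num_generations = 8                 # completions sampled per prompt (G)

generation_batch_size = per_device_train_batch_size * num_processes * gradient_accumulation_steps  # 64
assert generation_batch_size % (num_processes * per_device_train_batch_size) == 0
assert generation_batch_size % num_generations == 0
prompts_per_round = generation_batch_size // num_generations  # 8 distinct prompts per rollout round
```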
@@ -615,6 +624,8 @@ The hyperparameters for the reward function can be found in the [Built-in Reward
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
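
The top_entropy_quantile filter described above can be pictured with the following hedged sketch (illustrative tensors, not the ms-swift implementation):

```python
# Illustrative sketch of top_entropy_quantile masking (not the ms-swift source).
import torch

entropy = torch.rand(2, 6)                      # per-token entropy of the policy distribution
valid = torch.ones(2, 6, dtype=torch.bool)      # non-padding tokens
top_entropy_quantile = 0.2                      # keep only the top 20% highest-entropy tokens

threshold = torch.quantile(entropy[valid], 1 - top_entropy_quantile)
entropy_mask = (entropy >= threshold) & valid   # tokens allowed to contribute to the loss
# with top_entropy_quantile = 1.0 the threshold is the minimum entropy, so no token is filtered
```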


##### Reward function parameters
Refer to the [documentation](./GRPO/DeveloperGuide/reward_function.md) for built-in reward functions.

cosine reward function arguments
- cosine_min_len_value_wrong (default: -0.5): Reward value corresponding to the minimum length when the answer is incorrect.
35 changes: 34 additions & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -259,7 +259,7 @@ LoRA Training:
- use_rslora: Default is `False`. Whether to use `RS-LoRA`.

**DPO Parameters**
- ref_load: The loading path for the reference model. This must be provided when using DPO/KTO algorithms with full-parameter training. Defaults to `None`, which means it will be set to the same value as `load`.
- ref_load: The loading path for the reference model. This must be provided when using DPO/GRPO/KTO algorithms with full-parameter training. Defaults to `None`, which means it will be set to the same value as `load`.
- ref_adapter_load: The path to load the ref_adapter weights, default is `None`. If you want to use LoRA weights generated from SFT for DPO, please use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. For resuming training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
- beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model. A higher beta value indicates less deviation from the reference model. For the IPO loss function (`loss_type="ipo"`), beta is the regularization parameter as mentioned in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- 🔥rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e., the SFT loss) in the loss function, where `loss = dpo_loss + rpo_alpha * sft_loss`. The paper recommends setting it to `1.`. The default value is `None`, meaning the SFT loss is not included by default.
@@ -280,6 +280,39 @@ LoRA Training:
**RM Parameters**:
- center_rewards_coefficient: A coefficient used in reward model (RM) training to incentivize the model to output rewards with zero mean. See this [paper](https://huggingface.co/papers/2312.09244) for details. Recommended value: 0.01.

**GRPO Parameters**
- ref_load: Same meaning as in DPO.
- ref_adapter_load: Same meaning as in DPO.
- beta: KL regularization coefficient, default is 0.04. When set to 0, the reference model is not loaded.
- epsilon: Clip coefficient, default is 0.2.
- epsilon_high: Upper clip coefficient, default is None. When set, forms a clipping range [epsilon, epsilon_high] together with epsilon.
- overlong_filter: Skips samples that are truncated due to excessive length and excludes them from loss computation. Default is False.
- importance_sampling_level: Controls the level at which importance sampling ratios are computed. Options are `token`, `sequence`, and `sequence_token`. Default is `token`. See [GSPO Documentation](../Instruction/GRPO/AdvancedResearch/GSPO.md) for details.
- Batch size related parameters (Note: all are completion-level)
  - micro_batch_size: Batch size per device, default is 1.
  - global_batch_size: Total batch size, equivalent to `micro_batch_size * data parallelism size * gradient accumulation steps`. Default is 16. Corresponds to the mini_batch_size (number of training samples per weight update).
  - generation_batch_size: Sampling batch size, must be a multiple of global_batch_size. Default equals global_batch_size.
  - steps_per_generation: Number of optimization steps per generation round, i.e., the ratio of generation_batch_size to global_batch_size. Default is 1.
  - num_generations: Number of samples generated per prompt (the "G" value in the paper). generation_batch_size must be divisible by num_generations. Default is 8.
- reward_funcs: Reward functions used in the GRPO algorithm. Options include `accuracy`, `format`, `cosine`, `repetition`, and `soft_overlong`, defined in swift/plugin/orm.py. You can also customize your own reward functions in the plugin. Default is `[]`.
- reward_weights: Weights assigned to each reward function. Must match the total number of reward functions and reward models. If None, all rewards are equally weighted with `1.0`. A sketch of how the weighted rewards feed the GRPO advantages follows this list.
- loss_type: Type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo']. Default is 'grpo'. See this [PR](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348) for details.
- vLLM Parameters
  - vllm_gpu_memory_utilization: Pass-through parameter to vLLM, default is 0.9.
  - vllm_max_model_len: Pass-through parameter to vLLM, default is None.
  - vllm_enforce_eager: Pass-through parameter to vLLM, default is False.
  - vllm_limit_mm_per_prompt: Pass-through parameter to vLLM, default is None.
  - vllm_enable_prefix_caching: Pass-through parameter to vLLM, default is True.
  - sleep_level: Release vLLM GPU memory during training. Options are [0, 1], default is 0 (no release).
  - offload_optimizer: Whether to offload optimizer states during vLLM inference. Default is False.
  - offload_model: Whether to offload model weights during vLLM inference. Default is False.

For built-in reward function parameters, refer to the [documentation](../Instruction/GRPO/DeveloperGuide/reward_function.md).
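
For orientation, here is a hedged sketch (illustrative shapes and values, not the ms-swift implementation) of how the weighted rewards are typically turned into group-normalized GRPO advantages:

```python
# Illustrative sketch: combining reward_funcs outputs with reward_weights and
# normalizing per prompt group (not the ms-swift source).
import torch

rewards_per_func = torch.tensor([   # (num_completions, num_reward_funcs), e.g. accuracy and format
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
])
reward_weights = torch.tensor([1.0, 0.5])
rewards = (rewards_per_func * reward_weights).sum(dim=1)   # one scalar reward per completion

num_generations = 4                                        # all completions above share one prompt
grouped = rewards.view(-1, num_generations)
advantages = (grouped - grouped.mean(dim=1, keepdim=True)) / (grouped.std(dim=1, keepdim=True) + 1e-4)
advantages = advantages.view(-1)                           # scalar advantage per completion
```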

**Mcore-Bridge Parameters**

- 🔥load_safetensors: Defaults to False. Whether to load weights directly from safetensors.
2 changes: 2 additions & 0 deletions swift/llm/template/base.py
@@ -1275,6 +1275,8 @@ def _handle_megatron_cp(self, encoded: Dict[str, Any]) -> None:
    cp_size = self.sequence_parallel_size
    if not self.use_megatron or cp_size == 1:
        return
    if self.mode == 'vllm':  # skip for megatron grpo rollout
        return
    input_ids = encoded['input_ids']
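    # pad so the sequence length is a multiple of cp_size * 2
    # (context parallelism splits each sequence into 2 * cp_size chunks)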
    padding_len = math.ceil(len(input_ids) / (cp_size * 2)) * (cp_size * 2) - len(input_ids)
    input_ids += [self.tokenizer.pad_token_id] * padding_len