Draft
Changes from all commits · 63 commits
4316425
wip
hjh0119 Aug 29, 2025
5d46eae
init wip
hjh0119 Sep 1, 2025
5828229
args wip
hjh0119 Sep 1, 2025
a82cec4
Merge remote-tracking branch 'origin/main' into mega-grpo
hjh0119 Sep 2, 2025
0689b76
reuse _prepare_rollout_engine
hjh0119 Sep 3, 2025
46593cf
merge main
hjh0119 Sep 11, 2025
3da8756
mega wip
hjh0119 Sep 12, 2025
2ca7ac1
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 17, 2025
d9ec029
wip
hjh0119 Sep 17, 2025
7c56f9f
override train_step wip
hjh0119 Sep 17, 2025
686fc74
remove override train_step to grpo
hjh0119 Sep 18, 2025
095bcbd
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 18, 2025
4d9457b
sync weight wip
hjh0119 Sep 18, 2025
f52d5e1
rollout wip
hjh0119 Sep 19, 2025
155d4fb
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 22, 2025
3c69c39
modify mini_batch_size to generation batch size
hjh0119 Sep 22, 2025
eebdd47
wip
hjh0119 Sep 24, 2025
de6ecfe
loss wip
hjh0119 Sep 28, 2025
4569e54
fix repeat n
hjh0119 Sep 28, 2025
f118935
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Sep 29, 2025
9cb84e3
fix padding to multiple of tp_size
hjh0119 Sep 29, 2025
8627aa3
compute loss
hjh0119 Sep 29, 2025
2292cf8
fix logps
hjh0119 Sep 30, 2025
bbe5f39
logging & patch VL
hjh0119 Sep 30, 2025
6a2940c
fix rollout_group & rollout judgement
hjh0119 Oct 1, 2025
486c3d4
fix step
hjh0119 Oct 6, 2025
7e8e6b0
merge main
hjh0119 Oct 6, 2025
c68d976
move old base trainer to newer
hjh0119 Oct 7, 2025
6b1653c
fix
hjh0119 Oct 8, 2025
d4a9dcc
offload utils
hjh0119 Oct 8, 2025
9dc92a0
offload context
hjh0119 Oct 9, 2025
7bc3d61
Resolve merge conflict in megatron_args.py by removing duplicate fiel…
hjh0119 Oct 9, 2025
91f97ca
fix resolve
hjh0119 Oct 9, 2025
59f436c
fix logps
hjh0119 Oct 9, 2025
8dea6d7
fix old logps
hjh0119 Oct 9, 2025
abac696
reduce redundancy
hjh0119 Oct 9, 2025
3a3ff37
replace token
hjh0119 Oct 10, 2025
2cd89dc
fix offload model
hjh0119 Oct 10, 2025
50d5e6f
offload optimizer & ref
hjh0119 Oct 11, 2025
e1a06c6
support cp
hjh0119 Oct 11, 2025
ff9b667
fix pp+cp
hjh0119 Oct 11, 2025
ba4bfbf
lora wip
hjh0119 Oct 11, 2025
e5a6252
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Oct 13, 2025
e22c790
arguments document
hjh0119 Oct 13, 2025
b3de262
wip lora&cp
hjh0119 Oct 14, 2025
d5bd92c
merge origin
hjh0119 Oct 14, 2025
fe3270f
remove unused patch
hjh0119 Oct 14, 2025
137704e
merge main
hjh0119 Oct 29, 2025
ca9c9bc
wip server
hjh0119 Oct 29, 2025
f258202
wip
hjh0119 Oct 29, 2025
85a035e
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Oct 29, 2025
0a38c0c
server rollout wip
hjh0119 Oct 30, 2025
e0fc2e9
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Nov 4, 2025
5f2f349
move vllm client init out of args
hjh0119 Nov 4, 2025
416feb2
server mode
hjh0119 Nov 4, 2025
85135bb
merge main
hjh0119 Nov 4, 2025
b93c031
remove old func
hjh0119 Nov 4, 2025
2f5d7b5
mcore bridge
hjh0119 Nov 4, 2025
edf3378
Merge remote-tracking branch 'origin' into mega-grpo
hjh0119 Nov 4, 2025
b3b37ce
merge main & flatten weight sync
hjh0119 Nov 5, 2025
1d930d8
dynamic sample
hjh0119 Nov 5, 2025
5f9e14a
fix dynamic sampling
hjh0119 Nov 5, 2025
b753911
merge main
hjh0119 Nov 6, 2025
2 changes: 1 addition & 1 deletion docs/source/Instruction/GRPO/AdvancedResearch/GSPO.md
@@ -54,7 +54,7 @@ importance_weights = torch.exp(log_importance_weights)
- `importance_sampling_level sequence` (GSPO)
- `importance_sampling_level sequence_token` (GSPO-token)

sequence_token requires ms-swift > 3.7 (installed from source)
sequence_token requires ms-swift >= 3.8
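
A minimal, hedged sketch (illustrative shapes; not the ms-swift source) of how the three levels turn per-token log-ratios into importance weights:

```python
# Illustrative comparison of the three importance_sampling_level options
# (not the ms-swift implementation).
import torch

log_ratio = torch.randn(2, 6)              # log pi_theta - log pi_old, shape (batch, seq_len)
mask = torch.ones(2, 6, dtype=torch.bool)  # valid (non-padding) tokens

# token (GRPO default): raw per-token ratios
w_token = torch.exp(log_ratio)

# sequence (GSPO): one ratio per sequence, from the masked mean of log-ratios
seq_mean = (log_ratio * mask).sum(-1) / mask.sum(-1)
w_sequence = torch.exp(seq_mean).unsqueeze(-1)            # broadcast over tokens

# sequence_token (GSPO-token): sequence-level value, re-attached per token with a
# stop-gradient correction so each token keeps its own gradient path
w_sequence_token = torch.exp(seq_mean.detach().unsqueeze(-1) + log_ratio - log_ratio.detach())
```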

Other hyperparameters from the paper
11 changes: 11 additions & 0 deletions docs/source/Instruction/命令行参数.md
@@ -541,6 +541,15 @@ Reward-model parameters are used in PPO and GRPO.

#### GRPO parameters
- beta: KL regularization coefficient. Default is 0.04; when set to 0 the ref model is not loaded.
- epsilon: clip coefficient, default 0.2.
- epsilon_high: upper clip coefficient, default None. When set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
- delta: upper clipping bound of the two-sided GRPO objective from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Default is None. (A short sketch of the clipping logic follows this list.)
- overlong_filter: skip samples truncated for exceeding the maximum length so they do not contribute to the loss. Default is False.
- dynamic_sample: filter out data from groups whose reward standard deviation is 0 and sample additional new data. Default is False.
- max_resample_times: maximum number of resampling rounds when dynamic_sample is enabled. Default is 3.
- top_entropy_quantile: only tokens whose entropy falls within the specified top quantile contribute to the loss. Default is 1.0, i.e. low-entropy tokens are not filtered. See the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: log the entropy dynamics during training. Default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
- importance_sampling_level: controls how the importance sampling ratio is computed. Options are `token`, `sequence`, and `sequence_token`; default is `token`. See the [GSPO documentation](./GRPO/AdvancedResearch/GSPO.md).
- per_device_train_batch_size: training batch size per device; in GRPO this is the completion-level batch size.
- per_device_eval_batch_size: evaluation batch size per device; in GRPO this is the completion-level batch size.
- generation_batch_size: batch size for sampling completions. Must be a multiple of num_processes * per_device_train_batch_size. Defaults to per_device_train_batch_size * gradient_accumulation_steps * num_processes.
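
The interplay of epsilon, epsilon_high and delta can be summarized with the following hedged sketch; it mirrors the TRL-style GRPO clipped objective and is illustrative rather than the ms-swift implementation:

```python
# Illustrative sketch of the [epsilon, epsilon_high] clipping range and the
# optional two-sided upper bound delta (not the ms-swift source).
import torch

def clipped_token_loss(log_ratio, advantages, epsilon=0.2, epsilon_high=None, delta=None):
    ratio = torch.exp(log_ratio)                     # pi_theta / pi_old, per token
    if delta is not None:                            # two-sided GRPO: cap the unclipped ratio
        ratio = torch.clamp(ratio, max=delta)
    high = epsilon_high if epsilon_high is not None else epsilon
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + high) * advantages
    return -torch.min(unclipped, clipped)            # per-token loss, masked and normalized by loss_type
```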
@@ -600,6 +609,8 @@ Reward-model parameters are used in PPO and GRPO.
- top_entropy_quantile: only tokens whose entropy falls within the specified top quantile contribute to the loss. Default is 1.0, i.e. low-entropy tokens are not filtered. See the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: log the entropy dynamics during training. Default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).

##### Reward function parameters
See the [documentation](./GRPO/DeveloperGuide/奖励函数.md) for the built-in reward functions.

cosine reward parameters
- cosine_min_len_value_wrong: reward assigned at the minimum length when the generated answer is wrong. Default is -0.5.
- cosine_max_len_value_wrong: reward assigned at the maximum length when the generated answer is wrong. Default is 0.0.
31 changes: 30 additions & 1 deletion docs/source/Megatron-SWIFT/命令行参数.md
@@ -244,7 +244,7 @@ LoRA training:


**DPO parameters**:
- ref_load: load path of the ref_model. Must be provided when using the DPO/KTO algorithm with full-parameter training. Default is None, i.e. it is set to `load`.
- ref_load: load path of the ref_model. Must be provided when using the DPO/GRPO/KTO algorithm with full-parameter training. Default is None, i.e. it is set to `load`.
- ref_adapter_load: path from which to load the ref_adapter weights, default None. If you want to run DPO on LoRA weights produced by SFT, use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. To resume training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
- beta: same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig): controls the degree of deviation from the reference model; a higher beta means less deviation. For the IPO loss (loss_type="ipo"), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- 🔥rpo_alpha: parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) controlling the weight of the NLL term (i.e. the SFT loss) in the loss: `loss = dpo_loss + rpo_alpha * sft_loss`. The paper recommends `1.`. Default is `None`, i.e. the SFT loss is not added.
@@ -262,6 +262,35 @@ LoRA training:
- desirable_weight: weights the desirable loss by this coefficient to offset an imbalance between the numbers of desirable and undesirable samples. Default is `1.`.
- undesirable_weight: weights the undesirable loss by this coefficient to offset an imbalance between the numbers of desirable and undesirable samples. Default is `1.`.

**GRPO parameters**
- ref_load: same meaning as in DPO.
- ref_adapter_load: same meaning as in DPO.
- beta: KL regularization coefficient. Default is 0.04; when set to 0 the ref model is not loaded.
- epsilon: clip coefficient, default 0.2.
- epsilon_high: upper clip coefficient, default None. When set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
- overlong_filter: skip samples truncated for exceeding the maximum length so they do not contribute to the loss. Default is False.
- importance_sampling_level: controls how the importance sampling ratio is computed. Options are `token`, `sequence`, and `sequence_token`; default is `token`. See the [GSPO documentation](../Instruction/GRPO/AdvancedResearch/GSPO.md).
- batch-size related parameters (note: all of the following are completion-level; a worked example follows at the end of this section)
  - micro_batch_size: batch size per device, default 1.
  - global_batch_size: total batch size, equivalent to `micro_batch_size * data-parallel size * gradient accumulation steps`. Default is 16. This is the amount of training data used per weight update (the mini_batch_size).
  - generation_batch_size: sampling batch size; must be a multiple of global_batch_size. Defaults to global_batch_size.
  - steps_per_generation: number of optimization steps per generation round, i.e. the ratio of the sampling batch size to global_batch_size. Default is 1.
  - num_generations: number of completions sampled per prompt, the G value in the paper. The sampling batch size must be divisible by num_generations. Default is 8.
- reward_funcs: reward functions for the GRPO algorithm. Options are `accuracy`, `format`, `cosine`, `repetition`, and `soft_overlong`; see swift/plugin/orm.py. You can also define your own reward functions in the plugin. Default is `[]`.
- reward_weights: weight of each reward function. Must match the total number of reward functions and reward models. If None, all rewards are weighted equally with `1.0`.
- loss_type: type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo']; default is 'grpo'. See this [PR](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348) for details.
- vllm_mode parameters
  - vllm_gpu_memory_utilization: pass-through parameter to vLLM, default 0.9.
  - vllm_max_model_len: pass-through parameter to vLLM, default None.
  - vllm_enforce_eager: pass-through parameter to vLLM, default False.
  - vllm_limit_mm_per_prompt: pass-through parameter to vLLM, default None.
  - vllm_enable_prefix_caching: pass-through parameter to vLLM, default True.
  - sleep_level: release vLLM GPU memory during training. Options are [0, 1]; default is 0 (no release).
  - offload_optimizer: whether to offload optimizer states during vLLM inference. Default is False.
  - offload_model: whether to offload the model during vLLM inference. Default is False.

For the built-in reward function parameters, see the [documentation](../Instruction/命令行参数.md#奖励函数参数).
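
As a hedged illustration of the batch-size relations above (hypothetical values, not a recommended configuration):

```python
# Illustrative arithmetic for the Megatron GRPO batch-size parameters
# (hypothetical values; not taken from a real config).
micro_batch_size = 1
data_parallel_size = 4
gradient_accumulation_steps = 4
global_batch_size = micro_batch_size * data_parallel_size * gradient_accumulation_steps  # 16 completions per optimizer step

steps_per_generation = 2
generation_batch_size = global_batch_size * steps_per_generation  # 32 completions sampled per rollout round
num_generations = 8                                               # G completions per prompt

assert generation_batch_size % global_batch_size == 0
assert generation_batch_size % num_generations == 0
prompts_per_round = generation_batch_size // num_generations            # 4 distinct prompts per rollout round
optimizer_steps_per_round = generation_batch_size // global_batch_size  # == steps_per_generation
```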

**RM parameters**:
- center_rewards_coefficient: coefficient used to incentivize the reward model to output rewards with zero mean; see this [paper](https://huggingface.co/papers/2312.09244). Recommended value: 0.01.

11 changes: 11 additions & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -553,6 +553,15 @@ The meanings of the following parameters can be referenced [here](https://huggin

#### GRPO Arguments
- beta: KL regularization coefficient; default 0.04. Setting it to 0 disables the reference model.
- epsilon: epsilon value for clipping. Default is 0.2.
- epsilon_high: Upper clip coefficient, default is None. When set, it forms a clipping range of [epsilon, epsilon_high] together with epsilon.
- delta: Delta value for the upper clipping bound in two-sided GRPO. Recommended to be > 1 + epsilon. This method was introduced in the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291).
- overlong_filter: Skip overlong truncated samples, which will not be included in loss calculation. Default is False.
- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
- max_resample_times: Maximum number of resampling rounds when dynamic_sample is enabled. Default is 3.
- top_entropy_quantile: Only tokens whose entropy ranks within the specified top quantile are included in the loss calculation. The default is 1.0, which means low-entropy tokens are not filtered. For details, refer to the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
- importance_sampling_level: Controls how the importance sampling ratio is computed. Options are `token`, `sequence`, and `sequence_token`. In `token` mode, the raw per-token log-probability ratios are used. In `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged to produce a single ratio per sequence. In `sequence_token` mode (GSPO-token), the sequence-level ratio is re-attached to each token so that token-level advantages can still be applied. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level importance sampling to stabilize training. The default is `token`. See the [GSPO documentation](./GRPO/AdvancedResearch/GSPO.md) for details.
- per_device_train_batch_size: The training batch size per device. In GRPO, this refers to the batch size of completions during training.
- per_device_eval_batch_size: The evaluation batch size per device. In GRPO, this refers to the batch size of completions during evaluation.
- generation_batch_size: Batch size to use for generation. It must be a multiple of `num_processes * per_device_train_batch_size` and defaults to the effective training batch size: `per_device_train_batch_size * num_processes * gradient_accumulation_steps` (see the worked example below).
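
A hedged worked example of this bookkeeping (hypothetical values, not a recommended configuration):

```python
# Illustrative arithmetic for the GRPO generation batch size
# (hypothetical values; not taken from a real config).
per_device_train_batch_size = 4
num_processes = 8                   # data-parallel workers
gradient_accumulation_steps = 2
num_generations = 8                 # completions sampled per prompt (G)

generation_batch_size = per_device_train_batch_size * num_processes * gradient_accumulation_steps  # 64
assert generation_batch_size % (num_processes * per_device_train_batch_size) == 0
assert generation_batch_size % num_generations == 0
prompts_per_round = generation_batch_size // num_generations  # 8 distinct prompts per rollout round
```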
@@ -615,6 +624,8 @@ The hyperparameters for the reward function can be found in the [Built-in Reward
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
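
The top_entropy_quantile filter described above can be pictured with the following hedged sketch (illustrative tensors, not the ms-swift implementation):

```python
# Illustrative sketch of top_entropy_quantile masking (not the ms-swift source).
import torch

entropy = torch.rand(2, 6)                      # per-token entropy of the policy distribution
valid = torch.ones(2, 6, dtype=torch.bool)      # non-padding tokens
top_entropy_quantile = 0.2                      # keep only the top 20% highest-entropy tokens

threshold = torch.quantile(entropy[valid], 1 - top_entropy_quantile)
entropy_mask = (entropy >= threshold) & valid   # tokens allowed to contribute to the loss
# with top_entropy_quantile = 1.0 the threshold is the minimum entropy, so no token is filtered
```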


##### Reward function parameters
Refer to the [documentation](./GRPO/DeveloperGuide/reward_function.md) for built-in reward functions.

cosine reward function arguments
- cosine_min_len_value_wrong (default: -0.5): Reward value corresponding to the minimum length when the answer is incorrect.
35 changes: 34 additions & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -259,7 +259,7 @@ LoRA Training:
- use_rslora: Default is `False`. Whether to use `RS-LoRA`.

**DPO Parameters**
- ref_load: The loading path for the reference model. This must be provided when using DPO/KTO algorithms with full-parameter training. Defaults to `None`, which means it will be set to the same value as `load`.
- ref_load: The loading path for the reference model. This must be provided when using DPO/GRPO/KTO algorithms with full-parameter training. Defaults to `None`, which means it will be set to the same value as `load`.
- ref_adapter_load: The path to load the ref_adapter weights, default is `None`. If you want to use LoRA weights generated from SFT for DPO, please use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. For resuming training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
- beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model. A higher beta value indicates less deviation from the reference model. For the IPO loss function (`loss_type="ipo"`), beta is the regularization parameter as mentioned in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- 🔥rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e., the SFT loss) in the loss function, where `loss = dpo_loss + rpo_alpha * sft_loss`. The paper recommends setting it to `1.`. The default value is `None`, meaning the SFT loss is not included by default.
@@ -280,6 +280,39 @@ LoRA Training:
**RM Parameters**:
- center_rewards_coefficient: A coefficient used in reward model (RM) training to incentivize the model to output rewards with zero mean. See this [paper](https://huggingface.co/papers/2312.09244) for details. Recommended value: 0.01.

**GRPO Parameters**
- ref_load: Same meaning as in DPO.
- ref_adapter_load: Same meaning as in DPO.
- beta: KL regularization coefficient, default is 0.04. When set to 0, the reference model is not loaded.
- epsilon: Clip coefficient, default is 0.2.
- epsilon_high: Upper clip coefficient, default is None. When set, forms a clipping range [epsilon, epsilon_high] together with epsilon.
- overlong_filter: Skips samples that are truncated due to excessive length and excludes them from loss computation. Default is False.
- importance_sampling_level: Controls the level at which importance sampling ratios are computed. Options are `token`, `sequence`, and `sequence_token`. Default is `token`. See [GSPO Documentation](../Instruction/GRPO/AdvancedResearch/GSPO.md) for details.
- Batch size related parameters (Note: all are completion-level)
  - micro_batch_size: Batch size per device, default is 1.
  - global_batch_size: Total batch size, equivalent to `micro_batch_size * data parallelism size * gradient accumulation steps`. Default is 16. Corresponds to the mini_batch_size (number of training samples per weight update).
  - generation_batch_size: Sampling batch size, must be a multiple of global_batch_size. Default equals global_batch_size.
  - steps_per_generation: Number of optimization steps per generation round, i.e., the ratio of generation_batch_size to global_batch_size. Default is 1.
  - num_generations: Number of samples generated per prompt (the "G" value in the paper). generation_batch_size must be divisible by num_generations. Default is 8.
- reward_funcs: Reward functions used in the GRPO algorithm. Options include `accuracy`, `format`, `cosine`, `repetition`, and `soft_overlong`, defined in swift/plugin/orm.py. You can also customize your own reward functions in the plugin. Default is `[]`.
- reward_weights: Weights assigned to each reward function. Must match the total number of reward functions and reward models. If None, all rewards are equally weighted with `1.0`. A sketch of how the weighted rewards feed the GRPO advantages follows this list.
- loss_type: Type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo']. Default is 'grpo'. See this [PR](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348) for details.
- vLLM Parameters
  - vllm_gpu_memory_utilization: Pass-through parameter to vLLM, default is 0.9.
  - vllm_max_model_len: Pass-through parameter to vLLM, default is None.
  - vllm_enforce_eager: Pass-through parameter to vLLM, default is False.
  - vllm_limit_mm_per_prompt: Pass-through parameter to vLLM, default is None.
  - vllm_enable_prefix_caching: Pass-through parameter to vLLM, default is True.
  - sleep_level: Release vLLM GPU memory during training. Options are [0, 1], default is 0 (no release).
  - offload_optimizer: Whether to offload optimizer states during vLLM inference. Default is False.
  - offload_model: Whether to offload model weights during vLLM inference. Default is False.

For built-in reward function parameters, refer to the [documentation](../Instruction/GRPO/DeveloperGuide/reward_function.md).
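
For orientation, here is a hedged sketch (illustrative shapes and values, not the ms-swift implementation) of how the weighted rewards are typically turned into group-normalized GRPO advantages:

```python
# Illustrative sketch: combining reward_funcs outputs with reward_weights and
# normalizing per prompt group (not the ms-swift source).
import torch

rewards_per_func = torch.tensor([   # (num_completions, num_reward_funcs), e.g. accuracy and format
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
])
reward_weights = torch.tensor([1.0, 0.5])
rewards = (rewards_per_func * reward_weights).sum(dim=1)   # one scalar reward per completion

num_generations = 4                                        # all completions above share one prompt
grouped = rewards.view(-1, num_generations)
advantages = (grouped - grouped.mean(dim=1, keepdim=True)) / (grouped.std(dim=1, keepdim=True) + 1e-4)
advantages = advantages.view(-1)                           # scalar advantage per completion
```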

**Mcore-Bridge Parameters**

- 🔥load_safetensors: Defaults to False. Whether to load weights directly from safetensors.
2 changes: 2 additions & 0 deletions swift/llm/template/base.py
@@ -1275,6 +1275,8 @@ def _handle_megatron_cp(self, encoded: Dict[str, Any]) -> None:
    cp_size = self.sequence_parallel_size
    if not self.use_megatron or cp_size == 1:
        return
    if self.mode == 'vllm':  # skip for megatron grpo rollout
        return
    input_ids = encoded['input_ids']
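    # pad so the sequence length is a multiple of cp_size * 2
    # (context parallelism splits each sequence into 2 * cp_size chunks)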
    padding_len = math.ceil(len(input_ids) / (cp_size * 2)) * (cp_size * 2) - len(input_ids)
    input_ids += [self.tokenizer.pad_token_id] * padding_len