
Commit 29d2bb0

fix grpo zero3 (#3104)
1 parent 43ee77c commit 29d2bb0

File tree: 14 files changed (+19, -11 lines)


README.md

Lines changed: 1 addition & 1 deletion
@@ -117,7 +117,7 @@ Running Environment:
| trl | >=0.13,<0.16 | 0.14.0 | RLHF |
| vllm | >=0.5.1 | 0.6.5 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
-| deepspeed | | 0.14.5 | Training |
+| deepspeed | >=0.14 | | Training |

For more optional dependencies, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/requirements/install_all.sh).

README_CN.md

Lines changed: 1 addition & 1 deletion
@@ -112,7 +112,7 @@ pip install -e .
| trl | >=0.13,<0.16 | 0.14.0 | RLHF |
| vllm | >=0.5.1 | 0.6.5 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
-| deepspeed | | 0.14.5 | Training |
+| deepspeed | >=0.14 | | Training |

For more optional dependencies, refer to [here](https://github.com/modelscope/ms-swift/blob/main/requirements/install_all.sh)

docs/source/GetStarted/SWIFT安装.md

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ pip install ms-swift==2.*
| trl | >=0.13,<0.16 | 0.14.0 | RLHF |
| vllm | >=0.5.1 | 0.6.5 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
-| deepspeed | | 0.14.5 | Training |
+| deepspeed | >=0.14 | | Training |

For more optional dependencies, refer to [here](https://github.com/modelscope/ms-swift/blob/main/requirements/install_all.sh)

docs/source/Instruction/GRPO.md

Lines changed: 1 addition & 0 deletions
@@ -84,6 +84,7 @@ A conversation between User and Assistant. The user asks a question, and the Ass
Hyperparameters
- num_generations: the number of samples per prompt, the G value in the paper; it needs to be divisible by per_device_eval_batch_size * nproc_per_node
- max_completion_length: the maximum generation length when sampling, default 512
+- ds3_gather_for_generation: applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, which speeds up generation. Disabling it allows training models that exceed a single GPU's VRAM, though generation becomes slower; disabling it is incompatible with vLLM generation. Default is True
- reward_funcs: reward functions that score the model's generations; four rule-based functions are built in (accuracy, format, cosine and repetition), see swift/plugin/orm.py
- reward_weights: the weight of each reward function. Must match the number of reward functions. If None, all rewards are weighted equally with `1.0`
- Note: if `--reward_model` is included in GRPO training, it is appended after the reward functions

docs/source/Instruction/命令行参数.md

Lines changed: 1 addition & 0 deletions
@@ -365,6 +365,7 @@ The reward model arguments are used in PPO and GRPO.
#### GRPO Arguments
- num_generations: the G value in the GRPO algorithm, default 8
- max_completion_length: the maximum generation length in the GRPO algorithm, default 512
+- ds3_gather_for_generation: applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, which speeds up generation. Disabling it allows training models that exceed a single GPU's VRAM, though generation becomes slower; disabling it is incompatible with vLLM generation. Default is True
- reward_funcs: reward functions for the GRPO algorithm; options are `accuracy`, `format`, `cosine` and `repetition`, see swift/plugin/orm.py. You can also define your own reward function in the plugin. Default is `[]`
- reward_weights: the weight of each reward function. Must match the number of reward functions. If None, all rewards are weighted equally with `1.0`
- Note: if `--reward_model` is included in GRPO training, it is appended after the reward functions

docs/source/Instruction/预训练与微调.md

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@
pip install ms-swift -U

# If using deepspeed zero2/zero3
-pip install deepspeed==0.14.5
+pip install deepspeed -U
```

## Pre-training
@@ -73,7 +73,7 @@ ms-swift uses a layered design; users can use the command-line interface,
- Merge LoRA cannot be applied to models trained with QLoRA, so QLoRA fine-tuning is not recommended: vLLM/LMDeploy inference acceleration cannot be used for inference and deployment. It is recommended to fine-tune with LoRA or full parameters, merge into complete weights, and then use GPTQ/AWQ/BNB for [quantization](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize)
- SWIFT sets `--gradient_checkpointing true` by default during training to save GPU memory, which slightly slows down training.
- If DDP training fails with `RuntimeError: Expected to mark a variable ready only once.`, additionally set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'` or train with DeepSpeed instead.
-- To use deepspeed, you need to install it: `pip install deepspeed==0.14.5`. DeepSpeed saves GPU memory but slightly slows down training.
+- To use deepspeed, you need to install it: `pip install deepspeed -U`. DeepSpeed saves GPU memory but slightly slows down training.
- If your machine has high-performance GPUs such as A100 and the model supports flash-attn, it is recommended to install [flash-attn](https://github.com/Dao-AILab/flash-attention/releases) and set `--attn_impl flash_attn`, which speeds up training and inference and slightly reduces GPU memory usage.

**How to debug:**

docs/source_en/GetStarted/SWIFT-installation.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ You can view the image [here](https://modelscope.cn/docs/intro/environment-setup
| trl | >=0.13,<0.16 | 0.14.0 | RLHF |
| vllm | >=0.5.1 | 0.6.5 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
-| deepspeed | | 0.14.5 | Training |
+| deepspeed | >=0.14 | | Training |

For more optional dependencies, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/requirements/install_all.sh).

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -376,6 +376,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
#### GRPO Arguments
- num_generations: The G value in the GRPO algorithm, default is 8.
- max_completion_length: The maximum generation length in the GRPO algorithm, default is 512.
+- ds3_gather_for_generation: This parameter applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. However, disabling this option allows training models that exceed the VRAM capacity of a single GPU, albeit at the cost of slower generation. Disabling this option is not compatible with vLLM generation. The default is True.
- reward_funcs: Reward functions in the GRPO algorithm; options include `accuracy`, `format`, `cosine` and `repetition`, as seen in `swift/plugin/orm.py`. You can also customize your own reward functions in the plugin. Default is `[]`.
- reward_weights: Weights for each reward function. Must match the number of reward functions. If `None`, all rewards are weighted equally with weight `1.0`.
- Note: If `--reward_model` is included in GRPO training, it is added to the end of the reward functions.

docs/source_en/Instruction/GRPO.md

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ Hyperparameters

- num_generations: The number of samples for each prompt, referred to as the G value in the paper; it needs to be divisible by per_device_eval_batch_size * nproc_per_node.
- max_completion_length: The maximum length for sampling generation, default is 512.
+- ds3_gather_for_generation: This parameter applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. However, disabling this option allows training models that exceed the VRAM capacity of a single GPU, albeit at the cost of slower generation. Disabling this option is not compatible with vLLM generation. The default is True.
- reward_funcs: Reward functions to score the results generated by the model. Includes built-in accuracy, format, cosine and repetition rule-based functions, detailed in the swift/plugin/orm.py file.
- reward_weights: Weights for each reward function. Must match the number of reward functions. If `None`, all rewards are weighted equally with weight `1.0`.
- Note: If `--reward_model` is included in GRPO training, it is added to the end of the reward functions.
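For intuition about what `ds3_gather_for_generation` controls, here is a minimal, hedged sketch of the underlying mechanism rather than ms-swift's actual code path: under ZeRO-3 every parameter is sharded across ranks, and DeepSpeed's `GatheredParameters` context can temporarily rebuild the full tensors so generation runs at full speed; skipping the gather keeps the weights partitioned, which is slower but never requires the whole model on one GPU. The `ds_engine` argument and the direct `generate` call below are illustrative assumptions.

```python
import deepspeed


def generate_with_optional_gather(ds_engine, input_ids, gather: bool = True):
    """Illustrative sketch of ds3_gather_for_generation for a ZeRO-3 engine.

    `ds_engine` is assumed to be a deepspeed.initialize(...) engine wrapping a
    Hugging Face causal LM; this is not the code used by the GRPO trainer.
    """
    module = ds_engine.module  # the underlying transformers model
    if gather:
        # Temporarily materialize the full (unsharded) weights, then generate.
        with deepspeed.zero.GatheredParameters(list(module.parameters())):
            return module.generate(input_ids, max_new_tokens=512)
    # Without gathering, ZeRO-3 fetches shards layer by layer during the forward
    # pass: much slower generation, but the full model never sits on one GPU.
    return module.generate(input_ids, max_new_tokens=512)
```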

docs/source_en/Instruction/Pre-training-and-Fine-tuning.md

Lines changed: 2 additions & 2 deletions
@@ -25,7 +25,7 @@ Refer to the [SWIFT installation documentation](../GetStarted/SWIFT-installation
pip install ms-swift -U

# If using deepspeed zero2/zero3
-pip install deepspeed==0.14.5
+pip install deepspeed -U
```

## Pre-training
@@ -77,7 +77,7 @@ Additionally, we offer a series of scripts to help you understand the training c
- Merging LoRA for models trained with QLoRA is not possible, so it is not recommended to use QLoRA for fine-tuning, as it cannot utilize vLLM/LMDeploy for inference acceleration during inference and deployment. It is recommended to use LoRA or full parameter fine-tuning, merge them into complete weights, and then use GPTQ/AWQ/BNB for [quantization](https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize).
- By default, SWIFT sets `--gradient_checkpointing true` during training to save memory, which may slightly slow down the training speed.
- If you are using DDP for training and encounter the error: `RuntimeError: Expected to mark a variable ready only once.`, please additionally set the parameter `--gradient_checkpointing_kwargs '{"use_reentrant": false}'` or use DeepSpeed for training.
-- To use DeepSpeed, you need to install it: `pip install deepspeed==0.14.5`. Using DeepSpeed can save memory but may slightly reduce training speed.
+- To use DeepSpeed, you need to install it: `pip install deepspeed -U`. Using DeepSpeed can save memory but may slightly reduce training speed.
- If your machine has high-performance GPUs like A100 and the model supports flash-attn, it is recommended to install [flash-attn](https://github.com/Dao-AILab/flash-attention/releases) and set `--attn_impl flash_attn`, as this will accelerate training and inference while slightly reducing memory usage.

**How to debug:**

swift/llm/argument/rlhf_args.py

Lines changed: 1 addition & 0 deletions
@@ -42,6 +42,7 @@ class PPOArguments:
class GRPOArguments(GRPOArgumentsMixin):
    num_generations: int = 8  # G in the GRPO paper
    max_completion_length: int = 512
+    ds3_gather_for_generation: bool = True
    reward_funcs: List[str] = field(default_factory=list)
    reward_weights: List[float] = None
    log_completions: bool = False
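As a usage illustration (a stand-alone sketch, not ms-swift's actual argument parser), a boolean dataclass field like the one added above is typically exposed as a command-line flag through `HfArgumentParser`-style parsing, so a launch script could pass `--ds3_gather_for_generation false` when ZeRO-3 weights should stay partitioned during generation. The `GRPOArgsSketch` class and `launch.py` name below are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import List, Optional

from transformers import HfArgumentParser


@dataclass
class GRPOArgsSketch:
    """Stand-in for the fields shown above; not the real GRPOArguments class."""
    num_generations: int = 8
    max_completion_length: int = 512
    ds3_gather_for_generation: bool = True
    reward_funcs: List[str] = field(default_factory=list)
    reward_weights: Optional[List[float]] = None
    log_completions: bool = False


if __name__ == '__main__':
    # e.g. `python launch.py --ds3_gather_for_generation false --num_generations 4`
    (args,) = HfArgumentParser(GRPOArgsSketch).parse_args_into_dataclasses()
    print(args.ds3_gather_for_generation)
```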

swift/llm/sampling/vanilla_sampler.py

Lines changed: 1 addition & 1 deletion
@@ -35,8 +35,8 @@ def __init__(self, *args, **kwargs):
            raise ValueError(f'Cannot find engine name: {self.args.sampler_engine}')
        self.infer_engine = None
        if _Engine:
-            self.template = self.args.get_model_processor(model=self.args.model, load_model=False)
            self.infer_engine = _Engine(self.args.model, model_type=self.args.model_type, **self.args.engine_kwargs)
+            self.infer_engine.default_template = self.template
        self.caches = self.read_cache()

    def read_cache(self):

swift/trainers/rlhf_trainer/grpo_trainer.py

Lines changed: 4 additions & 1 deletion
@@ -11,6 +11,7 @@
from accelerate.utils import broadcast_object_list, gather, gather_object
from transformers import PreTrainedModel
from trl import GRPOTrainer as HFGRPOTrainer
+from trl.models import unwrap_model_for_generation

from swift.llm import InferRequest, RequestConfig, to_device
from swift.plugin.orm import orms
@@ -201,7 +202,9 @@ def _prepare_inputs(self, inputs) -> Dict[str, Union[torch.Tensor, Any]]:
        is_multimodal = self.model.model_meta.is_multimodal
        if is_multimodal:
            models = self.template.remove_post_encode_hook()
-        outputs = self.engine.infer(inputs, self.request_config, use_tqdm=False)
+        with unwrap_model_for_generation(self.model, self.accelerator):
+            # same reference
+            outputs = self.engine.infer(inputs, self.request_config, use_tqdm=False)
        if is_multimodal:
            self.template.register_post_encode_hook(models)
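The core of this fix is the pattern below, shown as a hedged stand-alone sketch rather than the trainer's actual method: the inference engine wraps the same module object as the policy model (the "same reference" comment in the diff), so gathering the ZeRO-3 shards with trl's `unwrap_model_for_generation` for the duration of the `infer` call is enough for generation to see the full weights, and leaving the context re-partitions them before the training step. The `engine` and `infer_requests` names are taken from the diff; the function itself is illustrative.

```python
from trl.models import unwrap_model_for_generation


def infer_with_zero3_gather(model, accelerator, engine, infer_requests, request_config):
    """Illustrative sketch, assuming `engine` wraps the same module object as `model`."""
    # Gather the ZeRO-3 sharded policy weights; because the engine holds a
    # reference to the same module, its forward passes use the gathered weights.
    with unwrap_model_for_generation(model, accelerator):
        outputs = engine.infer(infer_requests, request_config, use_tqdm=False)
    # Exiting the context restores the sharded state before training continues.
    return outputs
```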

swift/trainers/rlhf_trainer/rlhf_mixin.py

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ def _save_load_context(trainer):
    finally:
        deepspeed_model.__dict__['module'] = _old_model
        deepspeed_model._modules['module'] = _old_model
-        trainer.model = deepspeed_model
+        trainer.model = _old_model


class RLHFTrainerMixin:
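To make the one-line change easier to read, here is a hedged, simplified reconstruction of the save/load-context pattern it belongs to, not the exact `_save_load_context` body: the DeepSpeed engine's inner `module` is swapped for the save/load step, and the `finally` block must restore `trainer.model` to the original unwrapped model rather than leave it pointing at the DeepSpeed wrapper. The `tmp_module` argument and the body of the `try` block are assumptions for illustration.

```python
from contextlib import contextmanager


@contextmanager
def save_load_context_sketch(trainer, deepspeed_model, tmp_module):
    """Simplified stand-in: swap the engine's module, then restore everything."""
    _old_model = deepspeed_model.module  # the original, unwrapped model
    try:
        deepspeed_model.__dict__['module'] = tmp_module
        deepspeed_model._modules['module'] = tmp_module
        trainer.model = tmp_module
        yield
    finally:
        deepspeed_model.__dict__['module'] = _old_model
        deepspeed_model._modules['module'] = _old_model
        # The fix: point trainer.model back at the plain model, not the wrapper.
        trainer.model = _old_model
```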
