
Commit 7576e47

[BugFix] Fix IFEval GRPO runs (#3012)
1 parent 169fe1f commit 7576e47


8 files changed, +147 -100 lines changed

sota-implementations/grpo/README.md
Lines changed: 4 additions & 3 deletions

@@ -83,7 +83,7 @@ for data in collector: # Data collection loop
     loss = loss_fn(batch)
     loss.backward()
     optimizer.step()
-    # Weight updte
+    # Weight update
     weight_updater.push_weights(policy_training)
 ```

@@ -119,8 +119,9 @@ Key differences:
    - Async: Each piece of data is processed a non-deterministic number of times.

 4. **Weight updates**:
-   - Sync: Weights are updated befor every collection of data
-   - Async: Weights are updated at a given interval (in gradient steps)
+   - Sync: Weights are updated before every collection of data.
+   - Async: Weights are updated at a given interval (in gradient steps). This requires synchronization between the training
+     and inference processes, and frequent updates will cause both workers to wait for each other often.

 The async mode offers better performance by:
 - Running data collection and optimization concurrently
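To make the sync/async distinction concrete, here is a minimal sketch of the two weight-update schedules. It reuses the names from the README snippet above (`collector`, `loss_fn`, `optimizer`, `weight_updater`, `policy_training`) and a `weight_update_frequency` parameter mirroring the mode configs; `replay_buffer` iteration and `collector.start()` are assumptions, and this is illustrative only, not the actual `grpo-sync.py` / `grpo-async.py` code.

```python
# Illustrative sketch only; not the actual grpo-sync.py / grpo-async.py scripts.
def train_sync(collector, replay_buffer, loss_fn, optimizer, weight_updater, policy_training):
    for data in collector:                     # one collection per outer step
        replay_buffer.extend(data)
        for batch in replay_buffer:            # assumes an iterable buffer
            loss = loss_fn(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Sync: weights are pushed before every new collection of data.
        weight_updater.push_weights(policy_training)


def train_async(collector, replay_buffer, loss_fn, optimizer, weight_updater,
                policy_training, weight_update_frequency=10):
    collector.start()                          # assumed non-blocking; fills replay_buffer
    for step, batch in enumerate(replay_buffer):
        loss = loss_fn(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # Async: weights are pushed every `weight_update_frequency` gradient steps,
        # which is the synchronization point between trainer and inference workers.
        if (step + 1) % weight_update_frequency == 0:
            weight_updater.push_weights(policy_training)
```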

sota-implementations/grpo/config/grpo_gsm8k.yaml
Lines changed: 22 additions & 10 deletions

@@ -16,26 +16,38 @@ env:

 # Base model configuration
 model:
+  # A 3B model is sufficient for this task:
   name: Qwen/Qwen2.5-3B
   compile: false

 # Base training configuration - will be merged with mode-specific settings
 train:
-  # Fields defined in mode configs (async.yaml and sync.yaml)
-  # mixed_precision: true # Whether to use mixed precision training
-  # epochs: 1 # Number of training epochs
-  # steps_per_batch: 32 # Number of steps per batch
-  # total_dialog_turns: 1_000_000 # Total number of dialog turns to collect
-  # optim_batch_size: 2 # Batch size for optimization
-  # gradient_accumulation_steps: 1 # Number of gradient accumulation steps
-  # kl_coef_in_loss: true # Whether to include KL coefficient in loss
-  # sync: false # Default to async, will be overridden by mode configs
-  # buffer_size: 128 # Size of replay buffer
+  # Some fields are defined in mode configs (async.yaml and sync.yaml)
+  # The following fields are task-specific:
   exp_name: "grpo-gsm8k"

+  # Whether to use mixed precision training.
+  mixed_precision: true
+
+  # Total number of dialog turns to collect during training.
+  total_dialog_turns: 100_000
+
+  # Number of steps in each batch. Higher values will make the inference step slower, but won't use more GPU memory.
+  steps_per_batch: 32
+
+  # Number of gradient accumulation steps. Higher values will use less GPU memory (compared with bigger batches and lower gradient_accumulation_steps),
+  # but will make the optimization step slower.
+  gradient_accumulation_steps: 1
+
   # Fields used by both scripts but with different semantics
   checkpoint_frequency: 100 # Save checkpoint every N steps/batches

+  # Batch size for optimization. Higher values will use more GPU memory.
+  optim_batch_size: 1
+
+  # Whether to include the KL coefficient in the loss function. Alternatively, the KL ref-to-train term will be added to the reward.
+  kl_coef_in_loss: true
+
   # KL coefficients for the KL divergence to the reference and inference policies
   kl_to_ref_coeff: 1e-2
   kl_to_inference_coeff: 0.0
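The three training fields added above interact: `steps_per_batch` fixes how many samples each collection yields, while `optim_batch_size` and `gradient_accumulation_steps` determine the effective optimization batch and how many optimizer updates each collection produces. A back-of-the-envelope helper (hypothetical, not part of the repo, and assuming each collected step becomes one sample with a single pass over the buffer per epoch):

```python
# Hypothetical helper to reason about the numbers in grpo_gsm8k.yaml; not repo code.
def optimisation_budget(steps_per_batch, optim_batch_size,
                        gradient_accumulation_steps, epochs=1):
    samples = steps_per_batch * epochs                      # samples seen per collection
    micro_batches = samples // optim_batch_size             # forward/backward passes
    effective_batch = optim_batch_size * gradient_accumulation_steps
    optimizer_updates = micro_batches // gradient_accumulation_steps
    return effective_batch, optimizer_updates

# grpo_gsm8k.yaml: steps_per_batch=32, optim_batch_size=1, gradient_accumulation_steps=1
print(optimisation_budget(32, 1, 1))   # -> (1, 32): effective batch of 1, 32 optimizer updates per collection
```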

sota-implementations/grpo/config/grpo_ifeval.yaml
Lines changed: 31 additions & 19 deletions

@@ -10,44 +10,56 @@ env:
   dataset: ifeval # choices: [gsm8k, ifeval]
   # Number of environments to run in parallel. This determines the batch size passed to vLLM.
   # More envs consume more GPU memory.
-  num_envs: 2
+  num_envs: 4
   # Number of times to repeat the same prompt for GRPO. This does not affect the GPU memory usage.
   repeats: 16

 # Base model configuration
 model:
-  name: Qwen/Qwen2.5-3B
+  # A 7B model works well for this task.
+  name: Qwen/Qwen2.5-7b
   compile: false

 # Base training configuration - will be merged with mode-specific settings
 train:
-  # Fields defined in mode configs (async.yaml and sync.yaml)
-  # mixed_precision: true # Whether to use mixed precision training
-  # epochs: 1 # Number of training epochs
-  # steps_per_batch: 32 # Number of steps per batch
-  # total_dialog_turns: 1_000_000 # Total number of dialog turns to collect
-  # optim_batch_size: 2 # Batch size for optimization
-  # gradient_accumulation_steps: 1 # Number of gradient accumulation steps
-  # kl_coef_in_loss: true # Whether to include KL coefficient in loss
-  # sync: false # Default to async, will be overridden by mode configs
-  # buffer_size: 128 # Size of replay buffer
+  # Some fields are defined in mode configs (async.yaml and sync.yaml)
+  # The following fields are task-specific:
   exp_name: "grpo-ifeval"

+  # Whether to use mixed precision training.
+  mixed_precision: true
+
+  # Total number of dialog turns to collect during training.
+  total_dialog_turns: 100_000
+
+  # Number of steps in each batch. Higher values will make the inference step slower, but won't use more GPU memory.
+  steps_per_batch: 64
+
+  # Number of gradient accumulation steps. Higher values will use less GPU memory (compared with bigger batches and lower gradient_accumulation_steps),
+  # but will make the optimization step slower.
+  gradient_accumulation_steps: 4
+
   # Fields used by both scripts but with different semantics
   checkpoint_frequency: 100 # Save checkpoint every N steps/batches

+  # Batch size for optimization. Higher values will use more GPU memory.
+  optim_batch_size: 2
+
+  # Whether to include the KL coefficient in the loss function. Alternatively, the KL ref-to-train term will be added to the reward.
+  kl_coef_in_loss: false
+
   # KL coefficients for the KL divergence to the reference and inference policies
-  kl_to_ref_coeff: 1e-2
-  kl_to_inference_coeff: 0.0
+  kl_to_ref_coeff: 1e-1
+  kl_to_inference_coeff: 1e-1
   entropy_coeff: 0.01

   # Fields used only by grpo-async.py / grpo-sync.py
-  logging_frequency: 10 # Log metrics every N steps
+  logging_frequency: 1 # Log metrics every N steps - here, at each optimization step

 # Training model configuration
 train_model:
   gradient_checkpointing: true # Enabled for memory efficiency
-  num_devices: 1 # Number of devices to use
+  num_devices: 4 # Number of devices to use
   lora:
     enabled: true # Using LoRA for memory efficiency
     r: 8 # LoRA rank - controls capacity of adaptations

@@ -60,7 +72,7 @@ train_model:

 # Inference model configuration
 inference_model:
-  num_devices: 1 # Number of devices to use
+  num_devices: 2 # Number of devices to use
   quantization:
     enabled: false # Enable 4-bit quantization for base model
   attn_implementation: sdpa # Using flash attention for memory efficiency

@@ -74,7 +86,7 @@ inference_model:
 # Reference model configuration
 ref_model:
   gradient_checkpointing: false # Always false, no backprop
-  num_devices: 1 # Number of devices to use
+  num_devices: 2 # Number of devices to use
   lora:
     enabled: true # Using LoRA for memory efficiency
     r: 8 # LoRA rank - controls capacity of adaptations

@@ -89,7 +101,7 @@ ref_model:
 optimizer:
   name: AdamW
   lr: 1e-5
-  clip_grad_norm: 1.0
+  clip_grad_norm: 10.0
   weight_decay: 0.0

 # Ray configuration
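Two of the IFEval changes above are worth reading together: the device allocation moves to a 4/2/2 split across the training, inference, and reference models, and `kl_coef_in_loss` flips to `false`, so the reference KL is folded into the reward rather than the loss. The snippet below is only a schematic of the two KL placements described by the config comments, with illustrative argument names; the actual loss implementation in the repo may differ.

```python
import torch

def apply_ref_kl(policy_logprob: torch.Tensor, ref_logprob: torch.Tensor,
                 reward: torch.Tensor, loss: torch.Tensor,
                 kl_to_ref_coeff: float, kl_coef_in_loss: bool):
    # Per-token KL estimate between the training policy and the reference policy.
    kl_to_ref = policy_logprob - ref_logprob
    if kl_coef_in_loss:
        # grpo_gsm8k.yaml (kl_coef_in_loss: true): the KL enters the loss as an explicit penalty.
        loss = loss + kl_to_ref_coeff * kl_to_ref.mean()
    else:
        # grpo_ifeval.yaml (kl_coef_in_loss: false): the KL penalty is folded into the reward instead.
        reward = reward - kl_to_ref_coeff * kl_to_ref
    return reward, loss
```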

sota-implementations/grpo/config/mode/async.yaml
Lines changed: 2 additions & 16 deletions

@@ -3,23 +3,9 @@ train:
   # Mode-specific setting
   sync: false # Force asynchronous mode

-  # Shared training settings
-  # Whether to use mixed precision training.
-  mixed_precision: true
   # Number of epochs to train for, every time a batch is collected. Per se, not directly used in async - aside from computing the total number of steps.
   epochs: 1
-  # Number of steps in each batch. Higher values will cause the inference step to be slower, but won't use more GPU memory.
-  steps_per_batch: 16
-  # Leave buffer_size empty to use steps_per_batch in async mode
-  buffer_size:
-  # Total number of dialog turns to collect during training.
-  total_dialog_turns: 100_000
-  # Batch size for optimization. Higher values will use more GPU memory.
-  optim_batch_size: 1
-  # Number of gradient accumulation steps. Higher values will use less GPU memory (comparing with bigger batches and lower gradient_accumulation_steps),
-  # but will make the optimization step slower.
-  gradient_accumulation_steps: 4
-  # Whether to include the KL coefficient in the loss function. Alternatively, the KL ref-to-train will be added to the reward.
-  kl_coef_in_loss: true
+  # The buffer size can be controlled in async mode
+  buffer_size: 128
   # Update policy weights every N steps - can be set to any positive integer in async mode
   weight_update_frequency: 10

sota-implementations/grpo/config/mode/sync.yaml
Lines changed: 0 additions & 14 deletions

@@ -3,23 +3,9 @@ train:
   # Mode-specific setting
   sync: true # Force synchronous mode

-  # Shared training settings
-  # Whether to use mixed precision training.
-  mixed_precision: true
   # Number of epochs to train for, every time a batch is collected.
   epochs: 1
-  # Number of steps in each batch. Higher values will cause the inference step to be slower, but won't use more GPU memory.
-  steps_per_batch: 64
   # Leave buffer_size empty to use steps_per_batch in sync mode
   buffer_size:
-  # Total number of dialog turns to collect during training.
-  total_dialog_turns: 100_000
-  # Batch size for optimization. Higher values will use more GPU memory.
-  optim_batch_size: 1
-  # Number of gradient accumulation steps. Higher values will use less GPU memory (comparing with bigger batches and lower gradient_accumulation_steps),
-  # but will make the optimization step slower.
-  gradient_accumulation_steps: 1
-  # Whether to include the KL coefficient in the loss function. Alternatively, the KL ref-to-train will be added to the reward.
-  kl_coef_in_loss: true
   # Update policy weights every N steps - must be left empty in sync mode
   weight_update_frequency:
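After this commit the mode files only own the fields that genuinely differ between modes. In particular, `buffer_size` is left empty in sync mode (falling back to `steps_per_batch`) but set explicitly to 128 in async mode. A small sketch of that resolution logic, under the assumption that an empty YAML field arrives as `None` (hypothetical helper, not the repo's code):

```python
# Hypothetical helper mirroring the buffer_size comments in mode/sync.yaml and mode/async.yaml.
def resolve_buffer_size(buffer_size, steps_per_batch):
    if buffer_size is None:
        # Empty field: the buffer holds exactly one collected batch, as in sync mode.
        return steps_per_batch
    # Otherwise (async mode) the buffer is sized independently of the collection size.
    return buffer_size

print(resolve_buffer_size(None, 32))   # sync.yaml style  -> 32
print(resolve_buffer_size(128, 32))    # async.yaml style -> 128
```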
