Not sure if this is a bug per se; it's possible that I'm missing something here. If this is a known, fixable issue, it would be great to note it in the GRPOTrainer docs.
Reproduction
It seems like `grad_norm` eventually becomes unstable and spikes higher and higher after >10k steps in GRPO.
I've tried both single-GPU and multi-GPU training, as well as various values for `beta`, batch size, and learning rate. I have not tested whether this problem happens without vLLM, though.
Weirdly enough, this does not affect the reward, which keeps going up; only the `grad_norm` and `clip_ratio` metrics are affected.
I was wrong; I had `beta=0.0` in all my experiments. Setting `beta=0.001` was enough to prevent the gradient explosion. Perhaps we shouldn't suggest that option so prominently in the docs?
KL coefficient. If `0.0`, the reference model is not loaded, reducing memory usage and improving training speed, but may be numerically unstable for long training runs.
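For reference, here is a minimal sketch of what the workaround looks like, assuming TRL's `GRPOConfig` exposes `beta` directly; the other arguments shown are illustrative, not the exact values from my runs:

```python
# Hypothetical sketch: keep a small non-zero KL coefficient so the policy stays
# tethered to the reference model on long training runs. With beta=0.0 the
# reference model is not loaded at all, which saves memory but (per this issue)
# can let grad_norm blow up after many steps.
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="grpo-checkpoints",       # illustrative path
    beta=0.001,                          # non-zero KL penalty instead of 0.0
    learning_rate=1e-6,                  # illustrative value
    per_device_train_batch_size=4,       # illustrative value
)

# trainer = GRPOTrainer(model=..., reward_funcs=..., args=config, train_dataset=...)
# trainer.train()
```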
Training args (the relevant ones):
System Info
Checklist