Haizhong Zheng1, Jiawei Zhao2, Beidi Chen1
1Carnegie Mellon University, 2Meta AI
TL;DR Our work shows that stale data can be as informative as on-policy data if exploited properly. We introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to stabilize off-policy training. Extensive evaluation across six model scales (1.7B–32B) demonstrates that M2PO achieves stable off-policy training even with data stale by at least 256 model updates, matching on-policy performance.
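To make the core idea concrete, the sketch below shows one way a second-moment constraint on importance weights could look at the token level: the most extreme importance ratios are masked until the second moment of the remaining (log) ratios drops below a threshold `tau`. This is a minimal illustration only; the function names, the choice of statistic, and the masking order are our assumptions here and may differ from the formulation in the paper.

```python
import torch

def second_moment_mask(logp_new, logp_old, tau=0.04):
    """Illustrative second-moment masking (not the official M2PO implementation).

    Greedily masks the tokens with the most extreme importance weights until the
    second moment of the (log) importance ratios over the remaining tokens
    falls below the threshold `tau`. All tensors are 1D over tokens in a batch.
    """
    log_ratio = logp_new - logp_old                    # per-token log importance weight
    mask = torch.ones_like(log_ratio, dtype=torch.bool)
    # Visit tokens from most extreme to least extreme log-ratio.
    order = torch.argsort(log_ratio.abs(), descending=True)
    for idx in order:
        second_moment = (log_ratio[mask] ** 2).mean()  # second moment of remaining tokens
        if second_moment <= tau:
            break
        mask[idx] = False                              # drop the current most extreme token
    return mask


def m2po_style_loss(logp_new, logp_old, advantages, tau=0.04):
    """Importance-weighted policy-gradient loss with second-moment masking."""
    mask = second_moment_mask(logp_new.detach(), logp_old, tau)
    ratio = torch.exp(logp_new - logp_old)             # per-token importance weight
    per_token = -ratio * advantages
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

Unlike a fixed per-token clipping range, a batch-level second-moment constraint of this kind only removes the handful of extreme outliers, which is consistent with the much lower clipping ratio shown for M2PO in Figure 1 (right).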
- [2025.10.2] Blog post released: Prosperity Before Collapse – M2PO.
- [2025.10.2] Paper preprint available on arXiv.
Figure 1 Comparison of on-policy GRPO and off-policy training under a staleness of 256 model updates on Qwen2.5-32B. Left: Standard GRPO degrades when trained on stale rollouts, while removing the trust region (GRPO no TR) reveals a clear prosperity-before-collapse phenomenon. In contrast, M2PO trains stably and matches on-policy performance even under high staleness. Right: The token clipping ratio shows that M2PO dramatically reduces clipping events compared to GRPO while avoiding training collapse.
The official implementation of M2PO will be released here soon!