Haizhong Zheng1, Jiawei Zhao2, Beidi Chen1
1Carnegie Mellon University, 2Meta AI
TL;DR Our work shows that stale data can be as informative as on-policy data if exploited properly. We introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to stabilize off-policy training. Extensive evaluation across six model scales (1.7B–32B) demonstrates that M2PO achieves stable off-policy training even with data stale by at least 256 model updates, matching on-policy performance.
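To make the core idea concrete, the sketch below shows one way a second-moment constraint on importance weights could look at the token level: the most extreme importance ratios are masked until the second moment of the remaining (log) ratios drops below a threshold `tau`. This is a minimal illustration only; the function names, the choice of statistic, and the masking order are our assumptions here and may differ from the formulation in the paper.

```python
import torch

def second_moment_mask(logp_new, logp_old, tau=0.04):
    """Illustrative second-moment masking (not the official M2PO implementation).

    Greedily masks the tokens with the most extreme importance weights until the
    second moment of the (log) importance ratios over the remaining tokens
    falls below the threshold `tau`. All tensors are 1D over tokens in a batch.
    """
    log_ratio = logp_new - logp_old                    # per-token log importance weight
    mask = torch.ones_like(log_ratio, dtype=torch.bool)
    # Visit tokens from most extreme to least extreme log-ratio.
    order = torch.argsort(log_ratio.abs(), descending=True)
    for idx in order:
        second_moment = (log_ratio[mask] ** 2).mean()  # second moment of remaining tokens
        if second_moment <= tau:
            break
        mask[idx] = False                              # drop the current most extreme token
    return mask


def m2po_style_loss(logp_new, logp_old, advantages, tau=0.04):
    """Importance-weighted policy-gradient loss with second-moment masking."""
    mask = second_moment_mask(logp_new.detach(), logp_old, tau)
    ratio = torch.exp(logp_new - logp_old)             # per-token importance weight
    per_token = -ratio * advantages
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

Unlike a fixed per-token clipping range, a batch-level second-moment constraint of this kind only removes the handful of extreme outliers, which is consistent with the much lower clipping ratio shown for M2PO in Figure 1 (right).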
- [2025.10.2] Blog post released: Prosperity Before Collapse – M2PO.
- [2025.10.2] Paper preprint available on arXiv.
Figure 1 Comparison of on-policy GRPO and off-policy training under a staleness of 256 model updates on Qwen2.5-32B. Left: Standard GRPO degrades when trained on stale rollouts, while removing the trust region (GRPO no TR) reveals a clear prosperity-before-collapse phenomenon. In contrast, M2PO trains stably and matches on-policy performance even under high staleness. Right: The token clipping ratio shows that M2PO dramatically reduces clipping events compared to GRPO while avoiding training collapse.
The official implementation of M2PO will be released here soon!