Infini-AI-Lab/M2PO
Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

Haizhong Zheng¹, Jiawei Zhao², Beidi Chen¹
¹Carnegie Mellon University, ²Meta AI

[Paper] | [Blog]

TL;DR Our work shows that stale data can be as informative as on-policy data when exploited properly. We introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of the importance weights to stabilize training. Extensive evaluation across six model scales (1.7B–32B) demonstrates that M2PO achieves stable off-policy training even with data stale by at least 256 model updates, matching on-policy performance.

🗞️ News

M2PO Overview

Figure 1: Comparison of on-policy GRPO and off-policy training under a staleness of 256 model updates on Qwen-2.5-32B. Left: standard GRPO degrades with stale rollouts, while removing the trust region (GRPO no TR) reveals a clear prosperity-before-collapse phenomenon; in contrast, M2PO trains stably and matches on-policy performance even under high staleness. Right: the token clipping ratio comparison shows that M2PO dramatically reduces clipping events relative to GRPO while avoiding training collapse.

The official implementation of M2PO will be released here soon!
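
Until then, the snippet below is a minimal, unofficial sketch of the idea described in the TL;DR: instead of clipping every off-policy token the way a PPO/GRPO-style trust region does, keep a batch-level budget on the second moment of the token-level importance weights and mask only the most extreme outlier tokens. Everything here (the function name `m2po_token_mask`, the budget `tau`, and the exact per-token statistic) is an assumption for illustration, not the released M2PO API.

```python
# Unofficial sketch of a second-moment constraint on importance weights.
# The official M2PO code is not released yet; names and the exact statistic
# used here are assumptions for illustration only.
import torch


def m2po_token_mask(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    tau: float = 0.04) -> torch.Tensor:
    """Keep the second moment of token-level importance weights under a budget.

    Rather than clipping every off-policy token, drop only the most extreme
    outlier tokens until the batch-level second-moment statistic falls
    below `tau`.

    Args:
        logp_new: per-token log-probs under the current policy, shape (B, T).
        logp_old: per-token log-probs under the stale rollout policy, shape (B, T).
        tau: assumed second-moment budget (hyperparameter, value illustrative).

    Returns:
        Boolean mask of shape (B, T); True marks tokens kept in the loss.
    """
    # Token-level log importance ratios log(pi_new / pi_old).
    log_ratio = (logp_new - logp_old).flatten()
    # Squared log-ratio as each token's contribution to the second moment
    # (one reasonable proxy; the released code may use a different statistic).
    contrib = log_ratio.pow(2)

    keep = torch.ones_like(contrib, dtype=torch.bool)
    # Greedily drop the largest contributors until the mean over kept tokens
    # is within budget. O(T^2) in the worst case, but fine for a sketch.
    for idx in torch.argsort(contrib, descending=True):
        if contrib[keep].mean() <= tau:
            break
        keep[idx] = False
    return keep.view_as(logp_new)
```

In a policy-gradient loss, such a mask would zero out the dropped outlier tokens while every remaining token, however off-policy, keeps its full importance weight, which is what lets stale rollouts stay informative instead of being clipped away.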
