Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

Haizhong Zheng, Jiawei Zhao, Beidi Chen

2025-10-07

Summary

This paper focuses on improving how large language models learn through a method called reinforcement learning, specifically making the learning process more efficient and scalable.

What's the problem?

Reinforcement learning is great for improving language models, but a common approach requires constantly generating new data with every update to the model. This is slow and doesn't work well when you try to use data collected earlier in the process, because the model changes and the old data becomes less relevant. Existing methods either perform poorly with older data or completely fall apart when trying to use it.

What's the solution?

The researchers discovered that older (stale) data isn't necessarily *bad*: it can still be useful if handled correctly. They developed a new technique called M2PO (Second-Moment Trust Policy Optimization), which constrains how much weight is given to potentially unreliable data. It identifies and masks the extreme, high-variance tokens that destabilize the learning process, while still allowing the model to learn from the generally informative older data. Essentially, it filters out the noise without throwing out the signal.
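To make the "filter the noise, keep the signal" idea concrete, here is a minimal sketch of second-moment-constrained masking. It is an illustration, not the paper's implementation: the exact moment definition, the greedy masking order, and the threshold `tau` are all assumptions made for this example.

```python
import numpy as np

def m2_mask(logp_new, logp_old, tau=0.04):
    """Mask the extreme importance-weight outliers among tokens.

    Tokens whose log importance ratio log(pi_new / pi_old) inflates
    the second moment of the kept ratios above a trust threshold tau
    are dropped, largest-magnitude first. (Illustrative sketch only.)
    """
    log_ratio = np.asarray(logp_new) - np.asarray(logp_old)
    mask = np.ones_like(log_ratio, dtype=bool)
    # Visit tokens from largest to smallest |log ratio|.
    order = np.argsort(-np.abs(log_ratio))
    for idx in order:
        # Stop once the constraint holds (or one token remains).
        if np.mean(log_ratio[mask] ** 2) <= tau or mask.sum() == 1:
            break
        mask[idx] = False
    return mask
```

With near-on-policy tokens and one extreme outlier, only the outlier is masked, so almost all of the stale batch still contributes to the update.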

Why does it matter?

This work is important because it allows language models to learn more efficiently by reusing older data, which saves time and resources. It enables training larger models and using more data overall, leading to potentially better performance. The method proves that you can effectively train these models even when the data used for learning is significantly outdated, opening up new possibilities for scaling up reinforcement learning for language models.

Abstract

Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
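Read loosely, the trust constraint in the abstract can be written as a bound on the second moment of the token-level importance weights. The form below is an assumption for illustration; the paper's exact definition may differ:

$$
r_t = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)},
\qquad
\mathcal{M}_2 = \mathbb{E}_t\!\left[\left(\log r_t\right)^2\right] \le \tau ,
$$

where tokens contributing most to $\mathcal{M}_2$ are masked until the bound holds, suppressing only the extreme outliers while keeping the rest of the stale rollout informative.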