
RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang, Xipeng Qiu

2025-11-11


Summary

This paper tackles a problem that arises when Reinforcement Learning is used to train large reasoning models, the kind that work through problems step by step: the models can become *too* good at the training tasks but then struggle with new, unseen problems.

What's the problem?

When you train these reasoning models using Reinforcement Learning, they can start to 'overfit'. Think of it like studying for a specific test and memorizing all the answers, but then being unable to apply that knowledge to slightly different questions. The models become too focused on the exact way to get rewards during training and forget how to solve problems in a more general way. They lose the ability to explore different approaches and essentially 'forget' good solutions they found earlier in the process.

What's the solution?

The researchers developed a framework called RLoop. It turns training into a repeating cycle of exploration and refinement. First, the model uses Reinforcement Learning to explore possible solutions from its current starting point. Then, the successful attempts are filtered out and saved as an expert dataset. That dataset is used to fine-tune the initial policy through a step called Rejection-sampling Fine-Tuning (RFT), producing a stronger starting point for the next round. The improved model then begins the cycle again, exploring and learning from its successes. This loop helps the model retain diverse strategies instead of collapsing onto one narrow approach.
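The cycle described above can be sketched in a few lines of Python. This is a toy sketch, not the paper's implementation: `sample_trajectory` and `rejection_sampling_finetune` are hypothetical stand-ins, and the "policy" is reduced to a single success probability so the loop runs end to end.

```python
import random

random.seed(0)

# Hypothetical stand-ins: the "policy" is just a success probability,
# and fine-tuning on more verified successes strengthens it.
def sample_trajectory(policy, prompt):
    solved = random.random() < policy
    return {"prompt": prompt, "reward": 1.0 if solved else 0.0}

def rejection_sampling_finetune(policy, expert_data):
    # RFT stand-in: more verified successes -> better next-round policy.
    return min(1.0, policy + 0.02 * len(expert_data))

def rloop(policy, prompts, iterations=3, rollouts=8):
    for _ in range(iterations):
        # 1. Explore: sample rollouts from the current policy.
        trajectories = [sample_trajectory(policy, p)
                        for p in prompts for _ in range(rollouts)]
        # 2. Filter: keep only trajectories with a verified reward.
        expert_data = [t for t in trajectories if t["reward"] > 0]
        # 3. Exploit: fine-tune on successes to re-initialize the
        #    policy as the starting point for the next iteration.
        policy = rejection_sampling_finetune(policy, expert_data)
    return policy

print(rloop(0.3, ["problem_1", "problem_2"]))
```

The key design point the sketch captures is step 3: instead of continuing RL from wherever the policy drifted, each round restarts from a policy fine-tuned only on verified successes, which is how RLoop converts transient policy diversity into lasting gains.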

Why it matters?

This research is important because it helps us build more reliable and adaptable AI systems. By preventing the models from 'forgetting' and improving their ability to generalize, we can create AI that performs better on a wider range of tasks. The experiments showed a 9% boost in average accuracy and an improvement of over 15% in pass@32 compared to vanilla RL, meaning this technique could lead to more powerful and useful AI applications.

Abstract

While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
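For context on the pass@32 number in the abstract: pass@k is the probability that at least one of k sampled solutions is correct. The abstract does not define how it is computed, but a standard unbiased estimator (the convention popularized by the Codex paper, assumed here) draws n samples, counts c correct ones, and evaluates:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n attempts of which c are correct, succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 sampled solutions, 10 verified correct -> estimate pass@32
print(pass_at_k(100, 10, 32))
```

A high pass@k relative to pass@1 indicates the model still generates diverse correct solutions across samples, which is exactly the diversity that RL overfitting erodes and that RLoop aims to preserve.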