RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li
2025-08-07
Summary
This paper introduces RL-PLUS, a method that improves how large language models learn through reinforcement learning by combining the model's own exploration with external data, allowing it to discover new problem-solving paths while making better use of existing knowledge.
What's the problem?
When large language models are trained with reinforcement learning, they often hit a limit called capability boundary collapse: the model's ability to solve problems stops improving with more training because its space of possible actions is enormous and the reward feedback it receives is sparse.
What's the solution?
The authors propose RL-PLUS, a hybrid-policy optimization method that blends internal learning from the model's own rollouts with external data, using Multiple Importance Sampling and an exploration-based advantage function. This helps the model discover better reasoning paths and avoid getting stuck at its current capability limits.
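The hybrid-policy idea rests on Multiple Importance Sampling, which weights each sample by how likely the different samplers were to produce it, so data from the model's own rollouts and from an external source can be combined in one unbiased estimate. The paper does not spell out its exact weighting here, so the sketch below shows the standard balance heuristic; the function name and arguments are illustrative assumptions, not the authors' API:

```python
import math

def balance_heuristic(logp_policy: float, logp_external: float,
                      n_policy: int, n_external: int) -> float:
    """Balance-heuristic MIS weight assigned to the policy sampler:

        w = n_pi * p_pi(x) / (n_pi * p_pi(x) + n_mu * p_mu(x))

    computed in log space for numerical stability. This is the classic
    MIS heuristic, used here only to illustrate the general technique.
    """
    a = math.log(n_policy) + logp_policy      # log of n_pi * p_pi(x)
    b = math.log(n_external) + logp_external  # log of n_mu * p_mu(x)
    m = max(a, b)                             # shift before exp to avoid overflow
    wa = math.exp(a - m)
    wb = math.exp(b - m)
    return wa / (wa + wb)

# With equal sample counts and equal likelihoods, each sampler gets half the weight.
w = balance_heuristic(logp_policy=-2.0, logp_external=-2.0,
                      n_policy=8, n_external=8)
```

A sample's contribution to the training objective is then scaled by its weight, so trajectories that only the external data could plausibly have produced still inform the update without being double-counted against the model's own rollouts.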
Why it matters?
This matters because enhancing the reasoning ability of large language models allows them to solve more complex problems and handle tasks beyond their original capability range, making AI systems more capable in real-world applications.
Abstract
RL-PLUS, a hybrid-policy optimization approach, enhances LLM reasoning capabilities by integrating Multiple Importance Sampling and an Exploration-Based Advantage Function, outperforming existing RLVR methods on various benchmarks and resolving capability boundary collapse.