Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou
2025-10-10
Summary
This paper focuses on improving how Large Language Models (LLMs) learn to solve complex problems using a technique called Reinforcement Learning with Verifiable Rewards (RLVR). It tackles the issue of LLMs getting 'stuck' during training and failing to explore enough different solutions.
What's the problem?
When LLMs are trained with RLVR, they initially improve, but eventually their performance plateaus. This happens because the model starts favoring only the most obvious answers and stops considering less likely, but potentially correct, options. The researchers found that the model unintentionally eliminates these 'reasoning sparks' – low-probability words or phrases that are actually crucial for finding the right solution. Existing methods try to fix this by encouraging the model to stay random overall (keeping its entropy high), but this indiscriminate randomness can amplify irrelevant tokens and make training unstable.
What's the solution?
The researchers developed a new method called Low-probability Regularization (Lp-Reg). This technique doesn't just encourage randomness; instead, it identifies and protects those valuable 'reasoning sparks'. It works by creating a slightly modified version of the model's predicted distribution over next tokens: it filters out what it presumes to be 'noise' (unhelpful, very-low-probability tokens) and re-normalizes over the remaining candidates, which amplifies the probability of the surviving low-probability options. The model is then gently regularized toward this cleaner proxy distribution using KL divergence, a standard measure of the difference between two probability distributions, which shields the reasoning sparks from being eliminated.
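The filter-then-regularize idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the noise threshold, the direction of the KL term, and the function name are all assumptions made here for clarity.

```python
import math

def lp_reg_proxy_and_kl(probs, noise_threshold=1e-3):
    """Illustrative sketch of the Lp-Reg proxy construction.

    Given a next-token probability distribution `probs`, zero out
    presumed-noise tokens below `noise_threshold`, re-normalize the
    survivors into a proxy distribution, and compute KL(proxy || policy)
    over the surviving support. This penalizes the policy for driving
    surviving low-probability tokens ('reasoning sparks') toward zero.
    The threshold value and KL direction are assumptions, not the
    paper's exact choices.
    """
    # Step 1: filter out presumed noise tokens.
    kept = [p if p >= noise_threshold else 0.0 for p in probs]

    # Step 2: re-normalize the remaining candidates into a proxy.
    total = sum(kept)
    proxy = [p / total for p in kept]

    # Step 3: KL divergence from the proxy to the original policy,
    # summed only over tokens that survived the filter.
    kl = sum(q * math.log(q / p) for q, p in zip(proxy, probs) if q > 0)
    return proxy, kl
```

Note how re-normalization slightly amplifies every surviving token, including the low-probability reasoning sparks, so the KL term pulls the policy toward keeping them alive rather than toward uniform randomness.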
Why it matters?
This research is important because it allows LLMs to train for much longer and achieve better results on challenging tasks, specifically math problems. By preventing the model from prematurely discarding potentially useful ideas, Lp-Reg leads to a significant improvement in accuracy compared to previous methods, making LLMs more reliable and capable problem-solvers.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term 'reasoning sparks'. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.