Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting
Yunzhen Feng, Parag Jain, Anthony Hartshorn, Yaqi Duan, Julia Kempe
2025-10-13
Summary
This paper focuses on improving how large language models (LLMs) are trained to reason, specifically with a technique called Reinforcement Learning with Verifiable Rewards (RLVR). It identifies a weakness in a common method used for this training and proposes a new approach to fix it.
What's the problem?
Currently, a popular method called Group Relative Policy Optimization (GRPO) is used to train LLMs with rewards. However, GRPO wastes substantial compute when it samples a group of responses in which *none* are correct. Since every response in such a group receives the same (zero) reward, the group-relative advantage is zero for all of them, so no gradient flows and the model learns nothing from those attempts; the compute spent generating them is effectively thrown away. The question is whether we can extract a useful signal even from these 'negative' groups where all answers are wrong.
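To see why all-incorrect groups are wasted, here is a minimal sketch (not the paper's code) of GRPO's group-relative advantage, which is each response's reward minus the group mean:

```python
# Illustrative sketch: why an all-incorrect group yields no learning
# signal under a group-relative advantage.

def group_advantages(rewards):
    """Reward minus the group mean. (GRPO also normalizes by the group's
    std; with identical rewards the numerator is already zero, so the
    signal vanishes either way.)"""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

mixed_group = [1.0, 0.0, 0.0, 1.0]     # some correct, some incorrect
negative_group = [0.0, 0.0, 0.0, 0.0]  # all incorrect

print(group_advantages(mixed_group))     # [0.5, -0.5, -0.5, 0.5] -> gradient signal
print(group_advantages(negative_group))  # [0.0, 0.0, 0.0, 0.0] -> no gradient
```

With a mixed group the advantages separate correct from incorrect responses; with a negative group every advantage is exactly zero.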
What's the solution?
The researchers realized that the maximum-likelihood objective used in reward modeling, which maximizes the likelihood of correct answers, can be reinterpreted as also penalizing incorrect answers. They developed a new method called Likelihood Estimation with Negative Samples (LENS) that adds a penalty to incorrect responses, where the size of the penalty depends on how *confident* the model was when producing the wrong answer. If the model is very sure it is right but is actually wrong, it receives a larger penalty. This lets the system learn from its mistakes even in groups where every sampled answer is incorrect, turning wasted attempts into useful learning signals.
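A rough sketch of this idea, with assumed details (the function name, the use of a normalized sequence likelihood as "confidence", and the penalty scale are illustrative, not the paper's exact formulation):

```python
# Hypothetical reward shaping in the spirit of LENS: incorrect answers get
# a negative reward proportional to the model's confidence in them, so
# confident mistakes are penalized more than hesitant ones.

def lens_style_reward(correct, confidence, penalty_scale=1.0):
    """`confidence` stands in for the model's likelihood of its own
    response (an assumption for illustration)."""
    if correct:
        return 1.0
    return -penalty_scale * confidence

# An all-incorrect group is no longer uninformative:
confidences = [0.9, 0.4, 0.1, 0.7]
rewards = [lens_style_reward(False, c) for c in confidences]
mean = sum(rewards) / len(rewards)
advantages = [r - mean for r in rewards]
print(advantages)  # nonzero: less-confident mistakes now rank above confident ones
```

Because the penalties differ across responses, the group-relative advantages are no longer all zero, so even a fully incorrect group produces a gradient that pushes the model away from its most confident errors.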
Why it matters?
This research is important because it makes training LLMs more efficient and effective. By 'rescuing' those previously wasted negative groups, the model can learn faster and perform better on challenging reasoning tasks, like solving math problems. This means we can get more out of the computing resources we use to train these powerful AI systems.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as Likelihood Estimation with Negative Samples (LENS). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.