JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao
2026-01-14
Summary
This paper focuses on improving how large language models solve problems, especially in math, using a technique called Reinforcement Learning with Verifiable Rewards. It addresses these models' tendency to ramble, trying many incorrect answers before finding the right one.
What's the problem?
Large language models, when trained to simply get the right answer, tend to explore solutions in a very inefficient way. They essentially guess a lot and check if the guess is correct, leading to long and unnecessary explanations. While you can try to limit how much they write, this often cuts off important steps in their reasoning process, making it hard to balance being concise and being correct.
What's the solution?
The researchers propose a new method called JudgeRLVR, which works in two steps. First, the model is trained to *evaluate* potential solutions, learning to tell good answers from bad ones. Then, the model is trained to *generate* solutions, but it starts from the knowledge it gained during the evaluation stage. This helps the model focus its efforts on promising approaches and avoid wasting time on things it already knows are wrong.
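The two-stage idea can be illustrated with a deliberately tiny toy: a "judge" first learns which solution strategies tend to be correct, and the "generator" is then initialized from those learned scores so it samples promising strategies instead of guessing uniformly. Everything here (the strategy names, the verifier, the update rule) is an illustrative assumption, not the paper's actual implementation.

```python
import math
import random

# Toy sketch of judge-then-generate (NOT the paper's code).
# Stage 1: learn to *judge* candidate strategies against a verifiable label.
# Stage 2: reuse the judge's scores to bias *generation* toward good strategies.

random.seed(0)

STRATEGIES = ["guess", "brute_force", "algebra", "factorize"]

def is_correct(strategy: str) -> bool:
    # Toy verifier: only the structured strategies solve the task.
    return strategy in {"algebra", "factorize"}

# Stage 1: perceptron-style judge training.
# The judge gets a verifiable reward (was its verdict right?) and only
# updates its score for a strategy when its verdict was wrong.
scores = {s: 0.0 for s in STRATEGIES}
for _ in range(200):
    s = random.choice(STRATEGIES)
    verdict = scores[s] > 0               # judge's current call on strategy s
    if verdict != is_correct(s):          # wrong verdict -> adjust the score
        scores[s] += 1.0 if is_correct(s) else -1.0

# Stage 2: generator initialized from the judge's scores, sampling
# strategies with probability proportional to exp(score).
def generate(scores, temperature=1.0):
    weights = [math.exp(scores[s] / temperature) for s in STRATEGIES]
    r, acc = random.random() * sum(weights), 0.0
    for s, w in zip(STRATEGIES, weights):
        acc += w
        if r <= acc:
            return s
    return STRATEGIES[-1]

attempts = [generate(scores) for _ in range(100)]
hit_rate = sum(is_correct(s) for s in attempts) / len(attempts)
print(f"hit rate: {hit_rate:.2f}")       # well above the 0.50 uniform baseline
```

The point of the sketch is the initialization: stage 2 does not start from scratch, so the search space is already pruned toward strategies the judge rates highly, mirroring (in miniature) why the paper's generator wastes fewer tokens on trial-and-error.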
Why it matters?
This research is important because it shows a way to make large language models not only more accurate but also more efficient in their problem-solving. The model performs better on both problems it has seen before and new, unseen problems, suggesting it’s learning a more general ability to reason and solve problems effectively. This could lead to AI systems that are more helpful and less prone to getting stuck in endless loops of trial and error.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to Vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality–efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.