OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen
2025-12-12
Summary
This paper introduces a new way to check whether large language models are reasoning correctly, aiming to make them more reliable on complicated problems that require multiple steps of thought.
What's the problem?
Large language models are getting better at complex tasks, but it's hard to know if they're arriving at the right answer *for the right reasons*. Current methods either only check the final answer, which doesn't reveal errors in the reasoning process, or struggle to accurately identify mistakes within the detailed steps of reasoning because getting people to manually check these steps is expensive and time-consuming.
What's the solution?
The researchers created a new verification method called the Outcome-based Process Verifier, or OPV. Instead of judging every intermediate step of a long chain of thought, OPV verifies the summarized rationale behind the final outcome, checking whether that reasoning actually supports the result; this keeps verification accurate while making it cheap enough to run at scale. To train OPV, they used an iterative active learning process: in each round, the system identifies the reasoning chains it is most uncertain about, asks experts to annotate those cases, and then uses that feedback to improve its ability to verify future reasoning. Each round trains the next verifier on those annotations with Rejection Fine-Tuning, followed by reinforcement learning with verifiable rewards.
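As a concrete (and deliberately simplified) picture of that loop, the Python sketch below shows one way an uncertainty-driven annotation round could be wired up. Everything here is an assumption for illustration: the paper does not publish this code, the names (`verifier_uncertainty`, `expert_annotate`, `active_learning_round`) are placeholders, the uncertainty measure is a simple agreement score across repeated verifier samples, and the real model calls and RFT/RLVR training are reduced to stubs.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Case:
    problem: str
    rationale: str                 # summarized rationale extracted from the long CoT
    label: Optional[int] = None    # expert label: 1 = valid rationale, 0 = flawed

def verifier_uncertainty(verifier, case, n_samples=8):
    """Sample the verifier several times; low agreement across samples
    marks the case as uncertain and worth sending to an expert."""
    votes = [verifier(case) for _ in range(n_samples)]
    p = sum(votes) / n_samples
    return int(p >= 0.5), max(p, 1.0 - p)   # (majority vote, confidence)

def active_learning_round(verifier, pool, budget, train_step):
    # 1) Score the unlabeled pool and keep the least confident cases.
    scored = [(case, *verifier_uncertainty(verifier, case)) for case in pool]
    scored.sort(key=lambda t: t[2])          # lowest confidence first
    selected = [case for case, _, _ in scored[:budget]]

    # 2) Expert annotation of the selected cases (stubbed below).
    for case in selected:
        case.label = expert_annotate(case)

    # 3) Keep verifier samples that agree with the expert label for
    #    Rejection Fine-Tuning; RFT + RLVR training is stubbed here.
    rft_data = [c for c in selected if verifier(c) == c.label]
    return train_step(verifier, rft_data, selected)

# --- stubs so the sketch runs end to end --------------------------------
def expert_annotate(case): return random.randint(0, 1)
def toy_verifier(case): return random.randint(0, 1)
def toy_train(verifier, rft_data, labeled): return verifier   # RFT + RLVR placeholder

pool = [Case(f"problem {i}", f"rationale {i}") for i in range(100)]
new_verifier = active_learning_round(toy_verifier, pool, budget=10, train_step=toy_train)
```

The point the sketch tries to highlight is that expert effort is spent only on the cases the current verifier cannot decide confidently, which is what keeps annotation costs low across rounds.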
Why it matters?
This work is important because it allows for more reliable verification of complex reasoning in large language models without needing a huge amount of human effort. The OPV system performs better than existing methods and even outperforms much larger models, leading to more accurate results when these models are used to solve problems. It’s a step towards building AI systems we can trust to not only give the right answer, but also to explain *how* they got there.
Abstract
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is inseparable from the automated oversight provided by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) struggle to reliably detect errors in complex long CoTs, limited by the scarcity of high-quality annotations caused by the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process behind outcomes summarized from long CoTs, achieving both accurate and efficient verification and enabling large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the cases the current best OPV is most uncertain about are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3% as the compute budget scales.
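To make the last claim concrete, one common way a verifier can collaborate with a policy model as the compute budget grows is verification-weighted best-of-N selection: sample more candidate solutions, score each summarized rationale with the verifier, and return the answer with the highest total score. The sketch below is a generic illustration of that pattern under our own assumptions, not the paper's exact evaluation protocol; `policy_generate` and `opv_verify` are placeholder stubs standing in for real model calls.

```python
import random
from collections import Counter

def policy_generate(problem: str) -> tuple[str, str]:
    """Placeholder for the policy model: returns (final answer, rationale summary)."""
    answer = random.choice(["17", "42", "42"])
    return answer, f"summarized reasoning that ends in {answer}"

def opv_verify(problem: str, rationale: str) -> float:
    """Placeholder for OPV: estimated probability that the rationale supports the answer."""
    return random.random()

def verified_best_of_n(problem: str, n: int) -> str:
    """Sample n candidates and pick the answer with the highest verification-weighted vote."""
    scores = Counter()
    for _ in range(n):
        answer, rationale = policy_generate(problem)
        scores[answer] += opv_verify(problem, rationale)
    return scores.most_common(1)[0][0]

print(verified_best_of_n("an AIME-style problem", n=16))
```

As n grows, more compute is traded for a higher chance that at least one sampled solution is both correct and verifiably well-reasoned, which is the regime in which the abstract reports the AIME2025 gains.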