
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen

2025-12-12


Summary

This paper introduces a new way to check whether large language models are reasoning correctly, focusing on making these checks more reliable without requiring huge amounts of human labeling effort.

What's the problem?

Large language models are getting better at complex tasks, but verifying *how* they arrive at an answer is hard. Current methods either check only the final answer, which doesn't catch errors in the reasoning process, or try to check the reasoning step-by-step, but this requires a lot of expensive human labeling to know what's right and wrong. Basically, it's difficult to automatically and accurately assess the logic a model uses to solve a problem, especially when the reasoning is long and complicated.

What's the solution?

The researchers created a new verification method called the Outcome-based Process Verifier, or OPV. Instead of checking every step of a long chain of thought, it looks at the summarized outcome of the reasoning and checks whether the rationale behind that outcome actually holds up. To make the verifier better over time, they use an active learning process: the system identifies the cases it is most unsure about, asks experts to label just those, and then uses that feedback to improve its verification ability. This iterative loop, combined with Rejection Fine-Tuning (RFT) and reinforcement learning with verifiable rewards (RLVR), allows OPV to learn effectively with far less human input.
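
To make that loop concrete, here is a minimal Python sketch of uncertainty-based active learning as described above. The names (score_uncertainty, expert_annotate, train_rft_rlvr) are illustrative placeholders rather than the paper's actual code or API; in the real system, a large verifier model is trained with Rejection Fine-Tuning and RLVR instead of these stubs.

import random

def score_uncertainty(verifier, case):
    # Placeholder: how unsure the current verifier is about this case,
    # e.g. the entropy of its accept/reject probability. Random here for illustration.
    return random.random()

def expert_annotate(cases):
    # Placeholder: human experts judge whether each rationale is actually valid.
    return [(case, random.choice([True, False])) for case in cases]

def train_rft_rlvr(verifier, labeled_cases):
    # Placeholder: Rejection Fine-Tuning on the new labels, followed by RLVR.
    return verifier

def active_learning_loop(verifier, unlabeled_pool, rounds=3, budget=100):
    # Each round: rank unlabeled cases by uncertainty, have experts label the
    # top ones, and train the next-round verifier on that feedback.
    for _ in range(rounds):
        ranked = sorted(unlabeled_pool,
                        key=lambda c: score_uncertainty(verifier, c),
                        reverse=True)
        batch, unlabeled_pool = ranked[:budget], ranked[budget:]
        verifier = train_rft_rlvr(verifier, expert_annotate(batch))
    return verifier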

Why it matters?

This work is important because it makes it easier to build more trustworthy and reliable large language models. By improving the ability to verify reasoning, we can ensure these models aren't just giving correct answers, but are doing so for the right reasons. The OPV method achieves better results than existing methods and even outperforms larger models, while also being more efficient in terms of the amount of human labeling needed. This could lead to faster progress in developing AI systems we can truly depend on.

Abstract

Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the automated oversight provided by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) struggle to reliably detect errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases for the current best OPV are annotated and subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out benchmark, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
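
The claim that OPV boosts policy models as the compute budget scales suggests a verifier-guided test-time scaling setup. Below is a minimal sketch of one common pattern, best-of-N selection; whether the paper uses exactly this scheme is an assumption, and policy_sample and verifier_score are hypothetical stand-ins for the policy model and the OPV verifier.

from typing import Callable, List

def best_of_n(problem: str,
              policy_sample: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n: int = 8) -> str:
    # Sample n candidate solutions from the policy model and return the one
    # the verifier scores highest; accuracy typically improves as n grows.
    candidates: List[str] = [policy_sample(problem) for _ in range(n)]
    scores = [verifier_score(problem, c) for c in candidates]
    return candidates[scores.index(max(scores))]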