R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
2025-10-13
Summary
This paper investigates how well current AI reasoning models can handle complex problems that require many steps to solve, and introduces a new way to both test and improve these models' abilities in this area.
What's the problem?
Existing benchmarks for AI reasoning models focus on simple, single-step questions. They don't challenge the AI to work through a series of connected problems over an extended stretch of reasoning, so we don't know how well these models can actually plan through complicated scenarios that require multiple steps and carrying information forward from earlier stages.
What's the solution?
The researchers created a new testing method called R-HORIZON. It builds complex questions by linking multiple problems together so that each one depends on the answer to the previous one, forcing the AI to reason across a longer 'horizon' of steps. They then used R-HORIZON to test several advanced AI models and found that even the strongest struggled with these multi-step problems. To help the models improve, they used R-HORIZON to create training data for reinforcement learning, which taught the models to better allocate their 'thinking' budget across problems and improved their accuracy, even on simpler single-problem tasks.
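The core idea of query composition can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: toy arithmetic problems are chained so that each step's answer becomes an input to the next, which is what forces a model to solve the steps in order instead of answering them independently.

```python
# Illustrative sketch of R-HORIZON-style "query composition": chain simple
# problems so each step's answer feeds the next step's question.
# The templates and solver functions here are invented for illustration.

def make_step(template, solve):
    """Pair a question template with a function that computes its answer."""
    return {"template": template, "solve": solve}

# Three toy problems; {x} is filled in with the previous step's answer.
steps = [
    make_step("Start with {x}. Add 7. What do you get?", lambda x: x + 7),
    make_step("Multiply {x} by 3. What do you get?", lambda x: x * 3),
    make_step("Subtract 5 from {x}. What is the final answer?", lambda x: x - 5),
]

def compose_query(steps, seed):
    """Build one long-horizon query plus its verifiable final answer.

    Because each sub-question depends on the previous answer, the composed
    query spans a longer reasoning horizon than any single problem.
    """
    lines, value = [], seed
    for i, step in enumerate(steps, 1):
        lines.append(f"Step {i}: " + step["template"].format(x=value))
        value = step["solve"](value)  # ground-truth answer after this step
    return "\n".join(lines), value

query, answer = compose_query(steps, seed=10)
print(answer)  # (10 + 7) * 3 - 5 = 46
```

The verifiable final answer is what makes composed queries usable as reinforcement-learning data with verified rewards: a reward can be assigned by checking the model's final output against the computed ground truth.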
Why it matters?
This work matters because it exposes a key weakness in current AI reasoning models: their inability to handle long, multi-step problem chains effectively. By providing a better way to test and train these models, the researchers are helping to build AI that can tackle more realistic and challenging tasks, ultimately leading to more capable and reliable systems.
Abstract
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.