Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu Cheng, Tao Lin
2026-02-13
Summary
This paper investigates how to make large language models better at complex reasoning tasks, specifically ones where they need to consider multiple possibilities within a single reasoning trace before arriving at an answer.
What's the problem?
Large language models often struggle with tasks that require exploring different ideas and checking their validity. The issue is that thorough exploration requires long reasoning traces, yet the probability of actually *producing* those long traces shrinks exponentially as generation proceeds (see the toy calculation below). Models fall into a 'shallow exploration trap': they quickly settle on an initial idea instead of investigating alternatives.
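As a rough, hypothetical illustration of why long traces are rarely sampled (a toy calculation, not taken from the paper): if the model keeps extending its reasoning with some per-token continuation probability p, the chance of reaching length L scales like p^L.

```python
# Toy illustration (not from the paper): with a hypothetical per-token
# continuation probability p, the chance of sampling a reasoning trace of
# length L decays exponentially -- the "shallow exploration trap" in miniature.
p = 0.999  # assumed per-token continuation probability (illustrative)
for L in (100, 1000, 4000):
    print(f"L={L:5d}  P(length >= L) ~= {p ** L:.3f}")
# L=  100  P(length >= L) ~= 0.905
# L= 1000  P(length >= L) ~= 0.368
# L= 4000  P(length >= L) ~= 0.018
```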
What's the solution?
The researchers developed a technique called Length-Incentivized Exploration. They add a reward that encourages the model to generate longer reasoning paths while also penalizing it for repeating itself (a rough sketch of such a reward follows below). This pushes the model to explore more diverse possibilities and cover more 'states' in its reasoning process, leading to more robust conclusions.
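The snippet below is a minimal sketch of this idea, assuming a per-response reward of the form correctness plus a capped length bonus minus a repetition penalty; the function names, n-gram-based redundancy measure, and coefficients are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's implementation): reward longer reasoning
# traces while penalizing redundancy, measured here via repeated n-grams.

def repeated_ngram_fraction(tokens, n=4):
    """Fraction of n-grams in the trace that duplicate an earlier n-gram."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def exploration_reward(tokens, is_correct, max_len=8192, alpha=0.5, beta=1.0):
    """Correctness reward + capped length incentive - redundancy penalty."""
    correctness = 1.0 if is_correct else 0.0
    length_bonus = alpha * min(len(tokens) / max_len, 1.0)   # favor longer traces
    redundancy_penalty = beta * repeated_ngram_fraction(tokens)  # discourage repetition
    return correctness + length_bonus - redundancy_penalty
```

Under this kind of shaping, a trace can only collect the length bonus by adding genuinely new content, since verbatim repetition raises the redundancy term and cancels the gain.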
Why it matters?
This work is important because it improves the ability of language models to tackle challenging problems that require careful thought and consideration of multiple options. The improvements shown across different models and tasks suggest this technique could be widely applicable, making these models more reliable and effective in real-world scenarios.
Abstract
Achieving effective test-time scaling requires models to engage in In-Context Exploration: the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration. This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that our method effectively incentivizes in-context exploration. As a result, it achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.