Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning
Chenghao Yang, Lin Gui, Chenxiao Yang, Victor Veitch, Lizhu Zhang, Zhuokai Zhao
2025-10-08
Summary
This paper focuses on improving how large language models learn through a method called Reinforcement Learning with Verifiable Rewards (RLVR), specifically by making the exploration phase – where the model tries out different options – more effective.
What's the problem?
When training these models with reinforcement learning, it's hard to find the right balance between trying new things (exploration) and sticking to what already works well (exploitation). If you let the model explore *too* much, the responses become random and nonsensical. But if you don't let it explore *enough*, it gets stuck and doesn't learn to reason better. Simple methods like adjusting a 'temperature' setting struggle to get this balance right, either sacrificing quality or limiting discovery.
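To make the 'temperature' knob concrete, here is a minimal sketch (not from the paper) of how temperature rescales a model's next-token probabilities: a high temperature flattens the distribution toward randomness, while a low temperature concentrates it on the most likely token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by a sampling temperature.

    Higher temperatures flatten the distribution (more exploration);
    lower temperatures sharpen it (more exploitation).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Example logits for a 3-token vocabulary (illustrative values).
logits = [2.0, 1.0, 0.5]
hot = softmax_with_temperature(logits, 1.5)   # flatter: closer to uniform
cold = softmax_with_temperature(logits, 0.3)  # sharper: mass on the top token
```

With a single fixed temperature, the whole response is sampled from one of these regimes, which is exactly the tradeoff the paper identifies.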
What's the solution?
The researchers propose a new strategy called Exploratory Annealed Decoding, or EAD. The core idea is that it's most important to be creative and explore different possibilities at the *beginning* of generating a response, as those first few words really set the direction. Then, as the response continues, the model should focus on producing a high-quality, coherent output. EAD achieves this by starting with a high 'temperature' (encouraging exploration) and gradually lowering it (focusing on exploitation) as the response is generated.
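The idea can be sketched as a per-token temperature schedule inside the decoding loop. The linear schedule and the endpoint values below are illustrative assumptions, not the paper's exact recipe:

```python
import math
import random

def annealed_temperature(step, max_steps, t_start=1.2, t_end=0.3):
    """Linearly anneal the sampling temperature from t_start down to t_end.

    Hypothetical schedule: the shape (linear) and the endpoint values
    are assumptions for illustration, not taken from the paper.
    """
    frac = min(step / max_steps, 1.0)
    return t_start + frac * (t_end - t_start)

def sample_token(logits, temperature):
    """Sample one token index from temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

# Decoding loop sketch: early tokens are sampled "hot" (exploratory),
# later tokens "cold" (exploitative). `fake_logits` stands in for a
# model forward pass, which is out of scope here.
fake_logits = [2.0, 1.0, 0.5]
tokens = []
for step in range(10):
    t = annealed_temperature(step, max_steps=10)
    tokens.append(sample_token(fake_logits, t))
```

The design choice here mirrors the paper's intuition: the first tokens fix the semantic direction of the response, so that is where diversity pays off, while the low final temperature keeps late tokens close to the model's preferred distribution.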
Why does it matter?
This work is important because it offers a simple, yet powerful, way to improve the reasoning abilities of large language models. EAD is easy to implement and works well with existing training methods, making it a practical solution for enhancing the performance of these models and helping them learn more efficiently.
Abstract
Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence's semantic direction. EAD implements an intuitive **explore-at-the-beginning, exploit-at-the-end** strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.