Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
2025-10-07
Summary
This paper introduces a new method, Reinforce-Ada, to improve how large language models (LLMs) learn through reinforcement learning, specifically when tackling reasoning problems.
What's the problem?
Teaching an LLM to reason with reinforcement learning is tricky because the learning signal the model gets (the 'gradient') is often noisy and unreliable. The model is given a batch of questions (prompts) and learns from the answers it generates, but each question usually gets the same fixed number of sampled answers. For questions the model is unsure about, those few samples may all earn the same reward, leaving little useful signal and making training slow and unstable. Previous methods tried to fix this by deciding upfront how much sampling effort each question should get, but a one-shot allocation isn't always the best approach.
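The vanishing-signal problem can be made concrete with group-normalized advantages of the kind used by GRPO: when every sampled answer to a prompt gets the same reward, each reward equals the group mean and all advantages collapse to zero. The sketch below is illustrative only; the function name and the epsilon guard are our assumptions, not code from the paper.

```python
from statistics import fmean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages (illustrative): each reward minus the
    group mean, scaled by the group's standard deviation."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A prompt where every sampled answer fails gives no learning signal:
grpo_advantages([0, 0, 0, 0])   # every advantage is 0.0
```

With a mixed group like `[1, 0, 0, 0]`, the correct answer gets a positive advantage and the rest negative, so the prompt actually contributes to the update.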
What's the solution?
Reinforce-Ada solves this by monitoring how informative each question is *while* the model is learning, focusing sampling effort on the questions where the model is most uncertain or has the most potential to improve. Instead of deciding upfront how many tries each question gets, it keeps sampling answers until it is confident it has enough signal, then stops sampling for that question (a successive-elimination process). To make learning more stable, it assembles fixed-size groups of answers that mix successes and failures, and scores them against a consistent baseline computed from all the answers collected during sampling.
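The interleaved estimate-and-sample loop can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the stopping rule (stop once a prompt has produced both reward outcomes and at least a full group), and the round budget are ours, not the paper's exact algorithm.

```python
def adaptive_sample(prompts, generate, reward_fn, group_size=4, max_rounds=8):
    """Illustrative successive-elimination sampling: round by round,
    sample one answer per still-active prompt; retire a prompt once it
    has yielded diverse rewards and enough samples for a full group."""
    active = set(range(len(prompts)))
    pools = {i: [] for i in active}   # (answer, reward) pairs per prompt
    for _ in range(max_rounds):
        if not active:
            break
        for i in sorted(active):
            answer = generate(prompts[i])
            pools[i].append((answer, reward_fn(prompts[i], answer)))
            rewards_seen = {r for _, r in pools[i]}
            # Enough signal: both outcomes observed and a full group collected.
            if len(rewards_seen) > 1 and len(pools[i]) >= group_size:
                active.discard(i)
    # Keep a fixed-size group per prompt for the policy update.
    return {i: pool[:group_size] for i, pool in pools.items()}
```

Prompts that never produce diverse rewards simply exhaust the round budget, so effort naturally concentrates on the uncertain middle ground rather than on questions that are always solved or always failed.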
Why it matters?
This work is important because it makes reinforcement learning for LLMs more efficient and reliable. By smartly choosing which questions to focus on, Reinforce-Ada helps LLMs learn faster and perform better on reasoning tasks. This is a big step towards building more intelligent and capable AI systems that can truly understand and solve complex problems.
Abstract
Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
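One concrete reading of "advantage baselines using global statistics aggregated over the adaptive sampling phase" is that baseline statistics come from all rewards gathered for a prompt, while advantages are assigned only to the fixed-size group retained for the update. The sketch below follows that reading; the function name and normalization details are assumptions, not the released code.

```python
from statistics import fmean, pstdev

def group_advantages(all_rewards, group_rewards, eps=1e-8):
    """Illustrative: baseline mean/std are computed over ALL rewards
    collected during adaptive sampling, then applied to the retained
    fixed-size group, instead of using only the group's own statistics."""
    mu, sigma = fmean(all_rewards), pstdev(all_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

Because the baseline sees every sampled answer, two retained groups with identical rewards can still get different advantages if the full sampling histories behind them differ, which is what lets the extra samples sharpen the gradient estimate.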