RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi
2025-11-11
Summary
This paper introduces a new way to train large language models (LMs) to reason more effectively, using a system called Reinforcement Learning with Adaptive Verifiable Environments, or RLVE.
What's the problem?
Training language models to reason well is difficult: if the training problems are too easy, the model learns nothing new, and if they are too hard, it almost never succeeds and cannot improve. Traditional training methods use a fixed set of problems, which quickly becomes too easy as the model gets better, so the learning signal fades over time.
What's the solution?
The researchers created a system in which the training problems automatically adjust in difficulty based on how well the language model is performing. They built RLVE-Gym, a collection of 400 verifiable environments: each one procedurally generates reasoning problems and checks the model's answers with code. As the model improves, each environment shifts toward harder problems, keeping the learning process efficient and effective; a minimal sketch of this mechanism appears below. They then trained a language model with this system and compared it against simply continuing the model's original training.
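To make the adaptive loop concrete, here is a toy sketch of a verifiable environment that generates problems from a difficulty setting, verifies answers algorithmically, and nudges the difficulty toward a target success rate. The class name, the single integer difficulty knob, the thresholds, and the dummy policy are illustrative assumptions for this sketch, not the paper's actual RLVE-Gym interface.

```python
import random
from collections import deque

class AdaptiveSortingEnv:
    """Toy verifiable environment (illustrative only): procedurally generates
    list-sorting problems, checks answers exactly, and adapts difficulty
    toward a target success rate."""

    def __init__(self, difficulty=3, target_success=0.5, window=64):
        self.difficulty = difficulty          # here: length of the list to sort
        self.target_success = target_success  # keep problems neither too easy nor too hard
        self.recent = deque(maxlen=window)    # rolling record of recent rewards

    def generate(self):
        # Problem difficulty is controlled by a single integer knob.
        return [random.randint(0, 99) for _ in range(self.difficulty)]

    def verify(self, problem, answer):
        # Algorithmically verifiable reward: exact match against the ground truth.
        return 1.0 if answer == sorted(problem) else 0.0

    def update_difficulty(self, reward):
        # Adapt the problem difficulty to the policy's current ability.
        self.recent.append(reward)
        if len(self.recent) == self.recent.maxlen:
            success = sum(self.recent) / len(self.recent)
            if success > self.target_success + 0.1:
                self.difficulty += 1          # too easy -> harder problems
            elif success < self.target_success - 0.1 and self.difficulty > 1:
                self.difficulty -= 1          # too hard -> easier problems
            self.recent.clear()

def dummy_policy(problem):
    # Stand-in for sampling an answer from the LM; fails more often on harder problems.
    if random.random() < 0.9 ** len(problem):
        return sorted(problem)
    return problem

# Usage in a training loop: generate, verify, then adapt difficulty.
env = AdaptiveSortingEnv()
for step in range(1000):
    problem = env.generate()
    answer = dummy_policy(problem)       # in RLVE this would be the policy LM's output
    reward = env.verify(problem, answer) # reward used for the RL update
    env.update_difficulty(reward)
```

In the paper, the same generate-verify-adapt pattern is applied across 400 manually engineered environments, with the policy LM in place of the dummy solver.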
Why it matters?
This research shows that dynamically adjusting the difficulty of training problems significantly improves a language model's reasoning abilities. The model trained with RLVE performed notably better on reasoning benchmarks (a 3.37% absolute average improvement across six benchmarks) than the same model whose original RL training was simply continued (a 0.49% average gain), even though the continued training used over three times more compute. This suggests that creating adaptable training environments is a key step towards building more intelligent and capable language models.
Abstract
We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.