Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
2025-10-13
Summary
This paper focuses on improving how large language models, like the ones powering chatbots, learn to reason and solve problems. It tackles the issue that current models are good at *imitating* how humans write, but struggle with actual thinking and problem-solving.
What's the problem?
Large language models are typically trained by reading massive amounts of text and learning to predict the next word. While this works well for generating text, it doesn't necessarily teach the model to *reason* effectively. A better approach is reinforcement learning, where the model learns through trial and error and receives rewards for good answers. However, reinforcement learning needs many high-quality, verifiable examples to work well, and creating them by hand is expensive and time-consuming, so existing RL datasets are far smaller than the web-scale text used for pretraining.
What's the solution?
The researchers created a system called Webscale-RL that automatically generates millions of verifiable question-answer pairs from existing text on the internet. This pipeline essentially turns readily available web documents into training data for reinforcement learning. They then used this generated data, called the Webscale-RL dataset, to train a language model with reinforcement learning techniques.
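The "verifiable" part is what makes these pairs usable for reinforcement learning: each generated question keeps a gold answer from its source document, so a simple programmatic check can serve as the reward signal. Below is a minimal illustrative sketch of that idea; the class and function names are hypothetical, not from the paper.

```python
# Illustrative sketch (hypothetical names, not the paper's code):
# a QA pair generated from a web document carries a gold answer,
# so the RL reward can be computed automatically by comparison.

from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str   # gold answer extracted from the source document
    domain: str   # e.g. "history", "science"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.lower().split())


def verifiable_reward(pair: QAPair, model_answer: str) -> float:
    """Return 1.0 if the model's answer matches the gold answer, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(pair.answer) else 0.0


pair = QAPair(
    question="In what year did the Apollo 11 mission land on the Moon?",
    answer="1969",
    domain="history",
)
print(verifiable_reward(pair, " 1969 "))  # → 1.0
print(verifiable_reward(pair, "1970"))    # → 0.0
```

Because the reward is checked automatically rather than labeled by humans, this kind of setup is what lets the pipeline scale to millions of examples.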
Why it matters?
This work is important because it provides a way to scale up reinforcement learning for language models, making it more practical. The model trained on this new dataset outperformed models trained with other methods while using up to 100 times less training data. This means we can potentially build more capable and efficient language models without enormous amounts of hand-labeled data, paving the way for more advanced AI.
Abstract
Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.