
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

2026-02-11


Summary

This paper addresses the challenge of training large language models once the best publicly available text has already been used up, a situation called the 'Data Wall'. It introduces OPUS, a new method for selecting which pieces of text are most helpful at each step of training, leading to better performance with less data.

What's the problem?

As AI models get bigger, they need more and more text to learn effectively, but we are starting to run out of high-quality, publicly available text. Existing methods for choosing training data either apply fixed filters that ignore how training unfolds, or react to the model's raw gradients without accounting for how the optimizer actually turns them into updates. Either way, they aren't always picking the most useful data.

What's the solution?

The researchers developed a system called OPUS that selects training data based on how the model's optimizer – the part that adjusts the model's settings during learning – will actually *use* that data. OPUS estimates how each piece of text would change the model and prioritizes texts whose updates push the model in a helpful direction, judged against a stable, in-distribution reference. To keep this affordable, it compresses the required gradient calculations with sketching techniques and uses randomized (Boltzmann) sampling so the selected data stays diverse, adding only about 4.7% extra compute. A rough code sketch of the scoring idea follows.
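
The sketch below is a minimal, hypothetical illustration of optimizer-aware scoring in the spirit of OPUS, not the authors' implementation: each candidate's gradient is rescaled by an Adam-style preconditioner to approximate its effective update, projected onto a target direction computed from a proxy batch, and candidates are then drawn by Boltzmann sampling. All names and shapes (candidate_grads, proxy_grad, tau, and so on) are assumptions for illustration.

```python
import numpy as np

def opus_like_scores(candidate_grads, proxy_grad, second_moment, eps=1e-8):
    """Utility = projection of each optimizer-induced update onto the proxy direction."""
    precond = 1.0 / (np.sqrt(second_moment) + eps)   # Adam-like preconditioner
    target = proxy_grad * precond                     # target direction in update space
    target = target / (np.linalg.norm(target) + eps)
    updates = candidate_grads * precond               # effective (preconditioned) updates
    return updates @ target                           # projected utility per candidate

def boltzmann_sample(scores, k, tau=1.0, rng=None):
    """Sample k candidates with probability proportional to exp(score / tau) for diversity."""
    if rng is None:
        rng = np.random.default_rng(0)
    logits = (scores - scores.max()) / tau            # stabilize before exponentiating
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Toy usage: 1000 candidates with 4096-dim flattened gradients (illustrative sizes).
rng = np.random.default_rng(1)
grads = rng.standard_normal((1000, 4096))
proxy = rng.standard_normal(4096)
v = np.ones(4096)                                     # stand-in second-moment estimate
chosen = boltzmann_sample(opus_like_scores(grads, proxy, v), k=64, rng=rng)
```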

Why it matters?

This work is important because it shows we can continue to improve AI models even when we've exhausted the best available data. By being smarter about *which* data we use, OPUS allows models to learn more effectively with less information, saving time and resources. It’s particularly useful for specialized areas like science where high-quality data is scarce, and it can even improve performance when combined with existing data filtering methods.

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
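
For intuition about the efficiency trick mentioned in the abstract, here is a hedged, self-contained sketch of CountSketch compression applied to gradient vectors: inner products computed on the sketched vectors approximate the full-dimensional ones, so projected utilities can be scored in a much smaller space. This only illustrates the general sketching idea; the dimensions, seeds, and dense matrix form are assumptions, and the paper's Ghost-based implementation is not reproduced here.

```python
import numpy as np

def countsketch_matrix(d, m, seed=0):
    """Build a CountSketch map R^d -> R^m: one random bucket and sign per input coordinate."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, m, size=d)          # which output row each coordinate hashes to
    signs = rng.choice([-1.0, 1.0], size=d)       # random sign per coordinate
    S = np.zeros((m, d))
    S[buckets, np.arange(d)] = signs
    return S

d, m = 4096, 256                                  # original vs. sketched dimension (illustrative)
S = countsketch_matrix(d, m)
rng = np.random.default_rng(1)
g_candidate = rng.standard_normal(d)              # stand-in for a candidate's gradient
g_target = rng.standard_normal(d)                 # stand-in for the target direction

# CountSketch preserves inner products in expectation, so the projected utility
# can be approximated on 256-dim sketches instead of the full 4096-dim vectors.
approx = (S @ g_candidate) @ (S @ g_target)
exact = g_candidate @ g_target
print(f"exact={exact:.2f}  sketched={approx:.2f}")
```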