Predictive Data Selection: The Data That Predicts Is the Data That Teaches
Kashun Shum, Yuzhen Huang, Hongjian Zou, Ding Qi, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
2025-03-03
Summary
This paper introduces Predictive Data Selection (PreSelect), a new way to choose the best data for training large language models. It's like finding the most nutritious food to help an AI grow smarter, faster.
What's the problem?
Training big AI language models requires a huge amount of data, which takes a lot of time and computing power. Not all data is equally useful for teaching the AI, but it's hard to know in advance which data will be most helpful.
What's the solution?
The researchers created PreSelect, a method that picks out the most valuable data for training. They found that data on which model losses predict how well an AI will perform on downstream tasks is also the best data for teaching the AI. They train a lightweight fastText-based scorer to rate pieces of data by how predictive they are, then train on the highest-scoring data.
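The "score, then keep the top fraction" step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `keyword_score` is a purely hypothetical stand-in for the trained fastText scorer, and the 10% cutoff mirrors the selection ratio mentioned in the summary.

```python
# Sketch of score-and-select data filtering. The real method uses a trained
# fastText classifier as the scorer; keyword_score below is a toy stand-in.

def keyword_score(doc: str) -> float:
    """Toy scorer (hypothetical): fraction of words from an illustrative set."""
    predictive_words = {"theorem", "function", "because", "therefore"}
    words = doc.lower().split()
    return sum(w in predictive_words for w in words) / max(len(words), 1)

def select_top_fraction(docs: list[str], fraction: float = 0.1) -> list[str]:
    """Rank documents by score and keep the top `fraction` (at least one)."""
    ranked = sorted(docs, key=keyword_score, reverse=True)
    k = max(1, int(len(docs) * fraction))
    return ranked[:k]

corpus = [
    "the theorem holds because the function is continuous",
    "click here for amazing deals",
    "therefore the function converges",
    "random chatter about nothing in particular",
]
selected = select_top_fraction(corpus, fraction=0.5)
```

With `fraction=0.5`, the two documents containing the illustrative "predictive" words are kept and the low-scoring ones are dropped; in the paper, the same ranking idea is applied to a pretraining corpus with the fastText scorer.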
Why does it matter?
This matters because it can make training AI models much more efficient. The researchers showed that with PreSelect, a model trained on just 10% of the usual amount of data can outperform one trained on all of it. That could make powerful AI models much cheaper and faster to build, putting advanced AI development within reach of more people and organizations.
Abstract
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance, when the text domain aligns with the downstream benchmark (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce data selection based on data's Predictive strength (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of a vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu, at the scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
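The "compression efficiency (i.e., the normalized loss)" the abstract refers to is a model's loss on a text normalized by the text's size. One common normalization (an assumption here, not specified by this abstract) is bits-per-byte: total negative log-likelihood converted from nats to bits, divided by the UTF-8 byte length. A minimal sketch, with hypothetical per-token losses:

```python
import math

def bits_per_byte(token_nll_nats: list[float], text: str) -> float:
    """Normalized loss of `text`: total NLL (nats) converted to bits,
    divided by the UTF-8 byte length. Lower = better compression."""
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Hypothetical per-token losses: log(2) nats per token = 1 bit per token.
text = "abcdefgh"  # 8 bytes
losses = [math.log(2)] * 8  # pretend the model emitted 8 tokens
score = bits_per_byte(losses, text)  # 8 bits over 8 bytes -> 1.0
```

Because the denominator is byte length rather than token count, models with different tokenizers can be compared on the same text, which is what makes this metric usable as a cross-model signal.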