DataDecide: How to Predict Best Pretraining Data with Small Experiments
Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
2025-04-16
Summary
This paper presents DataDecide, a method for predicting which pretraining data will yield the best large language models by running small, inexpensive experiments instead of training a large model from scratch for every candidate dataset.
What's the problem?
Training large language models on different candidate datasets is expensive in both time and compute, and it is hard to know in advance which data will actually produce the best final models. If researchers choose poorly, they can waste substantial resources on data that contributes little.
What's the solution?
The researchers showed that by training small models and measuring their performance on certain benchmarks, they can accurately predict which datasets will work best for much larger models. Continuous likelihood metrics serve as cheap proxies for downstream accuracy, so good data decisions can be made without fully training giant models.
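The selection procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dataset names and proxy scores are hypothetical, and the score here stands in for a continuous likelihood metric (e.g., the mean per-token log-likelihood a small model assigns to correct benchmark answers).

```python
# Minimal sketch of proxy-based pretraining-data selection.
# Assumption: for each candidate dataset we have already trained a small
# model and recorded a continuous likelihood proxy (higher is better).
# All names and values below are hypothetical, for illustration only.

proxy_scores = {
    "dataset_a": -1.92,  # mean log-likelihood of correct answers
    "dataset_b": -1.45,
    "dataset_c": -1.78,
}

def pick_best_dataset(scores):
    """Return the candidate dataset with the highest proxy score.

    The idea: if small-scale proxy scores rank datasets the same way
    large-scale training would, the argmax is the dataset to use for
    the expensive large-model run.
    """
    return max(scores, key=scores.get)

best = pick_best_dataset(proxy_scores)
print(best)  # dataset_b: the least negative mean log-likelihood
```

In practice the key question is how often this small-scale ranking agrees with the ranking of fully trained large models; the paper's finding is that continuous likelihood proxies make that agreement high.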
Why it matters?
This matters because it saves substantial time and money for anyone building large language models. It also makes AI research more efficient and accessible: researchers can make informed choices about training data without access to massive amounts of computing power.
Abstract
Small-scale experiments accurately predict the best large language models when using continuous likelihood metrics as proxies.