Loss-to-Loss Prediction: Scaling Laws for All Datasets
David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade
2024-11-21

Summary
This paper introduces a method called Loss-to-Loss Prediction, which predicts how well machine learning models will perform on different datasets by modeling the relationships between their training losses.
What's the problem?
Researchers know how to predict the performance of models trained on a single dataset using scaling laws, but it is much less clear how those predictions should change when the training data changes. This makes it difficult to anticipate how well a model will perform on new data, especially when the datasets differ substantially from one another.
What's the solution?
The authors derive a strategy for predicting one model's loss from another model's loss when the two models are trained on different datasets. They find that simple mathematical relationships (shifted power laws) connect the train losses of models trained on separate datasets when the models are matched by training compute. The same approach also predicts how a model trained on one dataset will perform when tested on another, and the predictions remain accurate even at 20x the largest compute budget used to fit the curves.
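To make this concrete, here is a minimal sketch (not the authors' code) of fitting a shifted power law between the train losses of compute-matched models on two datasets. The parameterization L1 = K*(L0 - E0)^kappa + E1 and all numeric values are illustrative assumptions, not numbers from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(loss_0, K, kappa, E0, E1):
    """Shifted power law mapping one loss to another: L1 = K*(L0 - E0)**kappa + E1."""
    return K * (loss_0 - E0) ** kappa + E1

# Hypothetical train losses of compute-matched models trained on datasets A and B
# (loss decreases as compute grows); all numbers are made up for illustration.
loss_a = np.array([3.42, 3.05, 2.78, 2.58, 2.43, 2.31])
loss_b = np.array([3.80, 3.31, 2.96, 2.71, 2.52, 2.38])

# Fit the four parameters; the upper bound on E0 keeps (loss_a - E0) positive
# so the fractional power is well-defined.
params, _ = curve_fit(
    shifted_power_law,
    loss_a,
    loss_b,
    p0=[1.0, 1.0, 1.5, 1.5],
    bounds=([0.0, 0.1, 0.0, 0.0], [10.0, 5.0, loss_a.min() - 1e-3, 5.0]),
)
K, kappa, E0, E1 = params
print(f"K={K:.3f} kappa={kappa:.3f} E0={E0:.3f} E1={E1:.3f}")

# Extrapolate: given a larger model's loss on dataset A, predict its loss on B.
print("predicted loss on B:", shifted_power_law(2.10, *params))
```

Once the four parameters are fit on small-scale runs, the same curve can be queried at losses achieved by much larger models, which is what makes the extrapolation useful.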
Why it matters?
This research is important because it provides a way to better understand and anticipate how machine learning models will behave on different datasets. By improving prediction accuracy, it can help researchers and developers build more effective models and streamline training on new data, ultimately improving the performance of AI systems.
Abstract
While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
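For reference, the three relationships listed in the abstract can all be written in the same shifted power law form; the parameterization below is one plausible rendering of that form, not an equation quoted from the paper.

```latex
% One shifted power law form covering all three cases, where E_0 and E_1
% act as irreducible-loss offsets (an interpretive assumption):
% (1) train-to-train: L_0, L_1 are train losses of two compute-matched models
% (2) train-to-test:  L_0 is a model's train loss, L_1 its downstream test loss
% (3) test-to-test:   L_0, L_1 are test losses of models trained on different data
\[
  L_1 \approx K \left( L_0 - E_0 \right)^{\kappa} + E_1
\]
```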