
Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi

2024-12-06


Summary

This paper introduces task scaling laws and compute-efficient "model ladders" to predict how well pretrained language models will perform on specific tasks, especially in the overtrained setting, where models are trained on far more data than is compute-optimal.

What's the problem?

Current methods for predicting how well language models will do on specific tasks aren't very accurate. Standard power laws describe language-modeling loss well, but that loss doesn't translate directly into task performance, especially for models trained far past the compute-optimal point. This makes it hard for researchers to estimate, before an expensive training run, how design choices will affect a model's performance on the tasks they actually care about.

What's the solution?

The authors propose a two-step approach to make better predictions. First, they use the model's size and the amount of data it was trained on to estimate a task-specific loss (the model's loss on that task's data). Second, they use this task loss to predict the model's accuracy on the task. To fit these two functions, they train a set of small "ladder" models that cost only about 1% of the compute used for the large target models and collect data points from them. With this method, they predict the accuracy of a 7B model trained on 4T tokens and a 13B model trained on 5T tokens to within about 2 points on four multiple-choice tasks.
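To make the two-step idea concrete, here is a minimal Python sketch of the pipeline. It assumes an illustrative power-law form for the task loss and a sigmoidal link from loss to accuracy; the general shape follows the paper's description, but every constant below is a placeholder, not a fitted value from the paper.

```python
import numpy as np

# Step 1 (assumed form): task-specific loss from model size N (parameters)
# and data size D (training tokens), as a power law plus an irreducible term E.
# All constants are illustrative placeholders, not values from the paper.
def task_loss(N, D, A=400.0, alpha=0.3, B=600.0, beta=0.3, E=0.5):
    return A / N**alpha + B / D**beta + E

# Step 2 (assumed form): map task loss to accuracy with a sigmoid, so accuracy
# saturates at a chance-level floor b for high loss and at a ceiling a + b
# for low loss.
def task_accuracy(loss, a=0.7, b=0.25, k=5.0, L0=1.0):
    return a / (1.0 + np.exp(k * (loss - L0))) + b

# Chain the two steps for a target model, e.g. 7B parameters trained on 4T tokens.
N, D = 7e9, 4e12
loss = task_loss(N, D)
print(f"predicted task loss {loss:.3f}, predicted accuracy {task_accuracy(loss):.3f}")
```

The design choice this illustrates is that accuracy is not predicted directly from model size and data: the intermediate task loss carries the size- and data-dependence, while the loss-to-accuracy mapping handles the floor at chance accuracy and the ceiling near perfect accuracy.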

Why it matters?

This research is important because it helps improve our understanding of how language models work and how to make them better. By establishing more reliable ways to predict model performance, researchers can develop more effective AI systems that can handle a wider range of tasks, leading to advancements in natural language processing and other AI applications.

Abstract

We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. We have higher prediction error on four other tasks (average absolute error 6.9) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to deteriorate predictions. Finally, we empirically show that our design choices and the two-step approach lead to superior performance in establishing scaling laws.
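To show how fitting the two parameterized steps from ladder-model data points might look in practice, here is a self-contained sketch using scipy.optimize.curve_fit. The ladder grid, the functional forms, and all numbers are assumptions made for illustration; only the overall pipeline (fit both steps on small models, then chain them to predict a large target model) mirrors the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Step-1 parameterization (assumed form): task loss from size N and tokens D.
def step1(ND, A, alpha, B, beta, E):
    N, D = ND
    return A / N**alpha + B / D**beta + E

# Step-2 parameterization (assumed form): accuracy as a sigmoid of task loss.
def step2(L, a, k, L0, b):
    return a / (1.0 + np.exp(k * (L - L0))) + b

# Hypothetical "ladder": a grid of small model sizes and token budgets.
# Observations are generated from assumed parameters purely to illustrate
# the fitting pipeline; they are not measurements from the paper.
sizes = np.array([190e6, 370e6, 760e6, 1.3e9])
tokens = np.array([4e9, 8e9, 16e9, 32e9])
N, D = np.repeat(sizes, len(tokens)), np.tile(tokens, len(sizes))
L_obs = step1((N, D), 400.0, 0.30, 600.0, 0.30, 0.50)
acc_obs = step2(L_obs, 0.70, 3.0, 2.0, 0.28)

# Fit each step to the ladder points, then chain the fitted functions
# to predict a much larger target model (e.g. 7B parameters, 4T tokens).
p1, _ = curve_fit(step1, (N, D), L_obs, p0=[300, 0.25, 500, 0.25, 0.4], maxfev=50000)
p2, _ = curve_fit(step2, L_obs, acc_obs, p0=[0.6, 2.0, 1.8, 0.25], maxfev=50000)
pred_loss = step1((7e9, 4e12), *p1)
pred_acc = step2(pred_loss, *p2)
print(f"predicted task loss {pred_loss:.3f}, predicted accuracy {pred_acc:.3f}")
```

In the paper, the analogous data points come from actually training the small ladder models (about 1% of the target models' compute) and evaluating them on each task; the sketch above only mirrors the shape of that pipeline with synthetic numbers.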