Predicting Task Performance with Context-aware Scaling Laws
Kyle Montgomery, David Park, Jianhong Tu, Michael Bendersky, Beliz Gunel, Dawn Song, Chenguang Wang
2025-10-17
Summary
This paper investigates how a language model's size, the compute used to train it, and the amount of information it is given as input (its context) jointly relate to its performance on specific downstream tasks.
What's the problem?
Traditionally, we can predict how well a language model will do from its size and the amount of compute used to train it. However, these predictions break down on downstream tasks where the model must use a lot of surrounding information, or 'context', to get the right answer. Existing scaling laws simply don't account for how much context matters for real-world task performance.
What's the solution?
The researchers created a new, interpretable framework that predicts downstream performance as a joint function of the training compute a model receives *and* how much context it is provided. They validated it by fitting it on extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning arithmetic reasoning, common sense reasoning, and machine translation. The framework accurately predicts in-distribution performance, generalizes across orders of magnitude of training compute, and reliably extrapolates how performance changes as the amount of context increases.
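One simple way to picture "performance as a joint function of training compute and context" is an additive power law that is fit to observed task errors. This is only an illustrative sketch: the paper's exact functional form is not given here, and every parameter value and variable name below (`a`, `alpha`, `b`, `beta`, the compute/context grids) is a made-up assumption for demonstration.

```python
# Sketch of fitting a hypothetical joint scaling law:
#   error(C, n) = a * C^(-alpha) + b * n^(-beta)
# where C is training compute (in units of 1e18 FLOPs, assumed) and
# n is context length in tokens. NOT the paper's actual formula.

def predicted_error(C, n, a, alpha, b, beta):
    """Hypothetical downstream task error under the sketched law."""
    return a * C ** (-alpha) + b * n ** (-beta)

# Synthetic "observations" generated from known parameters, standing in
# for measured downstream error at various (compute, context) settings.
true_params = (50.0, 0.3, 20.0, 0.5)  # a, alpha, b, beta (illustrative)
grid = [(C, n) for C in (1.0, 10.0, 100.0) for n in (512, 1024, 2048, 4096)]
obs = [(C, n, predicted_error(C, n, *true_params)) for C, n in grid]

def fit(observations):
    """Recover parameters by coarse grid search over a least-squares loss.
    A real fit would use a proper optimizer; this keeps the sketch
    dependency-free."""
    best, best_loss = None, float("inf")
    for a in (40.0, 50.0, 60.0):
        for alpha in (0.2, 0.3, 0.4):
            for b in (10.0, 20.0, 30.0):
                for beta in (0.4, 0.5, 0.6):
                    loss = sum(
                        (predicted_error(C, n, a, alpha, b, beta) - y) ** 2
                        for C, n, y in observations
                    )
                    if loss < best_loss:
                        best, best_loss = (a, alpha, b, beta), loss
    return best

a, alpha, b, beta = fit(obs)
print(a, alpha, b, beta)  # recovers the generating parameters (noiseless data)
```

Once such a law is fit on observed (compute, context, performance) triples, extrapolating performance at a longer context is just evaluating `predicted_error` at a larger `n`, which mirrors the extrapolation the paper validates empirically.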
Why it matters?
Understanding how training compute and context work together is crucial for building better language models. This research clarifies that relationship, which can help us design more efficient long-context models that handle information-heavy tasks without simply making the model massively larger.
Abstract
Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.