Test-Time Scaling Makes Overtraining Compute-Optimal

Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly Buchanan, Aws Albarghouthi, Frederic Sala

2026-04-06

Summary

This paper investigates how to best train large language models (LLMs) when you consider not just how well they perform, but also how much it *costs* to use them. It proposes new rules for scaling LLMs that balance model size, the amount of data they're trained on, and how many times you need to run them to get a good answer.

What's the problem?

LLMs keep getting bigger and better, but using them takes more and more computing power, especially when you need to run them multiple times to get a reliable result. Existing guidelines for training LLMs, like the 'Chinchilla' scaling laws, only optimize the cost of training and ignore this inference cost. As a result, we're often stuck choosing between a model that's accurate but expensive to run, or one that's cheap to run but not very good, a trade-off existing scaling guidance simply doesn't address.
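To make the trade-off concrete, here is a minimal back-of-the-envelope sketch (not the paper's actual model) using the common rule-of-thumb FLOP approximations: roughly 6·N·D for pretraining a model with N parameters on D tokens, and roughly 2·N FLOPs per generated token at inference. The specific model sizes, token counts, and query volumes below are illustrative numbers, not values from the paper.

```python
# Rule-of-thumb compute accounting (illustrative, not the paper's method):
# pretraining ~ 6*N*D FLOPs; inference ~ 2*N FLOPs per generated token.

def training_flops(n_params, n_tokens):
    """Approximate pretraining compute via the common 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

def inference_flops(n_params, tokens_per_answer, n_samples, n_queries):
    """Approximate total inference compute at ~2*N FLOPs per token,
    accounting for repeated sampling (n_samples answers per query)."""
    return 2 * n_params * tokens_per_answer * n_samples * n_queries

# With heavy inference volume, a smaller model trained on more tokens can
# have a lower end-to-end cost than a larger model at the same quality,
# which is why inference demand shifts the optimal training recipe.
small = training_flops(1e9, 200e9) + inference_flops(1e9, 1000, 8, 1e6)
large = training_flops(10e9, 20e9) + inference_flops(10e9, 1000, 8, 1e6)
print(f"small, heavily trained model total FLOPs: {small:.2e}")
print(f"large, lightly trained model total FLOPs: {large:.2e}")
```

The point of the sketch is only that inference cost scales with model size times number of samples, so once it enters the budget, smaller-but-longer-trained models become more attractive.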

What's the solution?

The researchers developed 'Train-to-Test' (T^2) scaling laws. These laws look at the whole process – both training the model *and* using it – to find the sweet spot. They jointly consider how model size, the amount of training data, and the number of times you sample from the model all affect the final cost and performance. Testing these laws on eight different tasks, they found that to get the best results under a fixed end-to-end budget, you should train models on *far more* data than previously thought, even to the point of 'overtraining' them – that is, training on many more tokens than Chinchilla-style rules would recommend for a model of that size.
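The abstract mentions that T^2 builds on pass@k modeling for test-time scaling. For readers unfamiliar with the metric, here is the standard unbiased pass@k estimator (from Chen et al., 2021, "Evaluating Large Language Models Trained on Code") – a sketch of the metric itself, not of the paper's scaling-law fits:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: the probability that at least one of k
    samples is correct, given n total samples of which c are correct.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 generations of which 1 is correct, `pass_at_k(2, 1, 1)` is 0.5. Because pass@k rises with the number of samples k, while inference cost also rises with k, it is a natural quantity to trade off against model size and training tokens.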

Why it matters?

This work is important because it provides a more realistic way to build and deploy LLMs. As these models become more powerful, cost becomes a major factor. By jointly optimizing training and testing, T^2 scaling laws can help us create LLMs that are both high-performing and affordable, even as they continue to evolve with post-training improvements.

Abstract

Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. T^2 modernizes pretraining scaling laws with pass@k modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from T^2 are robust over distinct modeling approaches: measuring joint scaling effect on the task loss and modeling impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that T^2 scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making T^2 scaling meaningful in modern deployments.