The Art of Scaling Test-Time Compute for Large Language Models

Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty

2025-12-02

Summary

This paper investigates different methods for making large language models (LLMs) work more efficiently during use, specifically by changing how much computing power is used while they're generating answers. It's about finding the best way to balance speed, cost, and accuracy when using these powerful AI systems.

What's the problem?

Currently, there's no clear understanding of which 'test-time scaling' (TTS) techniques work best for LLMs. These techniques dynamically adjust the amount of computation used while the model is generating an answer. The paper points out that no one has systematically compared these methods under identical conditions, and it's unclear how the type of model or the difficulty of the problem affects the results. Basically, we don't know *how* best to use these scaling techniques.

What's the solution?

The researchers conducted a large-scale experiment, generating over thirty billion tokens with eight different open-source LLMs ranging from 7 billion to 235 billion parameters, and tested them on four different reasoning datasets. They found that no single TTS method is always the best. Instead, they categorized models based on how well the quality of their 'thinking process' (called a trace) aligns with the difficulty of the problem and the length of the trace, splitting them into short-horizon and long-horizon types. They also found that, for a given model type, the best achievable TTS performance improves consistently as the compute budget grows. Based on these findings, they created a guide to help people choose the best TTS strategy for a given problem, model, and amount of available computing resources.
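To make the idea of a TTS strategy concrete, here is a minimal sketch of one well-known example: self-consistency (majority voting), where the compute budget is simply the number of sampled answer traces. This is an illustration, not the paper's actual experimental code; the `sample_answer` function is a hypothetical stand-in for a stochastic LLM call, simulated with a toy model that answers correctly 60% of the time.

```python
import random
from collections import Counter

def sample_answer(problem: str, rng: random.Random) -> str:
    """Stand-in for one stochastic LLM generation (a 'trace').

    A real setup would call an LLM with sampling enabled; here we
    simulate a model that is right 60% of the time on a toy problem.
    """
    if rng.random() < 0.6:
        return "42"                      # the correct answer
    return str(rng.randint(0, 9))        # a spurious answer

def majority_vote(problem: str, budget: int, seed: int = 0) -> str:
    """One simple TTS strategy: spend `budget` generations on the
    same problem and return the most common answer."""
    rng = random.Random(seed)
    answers = [sample_answer(problem, rng) for _ in range(budget)]
    return Counter(answers).most_common(1)[0][0]

# A single sample may be wrong; with a larger budget the majority
# answer becomes reliable, illustrating how performance tends to
# improve as the inference-time compute budget grows.
print(majority_vote("toy problem", budget=1))
print(majority_vote("toy problem", budget=32))
```

The key design point is that `budget` is a knob you turn at inference time without retraining the model, which is exactly the kind of trade-off between cost and accuracy the paper's recipe is meant to guide.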

Why it matters?

This research is important because it provides practical advice for using LLMs more effectively. By understanding how different TTS strategies perform under various conditions, users can optimize their models for speed, cost, and accuracy. This is crucial as LLMs become more widespread and are used for increasingly complex tasks, making efficient use of computing resources a key concern.

Abstract

Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy, considering problem difficulty, model type, and compute budget, providing a practical guide to effective inference-time scaling.