What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, Chen Ma

2025-04-01


Summary

This paper surveys 'test-time scaling' (TTS), a family of techniques that improve the problem-solving abilities of large language models (LLMs) by spending more computation at inference time rather than during training.

What's the problem?

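Scaling up pretraining data and parameters is yielding diminishing returns, and LLMs still struggle with complex problems. Researchers have therefore turned to scaling computation at test time instead, but despite many recent efforts, there has been no comprehensive survey that organizes this work systematically.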

What's the solution?

The paper breaks down test-time scaling into four main areas: what to scale, how to scale, where to scale, and how well to scale. It then looks at different methods and applications within each of these areas.
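As a hedged illustration of what spending more computation at test time can look like, here is a minimal sketch of one common strategy: sample several answers and take a majority vote (often called self-consistency). It is not the survey's framework, only one concrete instance of the idea. The sampler is a hypothetical stand-in for an LLM that answers correctly 40% of the time; `sample_answer`, `majority_vote`, and `accuracy` are illustrative names assumed for this example, not taken from the paper.

```python
import random
from collections import Counter


def sample_answer(rng: random.Random, correct: int = 42, p_correct: float = 0.4) -> int:
    """Hypothetical stand-in for one stochastic LLM sample.

    Returns the correct answer with probability p_correct; otherwise returns
    a wrong answer drawn uniformly from a small pool of distractors.
    """
    if rng.random() < p_correct:
        return correct
    return rng.choice([correct + d for d in (-3, -2, -1, 1, 2, 3)])


def majority_vote(answers: list[int]) -> int:
    """Return the most frequent answer (ties go to the first one encountered)."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which majority voting over n_samples is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [sample_answer(rng) for _ in range(n_samples)]
        hits += (majority_vote(answers) == 42)
    return hits / trials


if __name__ == "__main__":
    # More samples per question = more test-time compute.
    for n in (1, 5, 25):
        print(f"N={n:>2}: accuracy ~ {accuracy(n):.2f}")
```

Under these assumptions, accuracy rises as the number of samples N grows, because wrong answers scatter across several distractors while correct answers concentrate on one value. Real models do not behave this cleanly, which is part of why the "how well to scale" question matters.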

Why does it matter?

This work matters because it provides a better understanding of test-time scaling and offers guidance for using it to improve LLMs in various tasks, such as answering questions and solving math problems.

Abstract

As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions.
