Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
2025-02-11
Summary
This paper shows how smaller AI models can outperform much larger ones through Test-Time Scaling (TTS), a method that allocates extra computing power at inference time, while the model is answering questions or solving problems.
What's the problem?
Large language models (LLMs) are powerful but require a lot of resources to train and run. Smaller models are cheaper and faster but often don’t perform as well. Current research hasn’t fully explored how to make smaller models more effective using additional computation at test time, especially for complex tasks.
What's the solution?
The researchers studied how to optimize TTS, a technique that lets AI models spend more computing power during inference (when they generate answers). They found that by adapting the TTS strategy to the policy model, the Process Reward Model (PRM), and the difficulty of the task, smaller models could achieve better results than much larger ones; for example, a 1-billion-parameter model outperformed a 405-billion-parameter model on certain math tasks. They also showed that this approach improves performance while reducing computational cost.
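One common TTS strategy of the kind studied here is Best-of-N sampling: generate several candidate solutions and let a reward model pick the best one. The sketch below illustrates only that selection logic; `generate` and `prm_score` are hypothetical stubs standing in for a real policy model and a real Process Reward Model, not the paper's implementation.

```python
# Minimal Best-of-N selection sketch. In practice, `generate` would sample
# from a small policy LLM and `prm_score` would come from a trained Process
# Reward Model; both are stubbed here so the logic runs on its own.

def generate(question, seed):
    # Stub generator: deterministically cycles through a few fixed
    # candidate answers to mimic sampling diverse solutions.
    answers = [3, 4, 42]
    return {"answer": answers[seed % len(answers)]}

def prm_score(question, candidate):
    # Stub verifier: a real PRM scores each reasoning step; here we
    # simply reward the known correct answer.
    return 1.0 if candidate["answer"] == 42 else 0.0

def best_of_n(question, n):
    """Sample n candidate solutions and keep the highest-scoring one."""
    candidates = [generate(question, seed) for seed in range(n)]
    return max(candidates, key=lambda c: prm_score(question, c))

best = best_of_n("What is 6 * 7?", n=8)
print(best["answer"])
```

Increasing `n` is the "extra test-time compute" knob: a small model gets more tries, and the verifier filters them. The paper's point is that how much to increase it, and which search strategy to use, should depend on the policy model, the PRM, and the problem's difficulty.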
Why it matters?
This matters because it challenges the idea that bigger models are always better. By optimizing how smaller models use computing power at test time, we can create more efficient and cost-effective AI systems. This has implications for making advanced AI accessible on devices with limited resources, like smartphones, and for reducing the environmental impact of running large-scale AI systems.
Abstract
Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while achieving higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.