
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar

2024-08-07

Summary

This paper explores how allocating more computation at test time (inference) can improve the performance of large language models (LLMs) more effectively than simply increasing model size.

What's the problem?

As language models take on more complex questions and tasks, simply making them larger is not always the best solution: scaling up is expensive and often inefficient. Researchers therefore want ways to improve performance without growing the model itself.

What's the solution?

The authors propose "compute-optimal scaling," which allocates test-time computation adaptively per prompt. They analyze two main strategies: searching over candidate solutions against a process-based verifier (a reward model that scores intermediate steps), and having the model iteratively revise its own responses, with the best choice depending on the difficulty of the prompt. Allocating compute this way improves the efficiency of test-time scaling by more than 4x over a best-of-N baseline, and in a FLOPs-matched comparison it lets a smaller model outperform one 14x its size on problems where the smaller model already achieves non-trivial success rates. A sketch of the best-of-N baseline with a verifier appears below.
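
To make the best-of-N baseline concrete, here is a minimal sketch in Python. The functions generate (a base LLM sampler) and verifier_score (a learned verifier/reward model) are hypothetical stand-ins for illustration, not the paper's actual code:

from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n candidate answers and return the one the verifier scores highest.

    `generate` and `verifier_score` are hypothetical stand-ins for a base LLM
    and a learned verifier; this illustrates the baseline, not the paper's code.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [verifier_score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

Compute-optimal scaling improves on this baseline by choosing how much compute, and which search strategy, to spend per prompt rather than fixing n in advance.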

Why it matters?

This research is important because it shifts the focus from simply making models bigger to using computational resources more effectively. Spending inference compute wisely can yield AI systems that perform better on complex tasks at lower cost, and it changes how one should weigh pretraining compute against test-time compute.

Abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
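
As a rough illustration of the compute-optimal strategy the abstract describes, the sketch below spends a fixed generation budget differently depending on estimated prompt difficulty. All helpers here (estimate_difficulty, revise, and the 0.5 threshold) are hypothetical assumptions for illustration, not the paper's exact recipe:

from typing import Callable, List

def compute_optimal_answer(prompt: str,
                           budget: int,
                           generate: Callable[[str], str],
                           revise: Callable[[str, str], str],
                           verifier_score: Callable[[str, str], float],
                           estimate_difficulty: Callable[[str], float]) -> str:
    """Spend a fixed generation budget adaptively per prompt (a sketch).

    Easier prompts: sequentially revise a single answer, since the base model
    is usually close and local edits help. Harder prompts: sample many
    independent candidates in parallel and let the verifier pick the best.
    """
    # Hypothetical difficulty estimate, e.g. 1 minus the pass rate of a few
    # cheap samples; the paper bins prompts by estimated difficulty.
    difficulty = estimate_difficulty(prompt)
    if difficulty < 0.5:
        # Sequential: iteratively revise one chain of answers.
        answer = generate(prompt)
        for _ in range(budget - 1):
            answer = revise(prompt, answer)
        return answer
    # Parallel: best-of-budget search against the verifier.
    candidates: List[str] = [generate(prompt) for _ in range(budget)]
    scores = [verifier_score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

The split reflects the paper's central observation: sequential self-revision tends to help most on easier prompts, while parallel search against a verifier tends to help most on harder ones, so the best allocation of a fixed compute budget varies per prompt.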