A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
2024-12-03

Summary
This paper introduces a two-stage algorithm that improves how large language models (LLMs) solve problems: the model generates several candidate solutions and then picks the best one through a knockout-tournament-style selection process.
What's the problem?
Large language models can generate many possible answers to a problem, but it is hard to determine which answer is best. Traditional selection methods may not narrow down the options effectively, leading to less accurate results. In addition, approaches that spend more compute at test time rarely come with guarantees that the extra compute actually improves the final answer.
What's the solution?
The authors propose a two-stage algorithm: the model first generates several candidate solutions, then runs a multi-round knockout tournament in which candidates are compared in pairs, with each pairwise comparison repeated several times. Only the winners of each round advance to the next, until a single final solution remains. In a minimal implementation, this relies solely on the LLM itself, without any external verifier or reward model. The researchers also prove mathematically that the probability of selecting an incorrect final answer decays exponentially as more candidates are generated and more comparisons are made.
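As a rough illustration of this generate-then-knockout procedure, here is a minimal Python sketch. The `generate` and `compare` callables are hypothetical placeholders for prompts sent to a black-box LLM, not the authors' actual implementation; the surrounding logic simply mirrors the two stages described above.

```python
import random
from typing import Callable, List

def knockout_best_of_n(
    problem: str,
    generate: Callable[[str], str],           # hypothetical: returns one candidate solution
    compare: Callable[[str, str, str], int],  # hypothetical: returns 1 if it prefers the second candidate, else 0
    n_candidates: int = 8,
    k_comparisons: int = 3,
) -> str:
    """Generate N candidates, then select one via a knockout tournament
    with K pairwise comparisons per match."""
    # Stage 1: generate N candidate solutions; these LLM calls are independent
    # and can run in parallel.
    candidates: List[str] = [generate(problem) for _ in range(n_candidates)]

    # Stage 2: pair up the survivors each round; the candidate winning the
    # majority of K comparisons advances, until one candidate remains.
    while len(candidates) > 1:
        random.shuffle(candidates)
        next_round: List[str] = []
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            votes_for_b = sum(compare(problem, a, b) for _ in range(k_comparisons))
            next_round.append(b if votes_for_b > k_comparisons / 2 else a)
        if len(candidates) % 2 == 1:
            next_round.append(candidates[-1])  # odd candidate out gets a bye
        candidates = next_round

    return candidates[0]
```

In this sketch the generation stage costs N LLM calls and the tournament costs roughly (N - 1) x K comparison calls, matching the order of N x (K + 1) calls stated in the abstract.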
Why it matters?
This research is important because it offers a systematic way to improve the decision-making abilities of large language models. By using this knockout tournament approach, models can potentially provide better answers with less computational effort, making them more effective for real-world applications like question answering, content generation, and problem-solving tasks.
Abstract
We propose a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates N candidate solutions, and then chooses the best one via a multiple-round knockout tournament where each pair of candidates is compared K times and only the winners move on to the next round. In a minimalistic implementation, both stages can be executed with a black-box LLM alone and nothing else (e.g., no external verifier or reward model), and a total of $N \times (K + 1)$ highly parallelizable LLM calls are needed for solving an input problem. Assuming that a generated candidate solution is correct with probability $p_{\mathrm{gen}} > 0$ and a comparison between a pair of correct and incorrect solutions identifies the right winner with probability $p_{\mathrm{comp}} > 0.5$ (i.e., better than a random guess), we prove theoretically that the failure probability of the proposed algorithm decays to zero exponentially with respect to N and K: $\Pr(\text{final output is incorrect}) \le (1 - p_{\mathrm{gen}})^N + \lceil \log_2 N \rceil \, e^{-2K(p_{\mathrm{comp}} - 0.5)^2}$. Our empirical results with the challenging MMLU-Pro benchmark validate the technical assumptions, as well as the efficacy of the proposed algorithm and the gains from scaling up its test-time compute.
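To get a feel for how this bound behaves numerically, the following sketch evaluates its right-hand side for a few values of N and K. The parameter values p_gen = 0.2 and p_comp = 0.8 are illustrative assumptions, not numbers reported in the paper.

```python
import math

def failure_bound(n: int, k: int, p_gen: float, p_comp: float) -> float:
    """Right-hand side of the bound:
    (1 - p_gen)^N + ceil(log2 N) * exp(-2 K (p_comp - 0.5)^2)."""
    generation_term = (1 - p_gen) ** n
    comparison_term = math.ceil(math.log2(n)) * math.exp(-2 * k * (p_comp - 0.5) ** 2)
    return generation_term + comparison_term

# Illustrative parameter values (assumed for this example, not taken from the paper).
for n, k in [(4, 8), (16, 32), (64, 128)]:
    print(f"N={n:3d}, K={k:3d} -> bound {failure_bound(n, k, p_gen=0.2, p_comp=0.8):.2e}")
```

For small N and K the right-hand side can exceed 1, in which case the bound is vacuous; the relevant behavior is the exponential decay as both N and K grow.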