
Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi

2025-02-05


Summary

This paper studies sampling-based search, a method for improving AI model outputs at test time by generating multiple candidate responses and selecting the best one. It examines the scaling trends of this method, showing that scaling it up improves AI reasoning, and introduces techniques that make verification more accurate and efficient.

What's the problem?

AI models can generate multiple candidate answers at test time, but choosing the best one is difficult. Current methods verify each answer for correctness, yet this verification often fails on complex reasoning tasks, and models are surprisingly weak at checking their own answers out of the box.

What's the solution?

The researchers found that sampling a larger pool of candidate answers also improves verification accuracy, a phenomenon they call implicit scaling. They identified two principles for better self-verification: comparing responses against each other helps locate errors and hallucinations, and different output styles suit different contexts (chains of thought aid reasoning but are harder to verify). They also introduced a benchmark to measure how well AI models verify their own answers and developed techniques to make this process more reliable.
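The minimalist loop the paper scales up can be sketched as follows. This is an illustrative skeleton, not the authors' implementation: the `generate` and `verify` callables are hypothetical placeholders standing in for a model sampling a response and a model (or rule) scoring it for correctness.

```python
from itertools import cycle
from typing import Callable, List

def sampling_based_search(
    generate: Callable[[], str],
    verify: Callable[[str], float],
    k: int,
) -> str:
    """Minimalist sampling-based search: independently draw k candidate
    responses and return the one the verifier scores highest."""
    candidates: List[str] = [generate() for _ in range(k)]
    scores = [verify(c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]

# Toy stand-ins (hypothetical, for illustration only): the "model" cycles
# through candidate answers to 7 * 8; the "verifier" scores correctness.
answers = cycle(["54", "56", "58"])
verify_arith = lambda ans: 1.0 if ans == "56" else 0.0

print(sampling_based_search(lambda: next(answers), verify_arith, k=4))  # prints 56
```

The paper's observation of implicit scaling means that increasing `k` helps twice over: a larger pool is more likely to contain a correct response, and, in their findings, comparing across a larger pool also makes verification itself more accurate.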

Why it matters?

This research matters because it shows how to make AI models better at selecting correct answers without additional training, simply by spending more compute at test time. By improving both reasoning and verification, these methods can help create smarter and more reliable AI systems for real-world applications.

Abstract

Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.