Efficient Test-Time Scaling via Self-Calibration

Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, Jiaxin Huang

2025-03-04

Summary

This paper introduces Self-Calibration, a new way to make large language models (LLMs) answer questions more accurately and efficiently. It focuses on improving how these models use extra thinking time during the answering process.

What's the problem?

Current methods for improving AI answers by giving the AI more time to think aren't very efficient. They use the same amount of extra thinking for every question, which wastes time on easy questions and might not be enough for hard ones. Also, AI models aren't very good at knowing when they're confident about their answers.

What's the solution?

The researchers created a method called Self-Calibration. It teaches the AI to better judge when it is actually confident about its answers. They then use this improved confidence to decide how much extra thinking time the AI should spend on each question: for easy questions, the AI can stop thinking earlier, while for hard questions it can spend more time. They also adapted existing methods like Best-of-N sampling and Self-Consistency to work with this new confidence signal.
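The early-stopping idea can be sketched in a few lines. This is not the authors' code: `generate_answer` is a hypothetical stand-in for one forward pass of the self-calibrated model, and the stopping threshold is an assumed parameter. The loop simply stops sampling as soon as one answer's confidence is high enough, so easy questions use fewer samples than the full budget.

```python
import random

# Hypothetical stand-in for a single LLM forward pass that returns an
# answer together with a self-calibrated confidence score in [0, 1].
# (A real system would query the fine-tuned model; here we simulate it.)
def generate_answer(question, rng):
    answer = rng.choice(["42", "41", "42", "42"])
    confidence = rng.uniform(0.5, 1.0) if answer == "42" else rng.uniform(0.1, 0.6)
    return answer, confidence

def early_stopping_best_of_n(question, budget=16, threshold=0.9, seed=0):
    """Sample up to `budget` answers, but stop early once some answer's
    confidence clears `threshold`; return the most confident answer seen."""
    rng = random.Random(seed)
    best_answer, best_conf = None, -1.0
    for _ in range(budget):
        answer, conf = generate_answer(question, rng)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if best_conf >= threshold:
            break  # easy question: no need to spend the full budget
    return best_answer, best_conf
```

The key design choice is that the stopping rule only works if the confidence scores are reliable, which is exactly what the Self-Calibration training step is meant to provide.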

Why it matters?

This matters because it makes AI language models both more efficient and more accurate. By spending compute only where it is needed, these models can give better answers without wasting resources. In tests, the method improved accuracy on a math question-answering benchmark from 81.0% to 83.6% with a budget of 16 sampled answers. This could lead to smarter, more reliable AI assistants that handle a wide range of questions more effectively.

Abstract

Increasing test-time computation is a straightforward approach to enhancing the quality of responses in Large Language Models (LLMs). While Best-of-N sampling and Self-Consistency with majority voting are simple and effective, they require a fixed number of sampling responses for each query, regardless of its complexity. This could result in wasted computation for simpler questions and insufficient exploration for more challenging ones. In this work, we argue that model confidence of responses can be used for improving the efficiency of test-time scaling. Unfortunately, LLMs are known to be overconfident and provide unreliable confidence estimation. To address this limitation, we introduce Self-Calibration by distilling Self-Consistency-derived confidence into the model itself. This enables reliable confidence estimation at test time with one forward pass. We then design confidence-based efficient test-time scaling methods to handle queries of various difficulty, such as Early-Stopping for Best-of-N and Self-Consistency with calibrated confidence. Experiments on three LLMs across six datasets demonstrate the effectiveness of our approach. Specifically, applying confidence-based Early Stopping to Best-of-N improves MathQA accuracy from 81.0 to 83.6 with a sample budget of 16 responses, indicating the efficacy of confidence-based sampling strategy at inference time.
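The abstract's "Self-Consistency with calibrated confidence" can be illustrated with a small sketch. This is an illustrative reconstruction, not the paper's implementation: instead of counting plain votes, each sampled answer contributes its calibrated confidence, so a single high-confidence answer can outweigh several low-confidence ones.

```python
from collections import defaultdict

def calibrated_self_consistency(samples):
    """Aggregate (answer, confidence) pairs by summing calibrated
    confidence per distinct answer, rather than raw vote counts,
    and return the answer with the highest total score."""
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence
    return max(scores, key=scores.get)
```

For example, given samples `[("a", 0.9), ("b", 0.3), ("b", 0.3)]`, plain majority voting would pick `"b"` (two votes), while the confidence-weighted version picks `"a"` (score 0.9 vs. 0.6), reflecting the model's stronger belief in that answer.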