A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning

Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, Xiaoxing Ma

2025-10-20

Summary

This paper investigates a way to make large language models, like those behind chatbots, better at complex reasoning by spending more computing power at the moment the model solves a problem (inference), rather than during training.

What's the problem?

Currently, a common technique to improve reasoning is to have the model generate multiple candidate solutions and then choose the best one. While this works well in practice, no one really understands *why* from a theoretical standpoint. The two dominant approaches also have flaws: 'self-consistency' (majority voting over sampled answers) suffers from high estimation error, so its confidence estimates stay noisy unless many samples are drawn, while 'perplexity' (trusting the single answer the model itself assigns the highest probability) carries substantial modeling error, and its estimation accuracy can actually degrade as more samples are added.
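To make the two baseline strategies concrete, here is a minimal sketch (not the paper's code) of majority voting versus picking the lowest-perplexity path. The sample data and function names are illustrative; each sampled reasoning path is reduced to its final answer plus the model's total log-probability for the generated tokens.

```python
from collections import Counter

# Hypothetical sampled reasoning paths: (final answer, total log-probability).
# The numbers are made up for illustration.
samples = [
    ("42", -12.3), ("42", -15.1), ("17", -11.8),
    ("42", -14.0), ("17", -30.5),
]

def self_consistency(samples):
    """Majority vote over final answers, ignoring the model's probabilities."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]

def best_by_perplexity(samples):
    """Pick the single path the model assigns the highest probability
    (i.e. the lowest perplexity), ignoring agreement across paths."""
    return max(samples, key=lambda s: s[1])[0]

print(self_consistency(samples))    # "42": it appears in 3 of 5 paths
print(best_by_perplexity(samples))  # "17": the -11.8 path is most probable
```

Note that the two strategies can disagree, as here: voting favors the most frequent answer, while perplexity favors the single most probable path even if it is an outlier.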

What's the solution?

The researchers developed a hybrid method called RPC, built from two components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency folds the model's own probabilities into the self-consistency vote, which speeds up the convergence of the confidence estimate from linear to exponential without adding modeling error. Reasoning Pruning then discards low-probability reasoning paths, preventing the degradation seen with perplexity alone. Essentially, it's a smarter way to explore multiple solutions.

Why it matters?

This research is important because it provides a theoretical understanding of *why* increasing computing power at test time improves reasoning in large language models. The new RPC method matches the reasoning performance of self-consistency while making confidence estimates more reliable and cutting sampling costs by 50%, meaning it can get equally good answers faster and cheaper. This could lead to more effective and efficient AI systems.

Abstract

Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.