Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs

Deep Mehta

2026-01-13

Summary

This paper investigates whether a common technique for improving the accuracy of large language models, called self-consistency, actually makes their reasoning *better*, or simply produces the right answer more often regardless of whether the reasoning supports it.

What's the problem?

Large language models are getting better at tasks that require reasoning, and self-consistency – where the model generates multiple answers and picks the most frequent one – often improves their scores. However, it wasn't clear if this improvement meant the models were actually reasoning more correctly, or if they were just guessing more reliably. The researchers wanted to figure out if generating more reasoning steps (scaling up inference) leads to more *faithful* reasoning, meaning the reasoning process actually supports the answer.
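The self-consistency mechanism described above can be sketched in a few lines: sample several independent reasoning paths, extract each final answer, and majority-vote. A minimal illustration (the answer strings are hypothetical, not from the paper):

```python
from collections import Counter

def self_consistency_vote(answers):
    """Majority-vote over final answers from independent reasoning paths.

    Returns the winning answer and its vote share; a low share signals
    that the sampled reasoning paths disagree with each other.
    """
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

# Five sampled reasoning paths for one math problem (illustrative outputs):
paths = ["42", "42", "17", "42", "23"]
answer, share = self_consistency_vote(paths)
# → ("42", 0.6)
```

Note that the vote only looks at final answers, which is exactly why accuracy can rise while the reasoning itself stays unfaithful: the winning chain of thought is never inspected.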

What's the solution?

The researchers tested four powerful language models – GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2 – on a set of 100 math problems. They had each model solve the problems multiple times (up to five times) using self-consistency and then carefully analyzed the results. They used statistical methods like confidence intervals and paired comparisons to see if accuracy improved *and* if the reasoning behind the answers became more trustworthy as the number of attempts increased. They also looked at which types of problems each model struggled with.
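One of the statistical tools mentioned, a bootstrap confidence interval over per-problem accuracy, is straightforward to sketch. This is a generic percentile-bootstrap illustration, not the authors' released code; the 78/100 outcome vector below is hypothetical, chosen only to mirror the 78% baseline accuracies reported:

```python
import random

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over 0/1 per-problem outcomes.

    Resamples the problem set with replacement n_boot times and takes
    the alpha/2 and 1-alpha/2 quantiles of the resampled accuracies.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical run: 78 of 100 problems solved.
outcomes = [1] * 78 + [0] * 22
low, high = bootstrap_ci(outcomes)
```

With only 100 problems, such intervals are fairly wide (roughly ±8 points here), which is worth keeping in mind when comparing the per-model accuracy deltas the paper reports.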

Why it matters?

The study found that self-consistency doesn't always help, and can even *hurt* accuracy for some models. For example, Claude Opus 4.5 actually became less accurate with self-consistency, but its reasoning became much more reliable. This means that before using self-consistency in a real-world application, developers need to test it with their specific model and task to make sure it's actually improving the quality of the reasoning, not just the score.

Abstract

Self-consistency has emerged as a popular technique for improving large language model accuracy on reasoning tasks. The approach is straightforward: generate multiple reasoning paths and select the most common answer through majority voting. While this reliably boosts accuracy, it remains unclear whether these gains reflect genuine improvements in reasoning quality. We investigate a fundamental question that has not been studied before: does inference scaling improve reasoning faithfulness? We conduct a comprehensive empirical study across four frontier models (GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2) on 100 GSM8K mathematical reasoning problems. Our analysis employs bootstrap confidence intervals, McNemar's tests for paired comparisons, and Cohen's d effect sizes to quantify the effects rigorously. The results reveal striking differences across models that challenge common assumptions about self-consistency. GPT-5.2 shows the expected pattern: accuracy improves from 78% to 90% at N=5, with faithfulness remaining relatively stable (0.540 to 0.510). Claude Opus 4.5 tells a completely different story. Its accuracy actually drops from 78% to 74.3% while faithfulness jumps dramatically from 0.270 to 0.891 at N=5. DeepSeek-v3.2, already at 98% accuracy, shows ceiling effects with modest faithfulness gains (0.440 to 0.541). Gemini-3-flash improves from 81% to 86% accuracy with a slight faithfulness decrease (0.260 to 0.212). Problem difficulty analysis reveals that GPT-5.2 solves 82% of hard problems while breaking only 13% of easy ones. Claude, in contrast, breaks 23% of easy problems, explaining its accuracy decrease. These findings matter for practitioners: self-consistency is not universally beneficial, and teams should test their specific models before deployment. We release our code and provide practical recommendations for navigating these tradeoffs.
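The McNemar's test the abstract mentions compares paired per-problem outcomes (e.g. each problem's correctness at N=1 versus N=5) using only the discordant pairs. A minimal exact version, with illustrative counts that are not the paper's actual numbers:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) McNemar's test on discordant pairs.

    b = problems right at N=1 but wrong at N=5 ("broken"),
    c = problems wrong at N=1 but right at N=5 ("fixed").
    Under the null, b and c are equally likely, so the two-sided
    p-value is 2 * P(X <= min(b, c)) for X ~ Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical: 3 problems broken, 15 fixed by self-consistency.
p_value = mcnemar_exact(3, 15)
# p_value < 0.05, so this imbalance would be unlikely under the null
```

Concordant pairs (problems that are right, or wrong, at both N values) carry no information about the direction of the change, which is why only b and c appear in the statistic.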