Deep Think with Confidence
Yichao Fu, Xuewei Wang, Yuandong Tian, Jiawei Zhao
2025-08-22
Summary
This paper introduces a new technique called DeepConf that improves how well large language models (LLMs) perform reasoning tasks, making them more accurate and efficient.
What's the problem?
When LLMs try to solve complex problems, a common method (known as self-consistency with majority voting) is to have them generate many different possible solutions and then pick the most common answer. While this 'strength in numbers' approach works, its accuracy gains diminish quickly as you sample more solutions, and generating all those options takes a lot of computing power. Basically, it's like brainstorming: at some point, more ideas don't necessarily mean better ideas, and it takes a long time to sort through everything.
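The paper doesn't include code here, but the 'strength in numbers' baseline it describes can be sketched in a few lines: sample many reasoning traces, extract each final answer, and return the most common one. The function name and sample data below are illustrative, not from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency baseline: pick the most common final answer
    among the answers extracted from many sampled reasoning traces."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical final answers extracted from 8 sampled traces.
samples = ["42", "42", "17", "42", "23", "42", "17", "42"]
print(majority_vote(samples))  # -> "42"
```

Every sampled trace counts equally here, which is why cost grows linearly with the number of samples while accuracy plateaus.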
What's the solution?
DeepConf tackles this by having the LLM assess how confident it is in each step of its reasoning process. It then filters out the parts of the reasoning that the model itself deems unreliable. This means the model doesn't waste time and resources on shaky ideas, focusing instead on the most promising paths. Importantly, this doesn't require any extra training of the model or fiddling with settings; it can be added to existing systems easily.
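To make the filtering idea concrete, here is a minimal sketch. It assumes each trace carries per-token log-probabilities from the model and uses their mean as a stand-in confidence score; the paper's actual confidence signals and filtering rule may differ, and all names and data below are hypothetical.

```python
from collections import Counter

def trace_confidence(token_logprobs):
    """Mean token log-probability as a simple per-trace confidence proxy."""
    return sum(token_logprobs) / len(token_logprobs)

def confidence_filtered_vote(traces, keep_ratio=0.5):
    """Keep the most confident fraction of traces, then majority-vote.

    traces: list of (final_answer, token_logprobs) pairs.
    """
    scored = sorted(traces, key=lambda t: trace_confidence(t[1]), reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_ratio))]
    counts = Counter(answer for answer, _ in kept)
    return counts.most_common(1)[0][0]

# Hypothetical traces: (final answer, per-token log-probabilities).
traces = [
    ("42", [-0.1, -0.2, -0.1]),  # confident
    ("17", [-2.5, -3.0, -2.8]),  # unconfident -> filtered out
    ("42", [-0.3, -0.2, -0.4]),  # confident
    ("23", [-1.9, -2.2, -2.4]),  # unconfident -> filtered out
]
print(confidence_filtered_vote(traces))  # -> "42"
```

Because the scores come from the model's own token probabilities, this needs no extra training, matching the summary's point that the method drops into existing systems.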
Why it matters?
This is important because it allows us to get better results from LLMs without needing more powerful computers or spending more time waiting for answers. The paper shows DeepConf can significantly improve accuracy on difficult reasoning tests and dramatically reduce the amount of text the model generates, making it a practical solution for real-world applications.
Abstract
Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.
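The abstract notes that traces can be filtered "during or after generation". The post-hoc case is ordinary filtered voting; the online case can be sketched as stopping a trace as soon as a sliding window of recent token log-probabilities falls below a threshold, which is where the token savings would come from. The windowing scheme, threshold, and function name below are assumptions for illustration, not the paper's exact rule.

```python
from collections import deque

def decode_with_early_stop(step_logprobs, window=5, threshold=-1.5):
    """Sketch of online filtering: abandon a trace once the mean
    log-probability over the last `window` tokens drops below `threshold`.

    step_logprobs: iterable of per-token log-probabilities from decoding.
    Returns the number of tokens generated before stopping.
    """
    recent = deque(maxlen=window)
    count = 0
    for lp in step_logprobs:
        count += 1
        recent.append(lp)
        if len(recent) == window and sum(recent) / window < threshold:
            return count  # low confidence: cut this trace short
    return count  # trace finished without tripping the threshold

# A trace that starts confident, then degrades, is cut off early;
# a uniformly confident trace runs to completion.
print(decode_with_early_stop([-0.1] * 5 + [-3.0] * 10))  # stops early
print(decode_with_early_stop([-0.1] * 20))               # runs to the end
```

Cutting unpromising traces mid-generation, rather than only discarding them afterwards, is what lets this style of method reduce total generated tokens as well as improve accuracy.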