Confidence Estimation for LLMs in Multi-turn Interactions
Caiqi Zhang, Ruihan Yang, Xiaochen Zhu, Chengzu Li, Tiancheng Hu, Yijiang River Dong, Deqing Yang, Nigel Collier
2026-01-06
Summary
This research paper investigates how well Large Language Models (LLMs) can tell us how confident they are in their answers, specifically when having a back-and-forth conversation with a user.
What's the problem?
Current methods for checking LLM confidence work reasonably well when the model answers a single question, but they haven't been tested in real conversations, where the meaning becomes clearer as more messages are exchanged. It's important for things like agents or assistants to know when they *don't* know something so they don't give wrong or misleading information, and existing confidence measures aren't reliable in these ongoing interactions.
What's the solution?
The researchers created a way to systematically test how LLM confidence changes during a conversation. They developed new ways to measure confidence, focusing on whether the model's confidence matches its actual accuracy (calibration) and whether its confidence increases as more information is given (monotonicity). They also built a method for generating controlled conversation examples to test these properties. They found that most current confidence techniques don't work well in conversations, and they proposed a new method, P(Sufficient), which performs somewhat better but is still far from perfect.
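To make "confidence matches accuracy" concrete, here is a minimal sketch of the standard Expected Calibration Error (ECE): predictions are grouped into confidence bins, and each bin's average confidence is compared with its actual accuracy. This is the generic ECE, not the paper's length-normalized InfoECE variant; the function and toy data are illustrative.

```python
# Sketch of Expected Calibration Error (ECE): bucket predictions by
# confidence, then compare each bucket's average confidence with its
# empirical accuracy. Well-calibrated models have ECE near zero.

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: model confidence in [0, 1] for each answer;
    correct: 1 if that answer was right, else 0."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each prediction to exactly one bin (last bin includes 1.0).
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        # Weight each bin's calibration gap by its share of predictions.
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece

# A perfectly calibrated toy set: 80% confidence, 4 of 5 answers correct.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))  # ~0.0
```

The multi-turn twist studied in the paper is that this gap must stay small at *every* turn of the dialogue, not just for the final answer.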
Why it matters?
This work is important because it provides a foundation for building more trustworthy conversational AI. If we can accurately measure an LLM's confidence, we can create systems that are more reliable, admit when they're unsure, and are ultimately more helpful and safer for users.
Abstract
While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research predominantly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
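The monotonicity desideratum from the abstract, that confidence should not drop as the conversation supplies more information, can be illustrated by counting decreases in a per-turn confidence trace. This is a generic sketch of the idea; the function name and the paper's actual monotonicity metric are assumptions here.

```python
# Sketch: fraction of adjacent turn pairs where confidence decreased,
# even though each new turn only added clarifying information.
# A monotone (well-behaved) trace scores 0.0.

def monotonicity_violation_rate(confidence_per_turn):
    pairs = list(zip(confidence_per_turn, confidence_per_turn[1:]))
    if not pairs:
        return 0.0  # fewer than two turns: nothing to violate
    violations = sum(1 for prev, curr in pairs if curr < prev)
    return violations / len(pairs)

# A well-behaved trace rises as ambiguity is resolved...
print(monotonicity_violation_rate([0.3, 0.5, 0.7, 0.9]))  # 0.0
# ...while a trace that dips mid-conversation violates monotonicity
# on 1 of its 3 turn transitions.
print(monotonicity_violation_rate([0.3, 0.6, 0.4, 0.9]))
```

The paper's finding, in these terms, is that popular confidence techniques produce traces with both large per-turn calibration gaps and frequent dips like the second trace above.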