The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents
Weihao Xuan, Qingcheng Zeng, Heli Qi, Yunze Xiao, Junjue Wang, Naoto Yokoya
2026-01-14
Summary
This paper investigates how well AI agents, powered by large language models and using tools like web search and code interpreters, can accurately express how confident they are in their answers. It's about making these agents 'trustworthy' by ensuring their confidence levels match their actual performance.
What's the problem?
When AI agents use tools, their confidence can be way off. Web search, for example, often returns noisy or unreliable information, leading the agent to be *overconfident* in incorrect answers. By contrast, tools like code interpreters, which give definitive right-or-wrong feedback, help keep the agent's confidence in line with its actual accuracy. The core issue is that existing AI models aren't consistently well-calibrated when they're actively using tools to solve problems, and we don't fully understand why.
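To make "way off" concrete, calibration is usually quantified by comparing the agent's stated confidence with how often it is actually right. The sketch below computes a standard expected calibration error (ECE) over binned verbalized confidences; it is an illustrative assumption, not necessarily the paper's exact metric.

```python
# Minimal sketch of measuring calibration, assuming each answer comes with a
# verbalized confidence in [0, 1] and a correctness label. Standard ECE is
# shown here; the paper may use a different or additional calibration measure.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between stated confidence and observed accuracy, per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

# Toy example of the overconfident web-search case: the agent states 0.9
# confidence but is correct only 60% of the time, giving a large gap.
conf = [0.9] * 10
acc = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(expected_calibration_error(conf, acc))  # ~0.3
```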
What's the solution?
The researchers developed a way to 'train' these AI agents using reinforcement learning. This training process doesn't just focus on getting the right answer, but *also* on making sure the agent's stated confidence accurately reflects how likely it is to be correct. They tested different reward designs during training and found that this approach significantly improved calibration, meaning the agents were better at knowing when they were unsure. The training also generalized well: agents stayed calibrated when moved from clean local training environments to noisy real-world web search, and across different domains such as math problems.
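One natural way to realize such a reward is to combine a correctness term with a Brier-style penalty on the stated confidence. The paper benchmarks several reward designs; the function below and its `calib_weight` parameter are only an assumed illustration of the general idea.

```python
# Illustrative sketch of a joint accuracy + calibration reward for RL
# fine-tuning. This Brier-style combination and the `calib_weight` parameter
# are assumptions for illustration, not the paper's exact reward design.
def joint_reward(answer_correct: bool, stated_confidence: float,
                 calib_weight: float = 0.5) -> float:
    """Reward = task correctness minus a penalty for miscalibrated confidence."""
    accuracy_term = 1.0 if answer_correct else 0.0
    # Brier-style penalty: squared gap between stated confidence and outcome.
    calibration_penalty = (stated_confidence - accuracy_term) ** 2
    return accuracy_term - calib_weight * calibration_penalty

# A confidently wrong answer is penalized more than a cautious wrong answer,
# which is the behavior that discourages tool-induced overconfidence.
print(joint_reward(False, 0.95))  # about -0.45
print(joint_reward(False, 0.30))  # about -0.05
print(joint_reward(True, 0.90))   # about  1.00
```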
Why it matters?
This work is important because as AI agents become more common in real-world applications – like giving advice or making decisions – we need to be able to trust them. Knowing how confident an agent is in its answer is crucial for responsible use. This research provides a foundation for building AI agents that are not only smart but also 'self-aware' enough to communicate their uncertainty, which is vital for high-stakes situations.
Abstract
Autonomous agents based on large language models (LLMs) are rapidly evolving to handle multi-turn tasks, but ensuring their trustworthiness remains a critical challenge. A fundamental pillar of this trustworthiness is calibration, which refers to an agent's ability to express confidence that reliably reflects its actual performance. While calibration is well-established for static models, its dynamics in tool-integrated agentic workflows remain underexplored. In this work, we systematically investigate verbalized calibration in tool-use agents, revealing a fundamental confidence dichotomy driven by tool type. Specifically, our pilot study identifies that evidence tools (e.g., web search) systematically induce severe overconfidence due to inherent noise in retrieved information, while verification tools (e.g., code interpreters) can ground reasoning through deterministic feedback and mitigate miscalibration. To robustly improve calibration across tool types, we propose a reinforcement learning (RL) fine-tuning framework that jointly optimizes task accuracy and calibration, supported by a holistic benchmark of reward designs. We demonstrate that our trained agents not only achieve superior calibration but also exhibit robust generalization from local training environments to noisy web settings and to distinct domains such as mathematical reasoning. Our results highlight the necessity of domain-specific calibration strategies for tool-use agents. More broadly, this work establishes a foundation for building self-aware agents that can reliably communicate uncertainty in high-stakes, real-world deployments.