
Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models

Weihao Xuan, Qingcheng Zeng, Heli Qi, Junjue Wang, Naoto Yokoya

2025-05-27


Summary

This paper talks about how vision-language models, AI systems that understand both pictures and words, express confidence in their answers, and whether that stated confidence matches how accurate those answers really are.

What's the problem?

The problem is that these models often say they are very sure about their answers even when they're wrong, or they might not sound confident when they're actually right. This mismatch between stated confidence and actual accuracy, called miscalibration, makes it hard for people to trust the model's answers, especially in high-stakes situations.
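To make "confidence matching accuracy" concrete, here is a minimal sketch of expected calibration error (ECE), a standard way to measure miscalibration. The metric choice, bin count, and toy numbers below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the average gap between stated confidence and actual accuracy,
    weighted by how many answers fall in each confidence bin.
    Lower is better; 0 means perfectly calibrated."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()  # what the model claimed
        accuracy = correct[mask].mean()      # how often it was right
        ece += mask.mean() * abs(avg_conf - accuracy)
    return ece

# Hypothetical example: a model that says "90% sure" every time
# but is only right 3 times out of 5.
conf = [0.9, 0.9, 0.9, 0.9, 0.9]
right = [1, 1, 1, 0, 0]
print(expected_calibration_error(conf, right))  # ~0.3, a large gap -> miscalibrated
```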

What's the solution?

The authors studied how these models verbalize their confidence and found that stated confidence levels often don't match actual performance. They suggest that modality-specific reasoning (reasoning tailored to each type of input, such as images or text) and a new technique called Visual Confidence-Aware Prompting can help the models give more reliable, better-calibrated answers.
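This summary doesn't give the exact wording of Visual Confidence-Aware Prompting, so the sketch below is only a hypothetical illustration of the general idea: asking the model to ground its stated confidence in the visual evidence it can actually see. The prompt text and example question are invented for illustration, not the paper's template.

```python
# Hypothetical confidence-elicitation prompt for a vision-language model.
# The exact prompt used in the paper is not shown in this summary.
QUESTION = "How many people are in this image?"  # example question

prompt = (
    f"Question: {QUESTION}\n"
    "First, describe the visual evidence in the image that is relevant "
    "to the question.\n"
    "Then answer the question.\n"
    "Finally, state your confidence as a percentage (0-100%), taking into "
    "account how clearly the image supports your answer.\n"
)
```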

Why it matters?

This is important because it helps make AI systems more trustworthy and dependable, especially when people need to know how much they can rely on the model's answers in real-world tasks.

Abstract

The study evaluates verbalized confidence in vision-language models, revealing miscalibration across tasks and suggesting that modality-specific reasoning and Visual Confidence-Aware Prompting can improve reliability.