
EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration

Daiqing Wu, Dongbao Yang, Can Ma, Yu Zhou

2025-12-19


Summary

This paper focuses on Visual Emotion Comprehension: teaching computers to understand the emotions expressed in images, much as people do. The authors improve existing models for this task by adding a way for the model to state *how confident* it is in its emotion prediction.

What's the problem?

Current computer models for understanding emotions in images usually give one single answer, such as labeling an image 'happiness'. However, emotion perception is often subjective: different people can see different emotions in the same picture. These models neither acknowledge this uncertainty nor indicate *how sure* they are about their answer, which makes their predictions unreliable in ambiguous cases.

What's the solution?

The researchers developed a new system called EmoCaliber that builds on powerful existing models. They trained it in three stages: first to reason through the image in a structured way, then to explicitly *say* how confident it is in its emotion prediction, and finally to calibrate that confidence so it matches how often the model is actually correct. As a result, the model not only predicts an emotion but also expresses how certain it is about that prediction.
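The third stage checks whether a verbalized confidence matches actual accuracy. A standard way to measure this is Expected Calibration Error (ECE): bin predictions by stated confidence and compare each bin's average confidence to its accuracy. The sketch below is illustrative only; the paper's exact calibration metric and data are not specified in this summary, and the sample outputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then average |accuracy - confidence|
    over the bins, weighted by bin size. Lower is better calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# Hypothetical model outputs: (verbalized confidence, prediction correct?)
outputs = [(0.9, True), (0.8, True), (0.7, False), (0.95, True),
           (0.6, False), (0.85, True), (0.5, False), (0.75, True)]
confs = [c for c, _ in outputs]
hits = [h for _, h in outputs]
print(f"ECE = {expected_calibration_error(confs, hits):.4f}")
```

A perfectly calibrated model (e.g. 80% of its "80% confident" predictions are correct) would score an ECE of 0; the calibration stage trains the model's stated confidences toward that target.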

Why it matters?

This work is important because it makes emotion-detecting computers more trustworthy. By providing a confidence score, users can better understand if the model's prediction is likely to be correct or if there are other possible interpretations. This is a step towards creating more reliable and helpful AI systems that can understand and respond to human emotions.

Abstract

Visual Emotion Comprehension (VEC) aims to infer sentiment polarities or emotion categories from affective cues embedded in images. In recent years, Multimodal Large Language Models (MLLMs) have established a popular paradigm in VEC, leveraging their generalizability to unify VEC tasks defined under diverse emotion taxonomies. While this paradigm achieves notable success, it typically formulates VEC as a deterministic task, requiring the model to output a single, definitive emotion label for each image. Such a formulation insufficiently accounts for the inherent subjectivity of emotion perception, overlooking alternative interpretations that may be equally plausible to different viewers. To address this limitation, we propose equipping MLLMs with capabilities to verbalize their confidence in emotion predictions. This additional signal provides users with an estimate of both the plausibility of alternative interpretations and the MLLMs' self-assessed competence, thereby enhancing reliability in practice. Building on this insight, we introduce a three-stage training framework that progressively endows the model with structured reasoning, teaches it to verbalize confidence, and calibrates its confidence expression, culminating in EmoCaliber, a confidence-aware MLLM for VEC. Through fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber demonstrates overall superiority against existing methods in both emotion prediction and confidence estimation. These results validate the effectiveness of our approach and mark a feasible step toward more reliable VEC systems. Project page: https://github.com/wdqqdw/EmoCaliber.