On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
Hoigi Seo, Dong Un Kang, Hyunjin Cho, Joohoon Lee, Se Young Chun
2025-10-14
Summary
This paper investigates why large vision-language models, which are good at understanding both images and text, sometimes 'hallucinate' – meaning they describe objects that aren't actually present in the image. The researchers found a key reason for this and developed a way to reduce these false descriptions.
What's the problem?
Large vision-language models are powerful, but they often make up details that aren't in the original image. This 'object hallucination' is a big problem because it means the model isn't reliably understanding what it 'sees'. The researchers discovered that this happens because the part of the model that processes the image, called the vision encoder, sometimes isn't confident about what it's detecting. Specifically, some parts of the image processing create uncertain 'visual tokens' that lead to these incorrect descriptions.
What's the solution?
The researchers tackled this problem by modifying only the vision encoder, leaving the language model untouched. They identify the uncertain visual tokens – the ones likely to cause hallucinations – by applying a small adversarial perturbation to the image and measuring how much each token's early-layer representation shifts; tokens whose representations shift a lot are flagged as uncertain. They then mask these uncertain tokens during the self-attention step in the middle layers of the vision encoder, reducing their influence on the encoded image and preventing the model from 'imagining' things that aren't there. All of this happens inside the vision encoder itself, before the visual information is passed to the language model.
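The identification step can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the encoder interface (`encode_fn`), the perturbation size `eps`, and the fraction of tokens flagged (`top_frac`) are all assumptions, and for simplicity random noise stands in for the adversarial perturbation the paper actually uses.

```python
import numpy as np

def uncertain_token_mask(encode_fn, image, eps=1e-3, top_frac=0.1, seed=0):
    """Flag visual tokens whose early-layer representations shift most
    under a small input perturbation (a proxy for epistemic uncertainty).

    encode_fn(image) -> (num_tokens, dim) features from an early VE layer.
    NOTE: random noise is used here as a stand-in; the paper's method uses
    an adversarial perturbation, which finds larger deviations per unit eps.
    """
    rng = np.random.default_rng(seed)
    feats = encode_fn(image)                              # clean token features
    noisy = image + eps * rng.standard_normal(image.shape)  # perturbed input
    feats_p = encode_fn(noisy)                            # perturbed features
    # Per-token representation deviation; large deviation ~ high uncertainty.
    deviation = np.linalg.norm(feats - feats_p, axis=-1)
    k = max(1, int(top_frac * deviation.size))            # how many to flag
    thresh = np.sort(deviation)[-k]                       # k-th largest value
    return deviation >= thresh  # boolean mask over visual tokens
```

The key design choice is that uncertainty is measured per token, not per image, so only the unstable parts of the representation are suppressed later while confident tokens pass through unchanged.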
Why it matters?
This work is important because it provides a targeted solution to a significant problem in vision-language models. By addressing the uncertainty within the image processing component, the researchers were able to substantially reduce hallucinations, making these models more reliable and trustworthy. This improvement can lead to better performance in applications like image captioning, visual question answering, and other tasks where accurate understanding of images is crucial.
Abstract
Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, crucial challenges remain in LVLMs, such as object hallucination, i.e., generating descriptions of objects that are not present in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis found positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can work synergistically with prior methods.
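The masking step described in the abstract amounts to excluding the flagged tokens as attention keys, so no other token attends to them. Below is a minimal sketch of that idea for a single attention head; the array names, single-head formulation, and the assumption that at least one token stays unmasked per row are all simplifications not taken from the paper.

```python
import numpy as np

def masked_self_attention(Q, K, V, uncertain, scale=None):
    """Scaled dot-product self-attention in which tokens flagged as
    `uncertain` (boolean mask over positions) are removed as keys/values,
    so their content cannot influence any other token's output.

    Assumes at least one token per row remains unmasked.
    """
    d = Q.shape[-1]
    scale = scale if scale is not None else 1.0 / np.sqrt(d)
    scores = Q @ K.T * scale                  # (n, n) attention logits
    scores[:, uncertain] = -np.inf            # suppress uncertain key tokens
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                  # exp(-inf) -> 0 for masked keys
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # convex combo of unmasked tokens
```

Because the mask is applied inside the encoder's middle layers rather than at its output, downstream layers re-aggregate information only from the confident tokens, which is what suppresses the hallucination-prone content before it ever reaches the language model.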