Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
Ailin Deng, Tri Cao, Zhirui Chen, Bryan Hooi
2025-03-11
Summary
This paper studies AI systems that process both images and text, finding that they tend to trust the written words even when the picture shows something different, such as believing a label that says 'broccoli' when the photo clearly shows peppers.
What's the problem?
These AI models often ignore visual evidence if the text disagrees, leading to mistakes like misidentifying objects or falling for misleading descriptions, which could be dangerous in real-world applications like self-driving cars.
What's the solution?
The researchers fine-tune the models on extra examples where the text and image contradict each other (supervised fine-tuning with text augmentation), and they argue that training data should balance pure-text and multi-modal examples so that neither modality dominates.
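A minimal sketch of how such contradiction-aware training examples might be constructed. The data format and field names here are illustrative assumptions, not the paper's actual pipeline:

```python
import random

def augment_with_text(example, labels, rng=random.Random(0)):
    """Create training variants pairing one image with matching,
    contradicting, and absent text, so the model learns to ground
    its answer in the image. Hypothetical data format."""
    image, true_label = example["image"], example["label"]
    wrong = rng.choice([l for l in labels if l != true_label])
    return [
        # Text agrees with the image: either modality suffices.
        {"image": image, "text": f"The label reads '{true_label}'.", "answer": true_label},
        # Text contradicts the image: model must trust visual evidence.
        {"image": image, "text": f"The label reads '{wrong}'.", "answer": true_label},
        # No text at all: pure visual grounding.
        {"image": image, "text": "", "answer": true_label},
    ]

variants = augment_with_text({"image": "img_001.jpg", "label": "peppers"},
                             labels=["peppers", "broccoli", "carrots"])
```

The key design point is that the target answer always follows the image, so gradient updates penalize the model for copying corrupted text.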
Why does it matter?
Fixing this helps make AI safer and more reliable for tasks where accurate understanding of both images and text matters, like medical diagnosis or autonomous vehicles, reducing risks from incorrect decisions.
Abstract
Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others like token order can exacerbate it due to positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance of pure text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance their robustness and reliability in handling multi-modal data inconsistencies.
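The "blind faith in text" measurement described in the abstract can be sketched as a simple bias metric: among examples where the text contradicts the image, count how often the model's prediction follows the text. The record format below is an illustrative assumption, not the paper's exact protocol:

```python
def text_preference_rate(records):
    """Among examples where text and image disagree, return the
    fraction of predictions that follow the text rather than the
    image. Field names ('pred', 'image_label', 'text_label') are
    hypothetical."""
    conflicting = [r for r in records if r["image_label"] != r["text_label"]]
    if not conflicting:
        return 0.0
    follows_text = sum(r["pred"] == r["text_label"] for r in conflicting)
    return follows_text / len(conflicting)

# Toy evaluation records: two conflicting cases, one consistent case.
records = [
    {"pred": "broccoli", "image_label": "peppers", "text_label": "broccoli"},
    {"pred": "peppers",  "image_label": "peppers", "text_label": "broccoli"},
    {"pred": "carrots",  "image_label": "carrots", "text_label": "carrots"},
]
rate = text_preference_rate(records)  # 1 of 2 conflicting cases follows text
```

A rate near 1.0 on corrupted-text inputs would indicate the text bias the paper reports; a robust model should stay near 0.0.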