The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
2024-10-17

Summary
This paper discusses the issue of hallucinations in large multimodal models (LMMs), which are AI systems that can process and understand information from different types of data, like text, images, and audio.
What's the problem?
Despite advancements in LMMs, they often generate outputs that do not accurately reflect the input data, a failure known as hallucination. This means the AI might describe objects, events, or sounds that were never actually present in what it saw or heard, which limits its reliability in real-world applications.
What's the solution?
The authors conducted a systematic investigation into the causes of hallucinations in LMMs, covering the three most common modalities: language, visual, and audio. They identified two key causes: overreliance on unimodal priors (defaulting to what a single modality or the language model alone suggests) and spurious inter-modality correlations (linking signals that merely tend to co-occur in training data). To study these issues, they introduced a new benchmark called The Curse of Multi-Modalities (CMM), which probes models with targeted questions to evaluate and analyze these hallucinations more effectively; a scoring sketch follows below.
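To make the evaluation idea concrete, here is a minimal, hypothetical scoring sketch, not the authors' released code: it assumes the benchmark asks yes/no probes about whether an object or event is actually present in the multimodal input, with "negative" probes targeting non-existent items that unimodal priors or spurious cross-modal correlations might tempt the model to confirm. The names `Probe`, `score_probes`, and `model_answer` are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Probe:
    question: str     # e.g. "Is there a dog barking in the audio?"
    is_present: bool  # ground truth: does the queried object/event exist in the input?

def score_probes(model_answer: Callable[[str], str], probes: List[Probe]) -> dict:
    """Ask each yes/no probe and aggregate two rates:
    - perception accuracy: fraction of truly present items the model confirms
    - hallucination rate: fraction of absent items the model claims exist
    """
    hits, positives = 0, 0
    hallucinations, negatives = 0, 0
    for probe in probes:
        says_yes = model_answer(probe.question).strip().lower().startswith("yes")
        if probe.is_present:
            positives += 1
            hits += says_yes
        else:
            negatives += 1
            hallucinations += says_yes
    return {
        "perception_accuracy": hits / max(positives, 1),
        "hallucination_rate": hallucinations / max(negatives, 1),
    }

# Usage: wrap any LMM so it maps a question string to a text answer.
# A model that always answers "Yes" hallucinates on every negative probe:
always_yes = lambda q: "Yes."
print(score_probes(always_yes, [
    Probe("Is there a dog in the video?", True),
    Probe("Is there a siren in the audio?", False),
]))
```

The point of separating the two rates is that a model can look accurate on present items while still confirming nearly everything it is asked about; only the negative probes expose that bias.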
Why it matters?
This research is important because it highlights the need for better methods to reduce hallucinations in multimodal AI systems. By understanding and addressing these issues, we can improve the accuracy and reliability of AI applications across various fields, such as healthcare, autonomous vehicles, and content creation.
Abstract
Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.