
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng

2024-06-25


Summary

This paper introduces VideoHallucer, a new benchmark designed to evaluate how prone large video-language models (LVLMs) are to hallucinations: instances where these models generate irrelevant or nonsensical content that doesn't match the actual video.

What's the problem?

As large language models have become more capable, they have been extended to video understanding. However, they often produce 'hallucinations': false or unrelated information that doesn't fit the video's context. This is a significant issue because it can lead to misunderstandings and incorrect interpretations of video content.

What's the solution?

The authors developed VideoHallucer to systematically assess these hallucinations in LVLMs. They categorized hallucinations into two main types: intrinsic (where the generated content contradicts the video) and extrinsic (where the content cannot be verified by the video). They used a method called adversarial binary VideoQA, which pairs two yes/no questions per item: a basic question grounded in the actual video and an adversarial question built around hallucinated content. Answering both correctly is what shows a model can tell the two apart. Their evaluation of eleven LVLMs showed that most models struggle with hallucinations, particularly when it comes to recognizing extrinsic factual errors.
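To make the pairing concrete, below is a minimal sketch of how such adversarial paired scoring could be computed. The `answer_yes_no` callable, the field names, and the rule that a pair only counts when both questions are answered correctly are illustrative assumptions, not the authors' released evaluation code.

```python
# Hypothetical sketch of adversarial binary VideoQA scoring.
# Assumption: each item pairs a "basic" question (ground-truth "yes")
# with a "hallucinated" question (ground-truth "no") about the same video.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAPair:
    video_path: str
    basic_question: str         # grounded in the video; correct answer is "yes"
    hallucinated_question: str  # describes content absent from or contradicting the video; correct answer is "no"


def evaluate_pairs(pairs: List[QAPair],
                   answer_yes_no: Callable[[str, str], str]) -> Dict[str, float]:
    """answer_yes_no(video_path, question) -> "yes" or "no", produced by the model under test."""
    basic_correct = halluc_correct = pair_correct = 0
    for p in pairs:
        basic_ok = answer_yes_no(p.video_path, p.basic_question) == "yes"
        halluc_ok = answer_yes_no(p.video_path, p.hallucinated_question) == "no"
        basic_correct += basic_ok
        halluc_correct += halluc_ok
        pair_correct += basic_ok and halluc_ok  # a pair counts only if both answers are right

    n = len(pairs)
    return {
        "basic_acc": basic_correct / n,          # how well the model recognizes facts
        "hallucinated_acc": halluc_correct / n,  # how well it rejects hallucinated content
        "pair_acc": pair_correct / n,            # strict, adversarial score
    }
```

Splitting the scores this way would surface the asymmetry the authors report: models that do well on the basic, fact-checking questions can still fail to reject the hallucinated ones.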

Why it matters?

This research is important because it highlights the challenges faced by current video-language models in accurately interpreting video content. By providing a comprehensive framework for detecting hallucinations, VideoHallucer aims to improve the reliability of these models, which is crucial for applications like video analysis and automated content generation. Enhancing model performance in this area can lead to better understanding and interaction with multimedia content.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have extended their capabilities to video understanding. Yet, these models are often plagued by "hallucinations", where irrelevant or nonsensical content is generated, deviating from the actual video context. This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis, including object-relation, temporal, semantic detail, extrinsic factual, and extrinsic non-factual hallucinations. We adopt an adversarial binary VideoQA method for comprehensive evaluation, where pairs of basic and hallucinated questions are crafted strategically. By evaluating eleven LVLMs on VideoHallucer, we reveal that i) the majority of current models exhibit significant issues with hallucinations; ii) while scaling datasets and parameters improves models' ability to detect basic visual cues and counterfactuals, it provides limited benefit for detecting extrinsic factual hallucinations; iii) existing models are more adept at detecting facts than identifying hallucinations. As a byproduct, these analyses further instruct the development of our self-PEP framework, achieving an average of 5.38% improvement in hallucination resistance across all model architectures.
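For readers who want to bucket results by category, here is an illustrative encoding of the taxonomy described in the abstract. The grouping of the first three subcategories under "intrinsic" and the last two under "extrinsic" follows the paper's naming; the dictionary structure and comments are assumptions for illustration only.

```python
# Illustrative (unofficial) encoding of VideoHallucer's hallucination taxonomy.
HALLUCINATION_TAXONOMY = {
    "intrinsic": [            # generated content contradicts what the video actually shows
        "object-relation",
        "temporal",
        "semantic detail",
    ],
    "extrinsic": [            # generated content cannot be verified from the video
        "extrinsic factual",      # assumed: consistent with general world knowledge, just not in the video
        "extrinsic non-factual",  # assumed: neither verifiable from the video nor factual
    ],
}
```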