
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu

2025-09-03


Summary

This paper investigates a specific type of error in video-understanding AI models, called 'hallucination,' where the AI generates information that is not actually present in or supported by the video. It focuses on long videos and on a newly identified kind of hallucination that arises when the AI misinterprets how events connect over time.

What's the problem?

Current AI models that analyze videos sometimes make things up, describing things that aren't happening or aren't logically consistent with the video. Existing tests mostly look at short videos, and while they've identified some causes, such as the AI relying too much on its pre-existing language knowledge or being misled by missing frames, they don't fully explain why these errors happen in longer, more complex videos. Specifically, the paper identifies 'Semantic Aggregation Hallucination' (SAH): the AI correctly understands individual moments in a video but combines them incorrectly when describing the overall events. For example, a model may describe each scene accurately on its own, yet attach an action from one event to a different event or report the events in the wrong order. This is a bigger problem in long videos because there is more to combine and therefore more opportunity for error.

What's the solution?

The researchers created a new benchmark called ELV-Halluc, the first designed specifically for long-video hallucination, allowing them to study SAH in detail. They found that SAH increases as videos become more semantically complex and when events change rapidly. They then experimented with ways to fix this, finding that improving how the AI encodes the order of events (its positional encoding strategy) and training it to better distinguish semantics within and across events (using a technique called Direct Preference Optimization, or DPO) helped reduce these errors. To support the DPO training, they curated a dataset of 8,000 adversarial example pairs, which led to improvements on both ELV-Halluc and Video-MME, including a 27.7% reduction in the SAH ratio.
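As a rough illustration of the DPO part, the sketch below shows a standard Direct Preference Optimization loss applied to a pair of captions for the same video, one faithful and one hallucinated. The function name, argument names, and beta value are illustrative assumptions, not the paper's actual implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over summed log-probabilities of a 'chosen'
    (faithful) caption and a 'rejected' (hallucinated) caption, computed
    under the trainable policy model and a frozen reference model."""
    # How far the policy has moved from the reference on each caption
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Encourage the policy to prefer the faithful caption over the hallucinated one
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

In this framing, each of the paper's 8K adversarial pairs would supply the 'chosen' and 'rejected' captions, for example a correct event description versus one with details swapped across events, so the model learns to assign higher likelihood to the faithful description.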

Why it matters?

This work is important because it pinpoints a specific weakness in video AI: its limited ability to correctly piece together events that unfold over time in long videos. By identifying SAH and developing methods to reduce it, the researchers are helping to build more reliable and accurate AI systems that can truly 'understand' what is happening in videos, which is crucial for applications like self-driving cars, video surveillance, and automated content analysis.

Abstract

Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the causes of hallucination. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in the SAH ratio.