Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

Meng Luo, Shengqiong Wu, Liqiang Jing, Tianjie Ju, Li Zheng, Jinxiang Lai, Tianlong Wu, Xinya Du, Jian Li, Siyuan Yan, Jiebo Luo, William Yang Wang, Hao Fei, Mong-Li Lee, Wynne Hsu

2025-09-16

Summary

This paper focuses on a problem with Large Video Models (LVMs), the new generation of video understanding AI: they sometimes 'hallucinate,' meaning they generate information that isn't actually present in the video they're analyzing.

What's the problem?

LVMs are getting better at understanding videos, but they frequently describe things that aren't there. This is a big issue because it makes the models unreliable and untrustworthy. Imagine an AI describing a video and claiming something happened that didn't: that could lead to misunderstandings or incorrect decisions.

What's the solution?

The researchers created a system called Dr.V to find these hallucinations. Dr.V works in stages, much like how a person understands a video: first it checks what is actually visible in specific parts of the video (the perception level), then it tracks how things change over time (the temporal level), and finally it uses reasoning to check whether the AI's description matches what actually happened (the cognition level). They also built a large dataset, Dr.V-Bench, with about 10,000 question instances drawn from nearly 5,000 videos, each annotated with detailed spatial-temporal notes, to help test and improve Dr.V.
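To make the staged pipeline concrete, here is a minimal sketch of how a hierarchical hallucination check could work. This is a toy illustration, not the authors' implementation: the data structures, helper functions, and pass/fail rules (`perception_check`, `temporal_check`, `cognition_check`) are all hypothetical stand-ins for the paper's grounding and reasoning modules.

```python
from dataclasses import dataclass

# Hypothetical sketch of a perception -> temporal -> cognition pipeline.
# A "video" is mocked as per-frame object annotations plus an ordered
# list of events; the real Dr.V-Agent uses spatial-temporal grounding.

@dataclass
class Claim:
    objects: list      # entities the model's answer mentions
    order: list        # (earlier_event, later_event) pairs it asserts
    conclusion: str    # the high-level statement to verify

def perception_check(frames, claim):
    """Perceptive level: is every mentioned object grounded in some frame?"""
    seen = {obj for frame in frames for obj in frame["objects"]}
    missing = [o for o in claim.objects if o not in seen]
    return (not missing, missing)

def temporal_check(events, claim):
    """Temporal level: does the asserted event order match the timeline?"""
    index = {e: i for i, e in enumerate(events)}
    bad = [(a, b) for a, b in claim.order
           if a not in index or b not in index or index[a] >= index[b]]
    return (not bad, bad)

def cognition_check(events, claim):
    """Cognitive level: toy rule standing in for LLM-based reasoning."""
    ok = claim.conclusion == "dog catches ball" and "catch" in events
    return (ok, claim.conclusion)

def diagnose(frames, events, claim):
    """Run the levels in order; the first failure names the hallucination type."""
    for level, (passed, evidence) in [
        ("perception", perception_check(frames, claim)),
        ("temporal", temporal_check(events, claim)),
        ("cognition", cognition_check(events, claim)),
    ]:
        if not passed:
            return {"hallucination": level, "evidence": evidence}
    return {"hallucination": None, "evidence": None}
```

The key design idea this mirrors is that failures are caught at the earliest level possible: a claim about an object that never appears is flagged as a perceptive hallucination before any temporal or reasoning checks run, which is what makes the diagnosis interpretable.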

Why it matters?

This work is important because it provides a way to diagnose and potentially fix the hallucination problem in video AI. By making these models more accurate and reliable, we can use them for important real-world applications like self-driving cars, video surveillance, and content analysis with more confidence.

Abstract

Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with input videos. To address this issue, we propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotations. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.