Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang
2026-01-05
Summary
This paper focuses on improving how well AI models understand videos, specifically addressing the problem of 'hallucination': the models make things up that aren't actually happening in the video, especially when the video shows something unusual or defies common sense.
What's the problem?
Current AI models that process both visual and text inputs, called Multimodal Large Language Models (MLLMs), tend to rely too heavily on what they already 'know' from text data. If a video shows something unexpected, like a cat flying, the model might still answer questions about it as if cats can't fly, because that's what it learned from text. This happens because these models are trained on far more text than video, creating an imbalance. Collecting enough examples of unusual 'counterfactual' videos to correct this is expensive and time-consuming.
What's the solution?
The researchers created a system called DualityForge that automatically generates these unusual video examples. It takes real videos and uses AI-powered editing to change them into counterfactual scenarios, like making a cat fly. It then creates questions and answers about both the original and edited videos. This paired data is used to train the model with a special method, Duality-Normalized Advantage Training (DNA-Train), which teaches it to rely more on what it *sees* in the video and less on its pre-existing knowledge. The researchers also released a large dataset, DualityVidQA, built using this pipeline.
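The paper names the key trick in Duality-Normalized Advantage Training as pair-wise ℓ₁ advantage normalization, but this summary does not spell out the formula. The sketch below shows one plausible reading, assuming GRPO-style advantages that are centered jointly over the rollouts of an original/edited video pair and rescaled by the mean absolute advantage (an ℓ₁ scale) instead of the usual standard deviation. The function name and the exact normalization are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def pairwise_l1_normalized_advantages(rewards_orig, rewards_edit):
    """Hypothetical sketch of pair-wise l1 advantage normalization.

    rewards_orig / rewards_edit: per-rollout rewards for the original
    and the counterfactually edited video of one pair.
    """
    # Pool the two halves of the pair into one advantage group.
    rewards = np.concatenate([rewards_orig, rewards_edit]).astype(float)
    # Center over the whole pair, so original and edited rollouts
    # compete against a shared baseline.
    adv = rewards - rewards.mean()
    # Rescale by the mean absolute advantage (l1 scale) rather than
    # the standard deviation; epsilon guards against all-equal rewards.
    scale = np.abs(adv).mean() + 1e-8
    adv = adv / scale
    n = len(rewards_orig)
    return adv[:n], adv[n:]
```

Under this reading, the ℓ₁ scale is less sensitive to a single outlier rollout than a standard-deviation normalizer, which is one way such a scheme could yield the "more stable and efficient policy optimization" the abstract claims.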
Why it matters?
This work is important because it makes AI models more reliable when understanding videos, especially in situations that aren't typical. By reducing these 'hallucinations,' the AI can be used for more accurate video analysis in areas like robotics, self-driving cars, and video surveillance, where making correct interpretations is crucial. The open-sourcing of the dataset and code will also help other researchers build on this work.
Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime whose RL phase applies pair-wise ℓ₁ advantage normalization, thereby enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.