Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He
2026-04-08
Summary
This paper introduces a new, more challenging benchmark called Video-MME-v2 for testing how well computers understand videos, aiming to get a more realistic measure of their abilities than current tests provide.
What's the problem?
Existing tests for video understanding are becoming too easy, meaning models can get high scores without actually *understanding* the video very well. There is a disconnect between how well models perform on these tests and how they perform in real-world situations. Current evaluations also tend to check only whether an answer is right or wrong, without verifying that the reasoning behind it makes sense.
What's the solution?
The researchers created Video-MME-v2, which tests video understanding at three increasing levels of difficulty: first, aggregating information from different parts of a video; then, understanding how things change over time; and finally, combining visual information with other modalities such as text. They also developed a new way to score answers: instead of checking each question in isolation, it requires answers to related questions to be consistent and the reasoning used to reach them to be logical, penalizing lucky guesses. The benchmark was built with substantial human effort (12 annotators, 50 independent reviewers, 3,300 human-hours, and up to five rounds of quality assurance) to ensure the questions and answers are high quality.
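The exact scoring formula is not spelled out here, but the core mechanism of the group-based strategy, awarding credit only when every related question in a group is answered correctly, can be illustrated. The following is a minimal Python sketch under stated assumptions: the record schema (`group_id`, `correct` fields) and the all-or-nothing rule are simplifications for illustration, not the authors' exact metric, which additionally checks coherence of multi-step reasoning.

```python
from collections import defaultdict

def group_score(records):
    """Sketch of group-based, non-linear scoring: a group of related
    questions earns credit only if every question in it is answered
    correctly, so an isolated lucky guess within a group earns nothing.

    `records` uses an assumed schema (dicts with 'group_id' and
    'correct' keys); the benchmark's real data format may differ.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["group_id"]].append(r["correct"])

    # All-or-nothing credit per group, averaged over all groups.
    credited = sum(all(answers) for answers in groups.values())
    return credited / len(groups) if groups else 0.0

# Example: two groups of related questions about the same video.
records = [
    {"group_id": "clip1", "correct": True},
    {"group_id": "clip1", "correct": True},   # fully consistent -> credit
    {"group_id": "clip2", "correct": True},
    {"group_id": "clip2", "correct": False},  # one miss -> no credit
]
print(group_score(records))  # 0.5: only the first group is credited
```

Under such a rule, a model that guesses one question right in each group scores far lower than it would under per-question accuracy, which is the behavior the paper describes as penalizing fragmented or guess-based correctness.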
Why it matters?
Video-MME-v2 highlights that even the most advanced AI models, like Gemini-3-Pro, still have significant gaps in their video understanding compared to humans. It pinpoints specific areas where models struggle, such as aggregating visual information and tracking changes over time, and shows how errors at these lower levels propagate to limit higher-level reasoning. This new benchmark will help push the development of better, more reliable video understanding AI by providing a more demanding and realistic test.
Abstract
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Moreover, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck in which errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues: it improves performance when subtitles are available but can degrade it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.