OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang
2025-10-14
Summary
This paper introduces a new way to test how well artificial intelligence understands videos by looking at both what's happening visually and what's being said in the audio. It focuses on whether the AI can truly combine information from both senses to answer questions about the video.
What's the problem?
Current tests for AI video understanding aren't very good at checking if the AI is *actually* using both the video and the sound together. Many tests ignore one of the senses, or they ask questions that don't make logical sense when considering both audio and visuals. This means we don't really know how well AI can truly 'understand' a video like a human does.
What's the solution?
The researchers created a new benchmark called OmniVideoBench. It includes 1000 questions about 628 different videos ranging from a few seconds to 30 minutes in length. Each question comes with a step-by-step explanation of the reasoning needed to answer it correctly, and the questions span 13 types of skills, such as understanding when events happen, where things are located, counting objects, figuring out cause and effect, and summarizing the video. The researchers then tested several AI models on this benchmark.
Why it matters?
This work is important because it shows that current AI models still struggle with truly understanding videos by combining audio and visual information. The new benchmark will help researchers develop better AI that can reason about videos more like humans, leading to more capable and reliable AI systems.
Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.
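For readers who want a concrete picture of what such a benchmark entry might look like, the sketch below shows one plausible way to represent a QA pair with its step-by-step reasoning trace and to score multiple-choice accuracy in Python. The paper does not specify its released data format or evaluation script here, so the class `QAItem`, its fields, and the `accuracy` helper are illustrative assumptions rather than the authors' actual code.

```python
# Illustrative sketch only: every field name and function below is an assumption
# about how a multiple-choice audio-visual QA benchmark like this is typically
# stored and scored, not the released OmniVideoBench format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class QAItem:
    video_id: str                 # hypothetical identifier for one of the 628 videos
    question: str                 # the audio-visual question
    options: List[str]            # candidate answers (multiple choice assumed)
    answer: str                   # the manually verified correct option
    question_type: str            # one of the 13 types, e.g. "temporal reasoning"
    reasoning_trace: List[str] = field(default_factory=list)  # step-by-step annotation

def accuracy(items: List[QAItem], predictions: List[str]) -> float:
    """Fraction of questions whose predicted option matches the verified answer."""
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items) if items else 0.0
```

Under these assumptions, evaluating a model amounts to collecting one predicted option per question and comparing it against the verified answer; per-type scores could be obtained by grouping items on `question_type` before calling `accuracy`.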