LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang
2026-03-20
Summary
This paper introduces a new benchmark for testing how well artificial intelligence models understand long videos and audio, going beyond the short clips typically used in evaluation.
What's the problem?
Current AI models that can process both audio and video, called OmniLLMs, are usually tested on short clips of roughly 10 seconds to 5 minutes. This doesn't reflect real-world situations where videos are often much longer, such as lectures or movies. As a result, we don't really know how well these models can follow events that unfold over long stretches of time, remember details from earlier in the video, or connect what's happening visually with what's being said.
What's the solution?
Researchers created a new benchmark called LVOmniBench. It includes 275 videos ranging from 10 to 90 minutes in length, along with 1,014 question-answer pairs about their content. They then tested several existing OmniLLMs on this benchmark to see how well they performed. The questions target abilities such as remembering events over the long term, pinpointing when things happen in the video, understanding fine-grained details, and combining information from the audio and the video.
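To make the evaluation setup concrete, here is a minimal sketch of how accuracy over such question-answer pairs could be computed. The record fields (`video_id`, `category`, `question`, `options`, `answer`) and the `model.answer` call are hypothetical placeholders for illustration only; they are not the benchmark's actual schema or API.

```python
from collections import defaultdict

# Hypothetical QA record layout; the actual LVOmniBench format may differ.
qa_pairs = [
    {
        "video_id": "vid_001",                 # source video (10-90 minutes long)
        "category": "temporal_localization",   # one of the evaluated skills
        "question": "When does the speaker first mention the budget?",
        "options": ["A", "B", "C", "D"],
        "answer": "B",
    },
    # ... 1,014 QA pairs in total
]

def evaluate(model, qa_pairs):
    """Compute overall and per-category accuracy for a multiple-choice QA set."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qa in qa_pairs:
        # `model.answer` stands in for whatever inference call the model exposes;
        # it is assumed to return one option letter after processing the full
        # audio-video input for the given video.
        prediction = model.answer(qa["video_id"], qa["question"], qa["options"])
        total[qa["category"]] += 1
        if prediction == qa["answer"]:
            correct[qa["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category
```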
Why it matters?
This work is important because it shows that current AI models struggle with long-form audio and video: open-source models generally score below 35% accuracy, and even the strongest model tested, Gemini 3 Pro, only reaches around 65%. By creating this challenging benchmark, the researchers hope to encourage the development of models that can truly understand complex, real-world videos and audio.
Abstract
Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. The dataset consists of high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.