MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu
2025-10-22
Summary
This paper introduces a new way to test how well AI models understand videos over the course of a conversation. These AI models, called Multimodal Large Language Models (MLLMs), are getting better at 'seeing' and understanding images and videos, but current tests don't really challenge them with the kind of back-and-forth discussion you'd have in real life.
What's the problem?
Existing tests for these AI video understanding systems ask only a single question at a time. Real-world situations usually involve a series of questions and answers, where each response builds on the previous ones. Current tests therefore don't accurately reflect how these AI models would perform in a more natural, interactive setting, like discussing a sports game or getting homework help based on a video.
What's the solution?
The researchers created a new benchmark called MT-Video-Bench. It contains 987 carefully curated conversations about videos, drawn from diverse domains. These aren't just lists of standalone questions; they're multi-turn dialogues that require the AI to remember earlier parts of the conversation and use that information to answer new questions. The researchers then evaluated several state-of-the-art open-source and closed-source AI models on the benchmark to see how well they handle these extended conversations (a rough sketch of this kind of evaluation loop follows below).
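To make the setup concrete, here is a minimal sketch of what a multi-turn video-dialogue evaluation loop might look like. The `EchoModel` stub, the `generate` method, and the exact-match scoring are all hypothetical placeholders, not MT-Video-Bench's actual API or metric; the point is only that the growing dialogue history is re-sent on each turn, so later answers can depend on earlier ones.

```python
# Minimal sketch of multi-turn video-dialogue evaluation (illustrative only).
# `model.generate`, the dialogue format, and the scoring rule are assumptions,
# not MT-Video-Bench's real interface.

def evaluate_dialogue(model, video_path, turns):
    """Run one multi-turn dialogue and return per-dialogue accuracy.

    `turns` is a list of (question, reference_answer) pairs. The accumulated
    `history` is passed back to the model on every turn, which is exactly the
    context-carrying behavior the benchmark is designed to probe.
    """
    history = []   # (question, answer) pairs seen so far
    correct = 0
    for question, reference in turns:
        answer = model.generate(video=video_path,
                                history=history,
                                question=question)
        history.append((question, answer))
        # Real benchmarks use task-specific scoring; exact match stands in here.
        correct += int(answer.strip() == reference.strip())
    return correct / len(turns)


class EchoModel:
    """Stub model for demonstration; a real MLLM client would go here."""
    def generate(self, video, history, question):
        return f"answer {len(history) + 1}"


score = evaluate_dialogue(
    EchoModel(),
    "game_clip.mp4",
    [("What sport is being played?", "answer 1"),
     ("Which team scored first?", "answer 2")],
)
print(f"dialogue accuracy: {score:.2f}")  # -> 1.00 for this toy stub
```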
Why it matters?
This new benchmark is important because it provides a more realistic way to evaluate AI's video understanding abilities. By identifying the weaknesses of current AI models in handling multi-turn video dialogues, it helps researchers focus on improving these areas, ultimately leading to more helpful and intelligent AI systems that can truly understand and interact with the visual world around us.
Abstract
The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.