MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan
2025-01-22

Summary
This paper introduces MMVU, a new benchmark for testing how well AI systems can understand videos across different subjects such as science, healthcare, humanities and social sciences, and engineering. It's like creating a super-advanced quiz for AI that covers many different topics and requires expert-level knowledge to answer correctly.
What's the problem?
Current ways of testing AI's ability to understand videos are too simple. They usually just check if the AI can see what's happening in the video, but don't test if the AI really understands the deeper meaning or can use expert knowledge to analyze what's going on. It's like only asking an AI to describe what it sees in a science experiment video, instead of asking it to explain the scientific principles behind the experiment.
What's the solution?
The researchers created MMVU, which includes 3,000 tough questions about videos covering 27 different subjects. The questions were written from scratch by experts in each field and require deep, domain-specific understanding to answer correctly. Strict quality-control checks were used to make sure every question met a high standard, and each question comes with an expert-written explanation of the correct answer plus the relevant background knowledge. The researchers then tested 32 of the most advanced AI systems on MMVU to see how well they could handle these expert-level questions.
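To make that description concrete, here is a minimal sketch of what a single MMVU-style example might contain, based only on the fields mentioned above (a question about a video, the correct answer, an expert-written rationale, and the relevant domain knowledge). The field names, file path, and values are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical illustration of one expert-annotated, MMVU-style example.
# Field names and values are assumptions based on the paper's description,
# not the benchmark's actual data format.

example = {
    "video": "videos/chemistry_titration_demo.mp4",  # hypothetical path
    "discipline": "Science",
    "subject": "Chemistry",
    "question": (
        "Based on the color change observed at the end of the experiment, "
        "what does the indicator reveal about the solution?"
    ),
    "answer": "The titration has passed the equivalence point and the solution is basic.",
    "rationale": (
        "Expert-written explanation of the reasoning steps that connect "
        "the visual observation in the video to the correct answer."
    ),
    "domain_knowledge": [
        "Acid-base titration",
        "pH indicators",
    ],
}

# An evaluation loop would iterate over such records, show the video and
# question to a model, and compare its response against the annotation.
for field, value in example.items():
    print(f"{field}: {value}")
```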
Why it matters?
This matters because as AI becomes more advanced, we need better ways to test what it can really do. MMVU helps us understand where AI is strong and where it still needs improvement when it comes to understanding complex videos. This could lead to developing AI that can truly understand and analyze videos like an expert in different fields, which could be useful in areas like education, scientific research, or medical diagnosis. Even though the best AI systems did well on the test, they still couldn't match human experts, showing there's still room for improvement in making AI smarter and more knowledgeable across different subjects.
Abstract
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.