Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu
2025-01-24
Summary
This paper introduces Video-MMMU, a new benchmark for testing how well artificial intelligence (AI) models can learn from educational videos. It's like a special exam for AI that checks whether they can understand and apply information from videos as well as humans do.
What's the problem?
Current AI models, called Large Multimodal Models (LMMs), can work with different types of information like text, images, and videos. But we don't have a good way to test whether these models can actually learn from videos the way humans do. It's like having a smart student who can watch educational videos, but having no way to tell whether they're really learning and understanding the material.
What's the solution?
The researchers created Video-MMMU, which is like a big exam for AI. They collected 300 expert-level videos across six disciplines, including art, business, and science, and wrote 900 questions about them. The questions check whether the AI can do three things: notice important information (Perception), understand the concepts (Comprehension), and apply what it learned to solve new problems (Adaptation). They also propose a metric, Δknowledge, that measures how much a model's performance improves after watching the videos.
Why it matters?
This matters because, as AI becomes more advanced, we want it to learn and adapt the way humans do. If AI could effectively learn from videos, it could power things like personalized education or quickly train robots for new tasks. The study shows that current AI models aren't as good at learning from videos as humans are, especially when it comes to applying that knowledge to new problems. This tells us that we need to make AI better at learning from videos, which could lead to smarter and more adaptable AI systems in the future.
Abstract
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.
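To make the knowledge-gain idea concrete, here is a minimal sketch of how such a metric could be computed. The exact normalization used by Video-MMMU is an assumption here: the sketch measures the accuracy gain after watching the video relative to the headroom the model had left before watching it.

```python
# Hedged sketch of a normalized knowledge-gain score in the spirit of the
# paper's Δknowledge metric. The normalization below (gain divided by the
# remaining headroom above the pre-video accuracy) is an assumption, not
# the paper's confirmed definition.

def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain, in percent.

    acc_before: accuracy (0-100) on the questions without seeing the video.
    acc_after:  accuracy (0-100) on the same questions after watching it.
    """
    headroom = 100.0 - acc_before
    if headroom == 0.0:
        return 0.0  # already perfect before the video; no room to improve
    return (acc_after - acc_before) / headroom * 100.0

# Example: a model scores 40% before and 55% after watching the videos.
print(delta_knowledge(40.0, 55.0))  # 25.0 -> it closed 25% of the remaining gap
```

Normalizing by the remaining headroom keeps scores comparable across models of different starting strength: a jump from 40% to 55% and a jump from 80% to 85% both close 25% of the available gap.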