ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang, Yinan He, Qingsong Zhao, Kai Chen, Yu Qiao, Limin Wang, Manabu Okumura, Yi Wang
2025-10-15
Summary
This paper introduces ExpVid, a new benchmark for testing how well artificial intelligence models that can understand both video and text grasp the details of scientific experiments shown in videos.
What's the problem?
Current tests for these AI models don't accurately reflect the complexity of real lab work, especially in fields like chemistry or biology where things are done hands-on. Existing benchmarks focus on simple recognition, but don't challenge the AI to understand the *process* of an experiment, the order of steps, or how the experiment connects to its overall scientific goal. They miss the subtle details and long-term tracking needed to truly understand what's happening in a lab.
What's the solution?
The researchers created a new benchmark called ExpVid, built from videos of actual scientific experiments drawn from peer-reviewed video publications. ExpVid tests the AI at three levels: recognizing tools, materials, and actions; understanding the correct order and completeness of steps; and finally, connecting the full experiment to the scientific conclusions presented in the paper. They used a combination of automated tools and expert scientists from multiple disciplines to carefully label the videos, ensuring the AI has to really *see* what is happening rather than guess from text alone. They then tested 19 different AI models on this benchmark.
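To make the three-level design concrete, here is a minimal sketch of how an evaluation over such a benchmark could be run, assuming a simple multiple-choice format. The sample schema, the field names, and the `ask_model` function are illustrative assumptions, not ExpVid's released data format or API.

```python
# Minimal sketch of a per-level evaluation loop (assumed schema, not ExpVid's actual format).
import json
from collections import defaultdict

def ask_model(video_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a call to a video-capable MLLM; should return one option letter."""
    raise NotImplementedError

def evaluate(samples_path: str) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(samples_path) as f:
        for line in f:
            # Assumed fields per line: level ("perception" / "procedure" / "reasoning"),
            # video, question, options, and answer (the correct option letter).
            s = json.loads(line)
            pred = ask_model(s["video"], s["question"], s["options"])
            total[s["level"]] += 1
            correct[s["level"]] += int(pred == s["answer"])
    # Report accuracy separately for each level of the task hierarchy.
    return {level: correct[level] / total[level] for level in total}
```

Scoring each level separately, rather than reporting a single number, is what lets a benchmark like this distinguish models that only recognize objects from models that can also track procedures and reason about experimental outcomes.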
Why it matters?
This work is important because it shows that current AI models, while good at coarse recognition, struggle with the finer perception and longer-range reasoning needed to understand and potentially assist with scientific research. It also highlights a notable performance gap between proprietary models and openly available ones, particularly on higher-order reasoning, and it charts a clear path for improving AI so it can become a reliable partner for scientists in the lab.
Abstract
Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.