SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang

2025-10-10

Summary

This paper introduces a new benchmark called SciVideoBench to better test how well artificial intelligence models can understand and reason about scientific videos.

What's the problem?

Current AI benchmarks for video understanding are too easy. They mostly test whether a model can recognize what appears in a video, not whether it truly *understands* the science happening within it. Because these benchmarks demand little complex reasoning or specialized scientific knowledge, top models have effectively saturated them: they score well without demonstrating higher-level thinking, which makes it hard to measure real progress.

What's the solution?

The researchers created SciVideoBench, a collection of 1,000 multiple-choice questions based on real scientific experiment videos spanning more than 25 fields of study. Answering them takes more than just *seeing* what happens: a model must understand the underlying scientific concepts, track fine-grained details in the video over time, and reason logically to arrive at the correct answer. The researchers then evaluated several advanced AI models, including some of the best available, on this benchmark.
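To make the evaluation setup concrete, here is a minimal sketch of how accuracy is typically computed on a multiple-choice video benchmark of this kind. The JSON schema, the file name `scivideobench.json`, and the `model.answer_question` call are hypothetical placeholders for illustration, not the paper's actual data format or API.

```python
import json

def evaluate(model, questions_path="scivideobench.json"):
    """Score a model on multiple-choice video questions.

    Assumes each record holds a video path, a question, lettered
    options, and a ground-truth letter -- a hypothetical schema,
    not the benchmark's actual release format.
    """
    with open(questions_path) as f:
        questions = json.load(f)

    correct = 0
    for q in questions:
        # Prompt the model with the video, the question, and the
        # options; `answer_question` stands in for whatever inference
        # API the model exposes and is assumed to return one letter.
        prediction = model.answer_question(
            video=q["video_path"],
            question=q["question"],
            options=q["options"],  # e.g. {"A": ..., "B": ..., "C": ..., "D": ...}
        )
        if prediction.strip().upper() == q["answer"]:
            correct += 1

    accuracy = correct / len(questions)
    print(f"Accuracy: {accuracy:.1%} ({correct}/{len(questions)})")
    return accuracy
```

Multiple-choice scoring like this is deliberately simple; the benchmark's difficulty comes from the questions themselves, which require domain knowledge and temporal reasoning rather than clever prompting.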

Why it matters?

This new benchmark is important because it shows that even the most advanced AI models still struggle with complex scientific video reasoning. It highlights areas where AI needs to improve to become a truly helpful tool for scientists, and provides a way to measure progress in developing AI that can actually assist with scientific discovery. It pushes the field towards building AI that can act as a 'co-scientist'.

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception/recognition while involving only relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will fit the interests of the community and help push the boundary of cutting-edge AI for broader science.