
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang

2025-12-18


Summary

This paper introduces a new way to test how well multimodal large language models (MLLMs), the AI systems that understand both video and language, can grasp spatial relationships in videos. The goal is to push these models toward becoming reliable assistants in the physical world around them.

What's the problem?

Right now there isn't a single, comprehensive test of whether AI models truly understand what's happening in a video: where objects are, how they move through space, and what's likely to happen next. Existing benchmarks each cover only a slice of the skills an AI needs to operate in a real-world environment, which makes it hard to measure progress.

What's the solution?

The researchers created a benchmark called MMSI-Video-Bench. It contains 1,106 human-written questions grounded in 1,278 video clips drawn from 25 datasets plus videos the team recorded themselves, with every question designed and reviewed by 3D vision experts. The questions test four key abilities: understanding what's visible in the video (Perception), figuring out how to achieve goals in the scene (Planning), predicting future events (Prediction), and relating information across different videos (Cross-Video Reasoning). They then evaluated 25 open-source and proprietary AI models on the benchmark and analyzed where each one struggled.
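To make the structure concrete, here is a minimal sketch of how a benchmark item like this could be represented and scored per ability. The field names, the string categories, and the multiple-choice scoring are illustrative assumptions for this sketch, not MMSI-Video-Bench's actual schema or evaluation code.

```python
# Illustrative sketch only: field names and scoring are assumptions,
# not the benchmark's released schema or evaluation code.
from dataclasses import dataclass
from collections import defaultdict

CATEGORIES = ("Perception", "Planning", "Prediction", "Cross-Video Reasoning")

@dataclass
class BenchmarkItem:
    question: str        # human-written question about the clip(s)
    video_clips: list    # one clip path, or several for cross-video items
    category: str        # one of CATEGORIES
    options: list        # candidate answers
    answer: str          # the correct option
    rationale: str = ""  # annotator's explanatory rationale

def accuracy_by_category(items, predictions):
    """Compare model predictions to gold answers, grouped by the four abilities."""
    totals, correct = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        totals[item.category] += 1
        correct[item.category] += int(pred == item.answer)
    return {cat: correct[cat] / totals[cat] for cat in totals}
```

A per-ability breakdown like this is what lets the authors report where models perform near chance and where they fall furthest behind human accuracy.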

Why it matters?

This work is important because it provides a standardized way to evaluate and improve AI's ability to understand and interact with the physical world. By pinpointing where current models fall short, such as geometric reasoning, motion understanding, long-horizon prediction, and matching information across videos, researchers can focus on building better AI assistants for robotics, navigation, and everyday help.

Abstract

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human–AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
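The abstract notes that typical frame-sampling strategies transfer poorly to this reasoning-intensive benchmark. The "typical" strategy in question is uniformly sampling a fixed number of frames before feeding them to an MLLM; the sketch below shows that generic baseline (using OpenCV, with an assumed budget of 16 frames), not anything specific to this paper's pipeline.

```python
# Generic uniform frame sampling, the common baseline the paper reports
# transfers poorly to spatial-reasoning questions. Frame budget is an
# illustrative choice, not a setting from the paper.
import cv2

def sample_frames_uniform(video_path: str, num_frames: int = 16):
    """Return up to num_frames frames spread evenly across the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```

Because spatial questions can hinge on brief motions or fine geometric detail, an evenly spaced subset like this may simply miss the frames a question depends on, which is consistent with the transfer gap the authors report.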