DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao
2025-10-22
Summary
This paper focuses on how well computers understand what's happening in videos where both the camera and objects are moving, a skill called dynamic spatial intelligence.
What's the problem?
Current artificial intelligence models, such as those that combine vision and language, handle 2D tasks and static scenes well. However, they struggle when things are moving, especially when the observer (the camera) is moving too. They confuse the observer's motion with the objects' motion, and they often fall back on assumptions learned in training instead of reasoning about the scene in front of them.
What's the solution?
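The researchers created a new set of videos and questions, called DSI-Bench, specifically designed to test how well AI models understand these dynamic 3D situations. It includes almost 1,000 videos and over 1,700 manually annotated questions, covering nine distinct patterns of observer and object motion. The videos are also presented in spatially and temporally symmetric versions so that a model cannot score well simply by favoring one answer. The researchers then tested 14 different AI models on this benchmark to see where they fell short.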
Why does it matter?
This work is important because understanding movement is crucial for AI to interact with the real world effectively. By identifying the weaknesses of current models with DSI-Bench, the researchers are pointing the way towards building more intelligent systems that can truly 'see' and understand what's happening around them, even when things are in motion.
Abstract
Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.
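To make the idea of spatially and temporally symmetric designs concrete, here is a minimal sketch of how answers could be scored across symmetric variants of one video. It is an illustration under stated assumptions, not the benchmark's actual protocol: the variant names (`original`, `hflip`, `reverse`, `hflip_reverse`), the four direction labels, and the two metrics are all hypothetical. The sketch assumes that a horizontal flip swaps left and right, and that playing the video backward inverts every direction of motion.

```python
from collections import defaultdict

# Hypothetical variant names and labels; none of these come from the paper.
VARIANTS = ("original", "hflip", "reverse", "hflip_reverse")
OPPOSITE = {"left": "right", "right": "left",
            "forward": "backward", "backward": "forward"}


def transform_answer(answer, variant):
    """Map the ground-truth label of the original video to a symmetric variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    if variant in ("hflip", "hflip_reverse") and answer in ("left", "right"):
        # Assumption: a horizontal flip only swaps left and right.
        answer = OPPOSITE[answer]
    if variant in ("reverse", "hflip_reverse"):
        # Assumption: temporal reversal inverts every motion direction.
        answer = OPPOSITE[answer]
    return answer


def score(predictions, base_answers):
    """Return (per-sample accuracy, per-group accuracy).

    predictions: dict mapping (question_id, variant) -> predicted label
    base_answers: dict mapping question_id -> label for the original video
    A group counts as correct only if every one of its variants is correct.
    """
    hits = 0
    group_ok = defaultdict(lambda: True)
    for (qid, variant), pred in predictions.items():
        hit = pred == transform_answer(base_answers[qid], variant)
        hits += hit
        group_ok[qid] = group_ok[qid] and hit
    sample_acc = hits / len(predictions)
    group_acc = sum(group_ok.values()) / len(group_ok)
    return sample_acc, group_acc


if __name__ == "__main__":
    # A biased model that always answers "left", regardless of the video.
    biased = {("q1", v): "left" for v in VARIANTS}
    print(score(biased, {"q1": "left"}))
```

Under these assumptions, a model that always answers "left" is right on exactly half of the variants but on none of the groups, which is how a symmetric design can expose a directional bias that per-sample accuracy alone would understate.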