Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
2025-12-25
Summary
This paper focuses on improving how well vision-language models, which are good at understanding images and text together, can reason about how objects and their spatial relationships change in 3D space over time. It introduces a new training dataset, a benchmark, and a model component designed to help these AI systems better understand dynamic scenes.
What's the problem?
Current vision-language models struggle with 'dynamic spatial reasoning': understanding how the shapes and spatial relationships of objects change in a 3D environment as time passes. A key reason is the scarcity of good training data that captures these kinds of changes. Existing datasets are often limited in scope, lack explicit 3D information, or don't cover complex interactions between objects.
What's the solution?
The researchers created 'DSR Suite,' which includes a new dataset called 'DSR-Train' for learning and 'DSR-Bench' for testing. They built a system that automatically creates multiple-choice questions about videos, using information about object shapes, movements, and camera angles. They also developed a 'Geometry Selection Module' (GSM) that helps the model focus on the most important 3D information when answering questions, preventing it from getting confused by irrelevant details. They then integrated this new data and module into an existing model, Qwen2.5-VL-7B.
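The summary above describes the Geometry Selection Module only at a high level, so here is a minimal, hypothetical sketch of what such a component could look like: the question is condensed into a small set of learnable query tokens, which then cross-attend to features from a frozen 4D reconstruction model, yielding a fixed number of "geometry tokens" to feed into the VLM. The class name, dimensions, and the two-stage attention design below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; NOT the authors' released code.
# Assumption: question tokens condense into a few learnable queries that
# cross-attend to frozen 4D-reconstruction features, producing a compact
# set of geometry tokens appended to the VLM's input sequence.
import torch
import torch.nn as nn

class GeometrySelectionModuleSketch(nn.Module):
    def __init__(self, geo_dim=1024, llm_dim=3584, num_geo_tokens=16, num_heads=8):
        super().__init__()
        # Learnable queries that will be conditioned on the question.
        self.geo_queries = nn.Parameter(torch.randn(num_geo_tokens, llm_dim))
        # Project 4D-reconstruction features into the LLM embedding space.
        self.geo_proj = nn.Linear(geo_dim, llm_dim)
        # Stage 1: fold question semantics into the compact query set.
        self.question_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        # Stage 2: select question-relevant geometry via cross-attention.
        self.geo_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, question_embeds, geo_feats):
        # question_embeds: (B, Lq, llm_dim) text embeddings of the question
        # geo_feats:       (B, Lg, geo_dim) frozen 4D reconstruction features
        B = question_embeds.size(0)
        queries = self.geo_queries.unsqueeze(0).expand(B, -1, -1)
        # Condense the question into the queries.
        queries, _ = self.question_attn(queries, question_embeds, question_embeds)
        # Pull only question-relevant geometric knowledge.
        geo = self.geo_proj(geo_feats)
        geo_tokens, _ = self.geo_attn(queries, geo, geo)
        # Return a small, fixed-size set of geometry tokens.
        return self.out_proj(geo_tokens)  # (B, num_geo_tokens, llm_dim)
```

The key design idea this sketch tries to capture is "targeted extraction": because only a handful of geometry tokens are produced, the model is not flooded with every detail of the reconstructed scene, only the parts relevant to the question.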
Why it matters?
This work is important because improving dynamic spatial reasoning will allow AI systems to better understand the real world, which is constantly changing. This has applications in areas like robotics, self-driving cars, and video analysis, where understanding how objects move and interact is crucial. By providing a new dataset and a targeted model component, this research helps move the field closer to creating AI that can truly 'see' and understand dynamic 3D scenes.
Abstract
Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and relationships evolve in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across dataset, benchmark, and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs; it condenses question semantics and distills question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
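To make the question-generation pipeline more concrete, here is a small, hypothetical sketch of how one question type could be built once geometric cues have been extracted: given an object's 3D trajectory in world coordinates (with camera motion already factored out via estimated poses), the true 3D displacement becomes the correct answer and perturbed values serve as distractors. The function name, the question template, and the distractor scheme are assumptions for illustration only, not the authors' pipeline.

```python
# Illustrative sketch only; not the authors' pipeline. Assumes per-frame
# camera poses and per-object 3D trajectories have already been extracted
# from a video by off-the-shelf reconstruction and tracking models.
import random
import numpy as np

def displacement_question(obj_name, traj_world, distractor_count=3, rng=None):
    """Build one multiple-choice QA about how far an object moves in 3D.

    traj_world: (T, 3) array of the object's 3D positions in world coordinates.
    """
    rng = rng or random.Random(0)
    true_dist = float(np.linalg.norm(traj_world[-1] - traj_world[0]))
    # Plausible distractors: perturb the true answer by random factors.
    options = [true_dist] + [
        true_dist * rng.uniform(0.3, 2.5) for _ in range(distractor_count)
    ]
    rng.shuffle(options)
    answer_idx = options.index(true_dist)
    question = (
        f"Approximately how far does the {obj_name} move in 3D space "
        f"over the clip?"
    )
    choices = [f"{chr(ord('A') + i)}. {d:.1f} m" for i, d in enumerate(options)]
    return {"question": question, "choices": choices,
            "answer": chr(ord('A') + answer_idx)}

# Example usage with a synthetic straight-line trajectory:
traj = np.linspace([0.0, 0.0, 1.0], [2.0, 0.0, 1.0], num=30)
print(displacement_question("red car", traj))
```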