SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie
2025-11-07
Summary
This paper focuses on improving how well AI models understand videos, specifically their ability to reason about where things are and how they move through space over time. It tackles the challenge of teaching video models to handle spatial relationships that unfold across a video.
What's the problem?
Current AI models are good at general video understanding, but they struggle with tasks that require precise spatial reasoning, such as judging distances, reasoning about viewpoints, or tracking objects as they move. A major bottleneck is training data: collecting real-world videos with precise spatial annotations is expensive and time-consuming, which limits how well models can learn these skills.
What's the solution?
The researchers created a framework called SIMS-V that uses 3D simulators to automatically generate large amounts of video training data with perfect spatial ground truth. They then experimented with different types of questions to ask about these simulated videos, to figure out which ones were most helpful for learning spatial reasoning. They found that focusing on just three question types (measuring distances, reasoning from different viewpoints, and tracking objects over time) worked better than covering a broader set of question types. Using this focused training data to fine-tune a 7B video-understanding model, they outperformed a much larger 72B baseline and performed competitively with proprietary commercial systems on real-world spatial reasoning benchmarks.
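To make the idea concrete, here is a minimal sketch of how such question-answer pairs could be generated from a simulator's privileged ground truth (object positions, camera pose, per-frame visibility). The data layout, field names, and question templates are illustrative assumptions, not the paper's actual pipeline or schema.

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def metric_measurement_qa(objects):
    """Metric measurement: ask for the absolute distance between two objects."""
    o1, o2 = random.sample(objects, 2)
    d = euclidean(o1["position"], o2["position"])
    return {"question": f"How far apart are the {o1['name']} and the {o2['name']}?",
            "answer": f"about {d:.1f} meters"}

def perspective_qa(objects, camera):
    """Perspective-dependent reasoning: left/right relative to the camera's heading."""
    o1, o2 = random.sample(objects, 2)
    fwd = (math.cos(camera["yaw"]), math.sin(camera["yaw"]))  # heading on the ground plane

    def lateral(obj):
        dx = obj["position"][0] - camera["position"][0]
        dy = obj["position"][1] - camera["position"][1]
        return fwd[0] * dy - fwd[1] * dx  # 2D cross product: positive means "to the left"

    side = "left" if lateral(o1) > lateral(o2) else "right"
    return {"question": f"From the camera's viewpoint, is the {o1['name']} to the "
                        f"left or right of the {o2['name']}?",
            "answer": side}

def temporal_tracking_qa(frames):
    """Temporal tracking: order in which objects first become visible."""
    first_seen = {}
    for t, frame in enumerate(frames):
        for name in frame["visible_objects"]:
            first_seen.setdefault(name, t)
    ordered = sorted(first_seen, key=first_seen.get)
    return {"question": "Which object becomes visible first in the video?",
            "answer": ordered[0]}

if __name__ == "__main__":
    # Toy ground truth standing in for a simulator export.
    objects = [{"name": "sofa",  "position": (0.0, 0.0, 0.0)},
               {"name": "lamp",  "position": (2.0, 1.0, 0.0)},
               {"name": "table", "position": (4.0, -1.0, 0.0)}]
    camera = {"position": (1.0, -3.0, 1.5), "yaw": math.pi / 2}
    frames = [{"visible_objects": ["sofa"]},
              {"visible_objects": ["sofa", "lamp"]},
              {"visible_objects": ["lamp", "table"]}]
    for qa in (metric_measurement_qa(objects), perspective_qa(objects, camera),
               temporal_tracking_qa(frames)):
        print(qa["question"], "->", qa["answer"])
```

Because every answer is computed directly from simulator state, the labels are exact by construction, which is precisely the annotation quality that is hard to obtain from real footage.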
Why it matters?
This work is important because it shows a way to efficiently teach AI spatial reasoning skills without needing huge amounts of expensive, real-world video data. By using simulations and focusing on the right types of questions, they were able to create a more capable and efficient AI that can better understand the visual world around us, which is crucial for things like robotics, self-driving cars, and virtual reality.
Abstract
Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
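For a sense of how a small, focused fine-tuning set might be assembled from such generators, the sketch below samples a fixed budget evenly across the three question categories and writes it out as instruction-tuning records. The even split, the example budget, and the JSONL record format are assumptions for illustration; the paper's actual data mix and training recipe may differ.

```python
import json
import random

# The three question categories the paper identifies as most effective.
CATEGORIES = ["metric_measurement", "perspective", "temporal_tracking"]

def build_mix(pool, total=25_000, seed=0):
    """Sample a fine-tuning set of `total` examples, split evenly across categories.

    `pool` maps each category name to a list of example dicts
    (e.g., {"video": ..., "question": ..., "answer": ...}).
    """
    random.seed(seed)
    per_category = total // len(CATEGORIES)
    mix = []
    for category in CATEGORIES:
        mix.extend(random.sample(pool[category], per_category))
    random.shuffle(mix)
    return mix

def write_jsonl(examples, path):
    """Write one JSON record per line, a common instruction-tuning format."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

if __name__ == "__main__":
    # Tiny synthetic pool just to exercise the functions.
    pool = {c: [{"category": c, "id": i} for i in range(5)] for c in CATEGORIES}
    mix = build_mix(pool, total=9)
    write_jsonl(mix, "toy_mix.jsonl")
    print(len(mix), "examples written")
```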