MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Xiyang Wu, Zongxia Li, Jihui Jin, Guangyao Shi, Gouthaman KV, Vishnu Raj, Nilotpal Sinha, Jingxi Chen, Fan Du, Dinesh Manocha
2025-11-25
Summary
This paper focuses on improving how well Vision Language Models, which are good at understanding both images and text, can reason about physics in videos. Currently, these models struggle with understanding how things move and interact in a realistic way, limiting their ability to truly understand what's happening in videos, especially those created by AI.
What's the problem?
Vision Language Models are really good at basic video tasks, but they fall apart when you ask them questions that require understanding physics – things like predicting where an object will go after being thrown, or understanding why something falls. This is a big problem because it means they can't fully grasp real-world videos or even videos generated by artificial intelligence, and they can't create physically realistic videos themselves.
What's the solution?
The researchers developed a new method called MASS, short for Motion-Aware Spatial-Temporal grounding. Essentially, they figured out how to give the Vision Language Model extra information about the 3D space and how objects are moving within it. They do this by encoding depth information and tracking each object's motion, then injecting these signals into the model's language space. They also used reinforcement fine-tuning to help the model better connect the visual information with the text. To test this, they created a large benchmark called MASS-Bench, filled with videos and question-answer pairs specifically designed to probe physics understanding.
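To make the idea concrete, here is a minimal sketch of the injection pipeline described above: pool a per-frame depth map into coarse depth tokens, summarize an object track as position-plus-velocity motion features, and prepend both to the text tokens so the language model attends over one combined sequence. All function names and shapes here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def depth_to_tokens(depth_map, grid=4):
    """Pool a per-frame depth map into a coarse grid of scalar depth tokens.
    A stand-in for the paper's depth-based 3D encoding (hypothetical)."""
    h, w = depth_map.shape
    gh, gw = h // grid, w // grid
    pooled = depth_map[:gh * grid, :gw * grid].reshape(grid, gh, grid, gw)
    return pooled.mean(axis=(1, 3)).flatten()  # grid*grid values

def track_to_tokens(track):
    """Summarize an object's (t, x, y, z) trajectory as final 3D position
    plus average velocity. A stand-in for the motion tracker's features."""
    track = np.asarray(track, dtype=float)
    pos = track[-1, 1:]
    dt = max(track[-1, 0] - track[0, 0], 1e-6)
    vel = (track[-1, 1:] - track[0, 1:]) / dt
    return np.concatenate([pos, vel])  # 6 values

def inject_into_language_space(text_tokens, depth_tokens, motion_tokens):
    """Prepend the spatial-temporal signals to the text tokens, giving the
    language model a single sequence spanning both modalities."""
    return np.concatenate([depth_tokens, motion_tokens, text_tokens])

# Toy usage: one 8x8 depth frame, one object track, five text tokens.
depth = np.arange(64, dtype=float).reshape(8, 8)
track = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 2.0, 3.0)]
text = np.zeros(5)
combined = inject_into_language_space(text, depth_to_tokens(depth),
                                      track_to_tokens(track))
print(combined.shape)  # 16 depth + 6 motion + 5 text = (27,)
```

In a real VLM these signals would be projected by a learned adapter into the language embedding space rather than concatenated raw; this sketch only shows where in the pipeline they enter.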
Why it matters?
This work is important because it significantly improves the ability of AI to understand and interact with the physical world as depicted in videos. By closing the gap in physics reasoning, these models become more reliable for applications like analyzing real-world events, understanding AI-generated content, and even creating realistic simulations or videos. The results show the method performs nearly on par with leading closed-source models such as Gemini.
Abstract
Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.