FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin, Boris Ivanovic, Song Han, Trevor Darrell, Jitendra Malik, Marco Pavone, Boyi Li
2025-12-16
Summary
This paper introduces a new way to create large datasets for teaching computers to understand motion in videos, ultimately making them better at predicting what will happen next.
What's the problem?
Currently, teaching computers to understand how things move is difficult because there aren't enough large, detailed datasets available. Creating these datasets is expensive and time-consuming, as it usually requires people to manually label everything that's happening in the videos. This limits how well computers can learn to 'reason' about physical movements.
What's the solution?
The researchers developed a system called FoundationMotion that automatically builds these datasets. It first detects and tracks objects in videos to extract their trajectories. Then it feeds those trajectories, along with the video frames, to powerful AI language models that write descriptions and question-answer pairs about the motion, creating a wealth of training data without any human labeling. They then used this data to fine-tune existing AI models, like NVILA-Video-15B and Qwen2.5-7B.
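The two-stage idea (track objects to get trajectories, then turn trajectories into captions and question-answer pairs) can be sketched in miniature. Everything below is an illustrative stand-in: the function names are hypothetical, the "tracker" returns hand-written dummy trajectories instead of running a real detector, and a template-based question generator stands in for the LLM stage the paper actually uses.

```python
# Hedged sketch of a FoundationMotion-style auto-labeling pipeline.
# All names are illustrative; the real system uses detectors/trackers
# over video frames and an LLM to write captions and QA pairs.

def track_objects(video_frames):
    """Stand-in for the detection + tracking stage.

    Returns per-object trajectories as {object_id: [(x, y), ...]} in
    frame order. Here the trajectories are hand-written dummies; a real
    pipeline would run a detector and tracker over `video_frames`.
    """
    return {
        "car":  [(0, 0), (2, 0), (4, 0), (6, 0)],  # steady motion to the right
        "ball": [(5, 5), (5, 4), (5, 3), (5, 1)],  # speeding up, moving up (image coords)
    }

def motion_summary(traj):
    """Derive coarse motion attributes (direction, mean speed) from a trajectory."""
    (x0, y0), (xn, yn) = traj[0], traj[-1]
    dx, dy = xn - x0, yn - y0
    if abs(dx) >= abs(dy):
        direction = "right" if dx >= 0 else "left"
    else:
        # Image coordinates: y grows downward.
        direction = "down" if dy >= 0 else "up"
    path_length = sum(
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        for (x1, y1), (x2, y2) in zip(traj, traj[1:])
    )
    return {"direction": direction, "speed": path_length / (len(traj) - 1)}

def make_qa_pairs(object_id, summary):
    """Template-based question generator standing in for the LLM stage."""
    return [
        (f"In which direction does the {object_id} move?", summary["direction"]),
        (f"What is the {object_id}'s average speed (units per frame)?",
         f"{summary['speed']:.2f}"),
    ]

if __name__ == "__main__":
    for obj, traj in track_objects(video_frames=None).items():
        for question, answer in make_qa_pairs(obj, motion_summary(traj)):
            print(f"Q: {question}  A: {answer}")
```

The design point the paper's pipeline relies on is that once trajectories exist, motion attributes (direction, speed, relative position) can be computed deterministically, so the language model only has to verbalize them, which is what makes the labeling fully automatic and scalable.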
Why does it matter?
This work is important because it provides a scalable way to create the data needed to significantly improve a computer's ability to understand motion. The fine-tuned models even outperformed some of the most advanced closed-source AI systems, showing that this automated approach is a genuine advance for motion understanding and spatial reasoning.
Abstract
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.