Motion Attribution for Video Generation

Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine

2026-01-14

Summary

This research focuses on understanding how the data used to train video generation models affects the quality of motion in the videos they create.

What's the problem?

While video generation models are improving rapidly, it's not well understood which specific training videos are most responsible for realistic and smooth movement. Existing data attribution methods usually focus on how things *look* visually, not how they *move* over time, and they don't scale to the huge datasets and complex models used today. This makes it hard to improve the motion quality of generated videos.

What's the solution?

The researchers developed a new tool called Motive that pinpoints which training videos have the biggest impact on the motion in generated videos. It does this by focusing on movement specifically, separating it from the static appearance of objects. Motive analyzes large datasets efficiently and identifies clips that either improve or degrade motion quality. The researchers then used Motive to select a new set of training videos and fine-tuned their model on this curated data.
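
To make "focusing on movement" concrete, here is a minimal sketch of what a motion-weighted loss could look like. It uses a simple frame-difference proxy for motion; the function names and the exact weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed, not the paper's code): weight a per-pixel
# reconstruction loss by motion magnitude so static appearance contributes little.
import torch
import torch.nn.functional as F


def motion_weight_mask(frames: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """frames: (B, T, C, H, W) clip in [0, 1]. Returns (B, T, 1, H, W) weights."""
    # Temporal difference as a cheap proxy for optical-flow magnitude.
    diff = frames[:, 1:] - frames[:, :-1]            # (B, T-1, C, H, W)
    mag = diff.abs().mean(dim=2, keepdim=True)       # (B, T-1, 1, H, W)
    mag = torch.cat([mag[:, :1], mag], dim=1)        # pad to (B, T, 1, H, W)
    # Normalize per clip so the average weight is roughly 1.
    return mag / (mag.mean(dim=(1, 2, 3, 4), keepdim=True) + eps)


def motion_weighted_loss(pred: torch.Tensor, target: torch.Tensor,
                         frames: torch.Tensor) -> torch.Tensor:
    """Per-pixel MSE (standing in for a denoising loss) re-weighted by motion."""
    per_pixel = F.mse_loss(pred, target, reduction="none")  # (B, T, C, H, W)
    return (motion_weight_mask(frames) * per_pixel).mean()
```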

Why it matters?

This work is important because it's the first to attribute motion itself, rather than visual appearance, when studying the training data of video generation models. By understanding which clips are crucial for good motion, the researchers improved the smoothness and realism of the generated videos, achieving a 74.1% human preference win rate over the pretrained base model. This opens the door to creating more believable and engaging videos by carefully selecting the data used for training.

Abstract

Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
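
As a rough illustration of how gradient-based data attribution can work, the sketch below scores each training clip by how strongly its motion-loss gradient aligns with the gradient on a query generation (a TracIn-style dot product). The helper names, the `loss_fn` callable, and the brute-force gradient computation are assumptions made for clarity, not Motive's actual estimator.

```python
# Illustrative gradient-alignment attribution (assumed sketch, not Motive's code):
# influence(train_clip, query) ~ <grad L_motion(train_clip), grad L_motion(query)>
import torch


def flat_grad(loss: torch.Tensor, params) -> torch.Tensor:
    """Flatten the gradient of a scalar loss w.r.t. the given parameters."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def influence_scores(model, train_clips, query_clip, loss_fn):
    """Rank training clips by how much their motion-loss gradients point in the
    same direction as the gradient on a query clip (higher = more helpful)."""
    params = [p for p in model.parameters() if p.requires_grad]
    q_grad = flat_grad(loss_fn(model, query_clip), params)
    return [torch.dot(flat_grad(loss_fn(model, clip), params), q_grad).item()
            for clip in train_clips]
```

Clips with the highest scores would then be kept for fine-tuning, which is the data-curation step the abstract describes; a framework that scales to large datasets would rely on approximations rather than full flattened gradients as done here.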