Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu
2025-12-15
Summary
This paper focuses on making computer-generated videos look more realistic, especially for subjects that move and change shape, such as people and animals. The authors introduce a new system, SAM2VideoX, that improves the quality and believability of generated motion.
What's the problem?
Current video generation models often struggle with realistic movement, and simply adding more training data does not eliminate physically impossible or unnatural motions. Existing methods condition on imperfect motion signals, such as optical-flow estimates of how pixels shift or simplified 'skeleton' representations, which introduce errors and limit realism. The core challenge is preserving both the fine details and the overall structure of motion.
What's the solution?
The researchers developed SAM2VideoX, which combines two key ideas. First, they distill features from a highly accurate video tracking model (SAM2) into the video generator, giving it a strong prior for how objects *should* move. Second, they use a dedicated loss function, called the Local Gram Flow loss, that keeps the fine details of movement consistent and realistic. Together, these components teach the video generator to respect the underlying structure of motion.
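To make the second idea concrete, here is a minimal, hypothetical PyTorch sketch of a "local Gram flow" style loss: it compares how local channel correlations (Gram matrices over small spatial patches) change from frame to frame in the generator's features versus the tracker's features. The exact formulation in the paper may differ; the patch size, tensor shapes, the frame-difference definition of "flow", and the assumption that both feature maps share the same channel dimension (in practice a learned projection would likely be needed) are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_gram(feats: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Gram matrices over non-overlapping local patches.

    feats: (B, T, C, H, W) video feature maps; H and W are assumed
    divisible by `patch`. Returns (B, T, P, C, C) local channel
    correlations, where P is the number of patches per frame.
    """
    B, T, C, H, W = feats.shape
    x = feats.reshape(B * T, C, H, W)
    # Extract non-overlapping (patch x patch) blocks: (B*T, C*patch*patch, P)
    x = F.unfold(x, kernel_size=patch, stride=patch)
    P = x.shape[-1]
    x = x.reshape(B * T, C, patch * patch, P).permute(0, 3, 1, 2)  # (B*T, P, C, p*p)
    gram = x @ x.transpose(-1, -2) / (patch * patch)               # (B*T, P, C, C)
    return gram.reshape(B, T, P, C, C)

def local_gram_flow_loss(gen_feats: torch.Tensor,
                         tracker_feats: torch.Tensor,
                         patch: int = 4) -> torch.Tensor:
    """Align how local feature correlations evolve over time.

    'Flow' is taken here as the frame-to-frame difference of local Gram
    matrices; the loss matches that difference between the generator's
    features and the tracking model's features (a sketch, not the
    paper's exact loss).
    """
    g_gen = local_gram(gen_feats, patch)
    g_trk = local_gram(tracker_feats, patch)
    flow_gen = g_gen[:, 1:] - g_gen[:, :-1]
    flow_trk = g_trk[:, 1:] - g_trk[:, :-1]
    return F.mse_loss(flow_gen, flow_trk)
```

The intuition is that a Gram matrix over a small neighborhood summarizes which features move together there, so matching its temporal change encourages locally coherent, structure-preserving motion rather than pixel-exact agreement.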
Why it matters?
This work is important because it significantly improves the realism of computer-generated videos, especially for complex movements. The improvements show up both in automated benchmarks and in human studies, where viewers clearly prefer the new method. Better video generation has applications in areas like special effects, virtual reality, and robotics, making these technologies more immersive and useful.
Abstract
Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/.
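The abstract's first innovation, a bidirectional feature fusion module, takes a causal (autoregressive) tracker and turns it into a source of global, time-symmetric motion priors. The paper's actual architecture is not reproduced here; the sketch below only illustrates one plausible reading: run the tracker over the clip in both temporal directions and fuse the two streams. The `tracker` callable (standing in for a SAM2 feature extractor that returns per-frame pooled features), the MLP fusion head, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Fuse forward- and backward-pass tracker features into one prior.

    A causal tracker only sees the past; running it on the reversed clip
    as well and merging the two feature streams yields features that
    reflect both past and future context. This is an illustrative sketch,
    not the paper's exact module.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, tracker, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); tracker(frames) -> (B, T, D) per-frame features
        # (the pooling to a per-frame vector is an assumption of this sketch).
        fwd = tracker(frames)                 # causal pass, past -> future
        bwd = tracker(frames.flip(dims=[1]))  # causal pass on time-reversed clip
        bwd = bwd.flip(dims=[1])              # re-align to forward time order
        return self.fuse(torch.cat([fwd, bwd], dim=-1))  # (B, T, D)
```

The fused features could then serve as the distillation target for the diffusion model's internal representations, for example via the Local Gram Flow loss sketched earlier.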