Generative Video Motion Editing with 3D Point Tracks
Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, Zhengqi Li
2025-12-02
Summary
This paper introduces a method for editing the motion of both the camera and the objects in a video, giving video editors finer control over how a scene moves.
What's the problem?
Currently, editing motion in videos is difficult. Existing image-to-video methods don't take the whole source video into account, which leads to inconsistencies, while video-to-video methods only allow simple changes like shifting an object from one place to another, without fine-grained control over object motion. Both struggle with complex movements and with figuring out which objects are in front of or behind others.
What's the solution?
The researchers developed a system that takes both the original video and paired 3D 'tracks': one set of tracks records how points in the scene actually move, and a second set describes how they should move after the edit. These 3D tracks act like a guide, telling the system how to change the motion while keeping everything looking realistic. Using 3D tracks instead of 2D ones gives the system explicit depth information, so it can handle situations where objects overlap and make the edits more precise.
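To make that conditioning input concrete, here is a minimal sketch of what paired 3D point tracks might look like as data. The names (`TrackConditioning`, `translate_object`) and array shapes are illustrative assumptions, not the paper's actual interface:

```python
# Hypothetical layout for the conditioning signal described above: the
# source video plus, for each of N tracked points, a source trajectory
# and a target trajectory of 3D positions over T frames.
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackConditioning:
    """Inputs to a track-conditioned video-to-video model (illustrative)."""
    source_video: np.ndarray    # (T, H, W, 3) RGB frames of the original clip
    source_tracks: np.ndarray   # (T, N, 3) per-frame 3D positions of N points
    target_tracks: np.ndarray   # (T, N, 3) the same N points under the edited motion
    visibility: np.ndarray      # (T, N) bool, whether each point is visible

def translate_object(tracks: np.ndarray, point_ids: np.ndarray,
                     offset: np.ndarray) -> np.ndarray:
    """Build target tracks by rigidly shifting one object's points in 3D."""
    target = tracks.copy()
    target[:, point_ids, :] += offset  # broadcast (3,) offset over frames/points
    return target

# Example edit: move the points belonging to one object 0.5 units along +x.
T, N = 48, 256
rng = np.random.default_rng(0)
source_tracks = rng.standard_normal((T, N, 3))
object_points = np.arange(0, 64)  # indices of the object's tracked points
target_tracks = translate_object(source_tracks, object_points,
                                 np.array([0.5, 0.0, 0.0]))

cond = TrackConditioning(
    source_video=np.zeros((T, 270, 480, 3), dtype=np.uint8),
    source_tracks=source_tracks,
    target_tracks=target_tracks,
    visibility=np.ones((T, N), dtype=bool),
)
```

The key idea the sketch captures is that source and target tracks share point identities, so each edited trajectory stays tied to the same piece of scene content in the original video.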
Why it matters?
This research matters because it opens up new possibilities for video editing. It enables more complex and creative motion edits, like manipulating the camera and objects simultaneously, transferring motion from one video to another, or deforming objects in a realistic way. This could be useful for filmmakers, animators, and anyone who wants to edit video with greater control.
Abstract
Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
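To see why the explicit depth cue in 3D tracks matters, here is a toy illustration, not the paper's implementation: when two tracked points project to the same pixel, only their camera-space depths reveal which one occludes the other, something 2D tracks alone cannot decide. The function name and pinhole parameters below are assumptions for the example:

```python
# Toy occlusion check: project 3D track points through a pinhole camera
# and use depth to resolve which point is in front at a shared pixel.
import numpy as np

def project_with_depth(points_3d: np.ndarray, fx: float, fy: float,
                       cx: float, cy: float):
    """Pinhole projection of camera-space points; returns (u, v) and depth z."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=1), z

# Two tracked points lying on (nearly) the same camera ray:
points = np.array([[0.10, 0.05, 2.0],    # nearer point
                   [0.20, 0.10, 4.0]])   # farther point, same ray
uv, depth = project_with_depth(points, fx=500, fy=500, cx=320, cy=240)
front = np.argmin(depth)  # the point that occludes the other

print(uv)     # both project to ~(345.0, 252.5)
print(front)  # 0 -> the nearer point is in front
```

In 2D, both points are just the same pixel trajectory; with 3D tracks the depth ordering is explicit, which is the property the abstract credits for precise occlusion handling.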