Trajectory Attention for Fine-grained Video Motion Control
Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan
2024-12-02

Summary
This paper introduces trajectory attention, a method that improves video generation by enabling precise control of camera motion, which is essential for creating view-customized visual content.
What's the problem?
Generating videos with accurate camera motion is challenging. Existing methods often fail to keep movement smooth and consistent, producing imprecise outputs. This is especially problematic when the camera must follow a specific path or trajectory, where such errors lead to jarring, unnatural video.
What's the solution?
The authors propose trajectory attention, which performs attention along pixel trajectories, the paths that pixels trace across frames. Rather than replacing the model's existing machinery, it runs as an auxiliary branch alongside traditional temporal attention, and the two branches work in synergy: trajectory attention injects precise motion information, while temporal attention preserves the model's ability to generate new content. The approach also performs well when only partial trajectory information is available.
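To make this concrete, below is a minimal runnable sketch of attention along pixel trajectories. The tensor layout, single-head design, and function names (gather_along_trajectories, trajectory_attention) are illustrative assumptions, not the authors' code: per-frame features are gathered at the pixel each trajectory visits, and attention is then computed over each trajectory's frame sequence, which is the kind of inductive bias the paper describes.

```python
import torch
import torch.nn.functional as F

def gather_along_trajectories(feats, traj):
    """Collect the feature each trajectory touches in every frame.

    feats: [B, T, HW, C]  per-frame token features
    traj:  [B, T, N]      flattened pixel index occupied by each of the
                          N trajectories in each frame
    returns: [B, N, T, C] one feature sequence per trajectory
    """
    B, T, HW, C = feats.shape
    idx = traj.unsqueeze(-1).expand(-1, -1, -1, C)    # [B, T, N, C]
    gathered = torch.gather(feats, dim=2, index=idx)  # [B, T, N, C]
    return gathered.permute(0, 2, 1, 3)               # [B, N, T, C]

def trajectory_attention(feats, traj, w_qkv, w_out):
    """Single-head attention over each trajectory's frame sequence."""
    x = gather_along_trajectories(feats, traj)        # [B, N, T, C]
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)            # each [B, N, T, C]
    out = F.scaled_dot_product_attention(q, k, v)     # attends across T
    return out @ w_out                                # [B, N, T, C]

# Toy shapes: 2 videos, 8 frames, 16x16 latents, 64 channels, 100 trajectories.
B, T, HW, C, N = 2, 8, 256, 64, 100
feats = torch.randn(B, T, HW, C)
traj = torch.randint(0, HW, (B, T, N))                # e.g. from point tracking
w_qkv, w_out = torch.randn(C, 3 * C) * 0.02, torch.randn(C, C) * 0.02
print(trajectory_attention(feats, traj, w_qkv, w_out).shape)  # [2, 100, 8, 64]
```

Because every key and value in a given attention call lies on the same trajectory, the operation directly encourages features along that path to stay consistent across frames.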
Why it matters?
This research matters because it significantly improves the ability to create videos with controlled camera movement, making it easier for creators to produce visually compelling, dynamic content. The method applies to fields such as filmmaking, video games, and virtual reality, where precise camera control is crucial.
Abstract
Recent advancements in video generation have been greatly driven by video diffusion models, with camera motion control emerging as a crucial challenge in creating view-customized visual content. This paper introduces trajectory attention, a novel approach that performs attention along available pixel trajectories for fine-grained camera motion control. Unlike existing methods that often yield imprecise outputs or neglect temporal correlations, our approach possesses a stronger inductive bias that seamlessly injects trajectory information into the video generation process. Importantly, our approach models trajectory attention as an auxiliary branch alongside traditional temporal attention. This design enables the original temporal attention and the trajectory attention to work in synergy, ensuring both precise motion control and new content generation capability, which is critical when the trajectory is only partially available. Experiments on camera motion control for images and videos demonstrate significant improvements in precision and long-range consistency while maintaining high-quality generation. Furthermore, we show that our approach can be extended to other video motion control tasks, such as first-frame-guided video editing, where it excels in maintaining content consistency over large spatial and temporal ranges.
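As a rough picture of the auxiliary-branch design described above, the sketch below grafts a trajectory-attention branch onto an ordinary temporal-attention block. Everything here is an assumption made for illustration, not the released implementation: the FusedTemporalBlock class, the scalar gate (initialized to zero so the block starts out identical to plain temporal attention), and the index_add fusion that writes trajectory results back only to pixels with a known path, which is one plausible way to handle partially available trajectories.

```python
import torch
import torch.nn as nn

class FusedTemporalBlock(nn.Module):
    """Hypothetical fusion of temporal attention with a trajectory branch."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads,
                                                   batch_first=True)
        self.traj_attn = nn.MultiheadAttention(dim, num_heads,
                                               batch_first=True)
        # Gate starts at zero: the trajectory branch is silent at init.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, feats, traj_feats, traj_idx):
        """
        feats:      [B*HW, T, C]  per-pixel temporal token sequences
        traj_feats: [B*N, T, C]   features gathered along trajectories
        traj_idx:   [B*N]         row of `feats` each trajectory writes to
                                  (partial: only pixels with a known path)
        """
        out, _ = self.temporal_attn(feats, feats, feats)
        traj_out, _ = self.traj_attn(traj_feats, traj_feats, traj_feats)
        # Add the gated auxiliary result only where a trajectory exists,
        # leaving uncovered pixels to plain temporal attention.
        return out.index_add(0, traj_idx, self.gate * traj_out)

# Toy usage with made-up shapes.
B, HW, T, C, N = 2, 64, 8, 32, 20
block = FusedTemporalBlock(C)
feats = torch.randn(B * HW, T, C)
traj_feats = torch.randn(B * N, T, C)
traj_idx = torch.randint(0, B * HW, (B * N,))
print(block(feats, traj_feats, traj_idx).shape)  # [128, 8, 32]
```

Zero-initializing the gate is a standard trick when attaching a new branch to a pretrained model: the network initially behaves exactly like the original, and the trajectory signal is blended in gradually during fine-tuning.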