Motion Prompting: Controlling Video Generation with Motion Trajectories
Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, Deqing Sun
2024-12-04

Summary
This paper introduces Motion Prompting, a new method that allows users to control video generation by specifying motion trajectories, making it easier to create dynamic and expressive videos.
What's the problem?
Most existing video generation models rely primarily on text prompts to guide video creation. Text, however, struggles to describe complex movements and how they unfold over time, so generated videos often miss the specific motion a user has in mind and can feel less dynamic or realistic.
What's the solution?
To address this, the researchers train a video generation model conditioned on 'motion prompts': spatio-temporally sparse or dense motion trajectories that specify how points in the scene should move over time. Users can draw simple, sparse trajectories directly, and the system can also translate high-level requests (like 'move the camera around') into detailed, semi-dense trajectories through a process called motion prompt expansion. This flexible representation supports control over both object and camera motion, leading to more dynamic and controllable videos.
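The summary does not include code, so the following is only a minimal sketch of the idea: point trajectories represented as per-frame (x, y) positions with visibility flags, plus a toy 'motion prompt expansion' that turns a high-level camera-pan request into a semi-dense grid of tracks. The names (`Trajectory`, `expand_camera_pan`), shapes, and NumPy implementation are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch; names and shapes are illustrative, not the paper's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Trajectory:
    """One point track: (x, y) positions per frame plus a visibility flag."""
    xy: np.ndarray       # shape (num_frames, 2), pixel coordinates
    visible: np.ndarray  # shape (num_frames,), bool; False = occluded/unspecified

def expand_camera_pan(width, height, num_frames, dx_per_frame, grid=16):
    """Toy 'motion prompt expansion': turn a high-level request
    ('pan the camera right') into a semi-dense set of trajectories
    by sweeping a grid of points horizontally."""
    xs = np.linspace(0, width - 1, grid)
    ys = np.linspace(0, height - 1, grid)
    tracks = []
    for y in ys:
        for x in xs:
            t = np.arange(num_frames)
            xy = np.stack([x + dx_per_frame * t, np.full(num_frames, y)], axis=1)
            tracks.append(Trajectory(xy=xy, visible=np.ones(num_frames, dtype=bool)))
    return tracks

# Example: a 24-frame, 256x256 clip where the scene slides left
# (equivalent to the camera panning right).
prompt = expand_camera_pan(width=256, height=256, num_frames=24, dx_per_frame=-3.0)
print(len(prompt), prompt[0].xy.shape)  # 256 tracks, each of shape (24, 2)
```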
Why it matters?
This research is important because it enhances the ability of AI systems to create high-quality videos that reflect user intentions more accurately. By allowing for detailed control over motion, Motion Prompting can improve applications in filmmaking, animation, and other creative fields where movement plays a crucial role in storytelling and visual impact.
Abstract
Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage: https://motion-prompting.github.io/
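On the modeling side, one plausible way to condition a video model on such trajectories, assumed here for illustration and not confirmed by the paper, is to rasterize them into a per-frame volume of sparse displacement maps plus a visibility mask, which an adapter could consume alongside text or image conditioning. The sketch below reuses the hypothetical `Trajectory` objects from the earlier sketch; `rasterize_tracks` and the channel layout are invented names.

```python
# Illustrative only: one plausible way to rasterize sparse tracks into a
# per-frame conditioning volume; the paper's actual encoding may differ.
import numpy as np

def rasterize_tracks(tracks, num_frames, height, width):
    """Encode tracks as a (num_frames, height, width, 3) volume:
    channels 0-1 hold the displacement to the next frame at each tracked
    pixel, channel 2 is an occupancy mask (zeros everywhere else)."""
    cond = np.zeros((num_frames, height, width, 3), dtype=np.float32)
    for tr in tracks:
        for t in range(num_frames - 1):
            if not (tr.visible[t] and tr.visible[t + 1]):
                continue  # skip frames where the point is occluded/unspecified
            x, y = tr.xy[t]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < width and 0 <= yi < height:
                dx, dy = tr.xy[t + 1] - tr.xy[t]
                cond[t, yi, xi, 0] = dx
                cond[t, yi, xi, 1] = dy
                cond[t, yi, xi, 2] = 1.0  # this pixel carries a motion constraint
    return cond

# Usage with the earlier toy prompt (commented out to keep this self-contained):
# cond = rasterize_tracks(prompt, num_frames=24, height=256, width=256)
# cond could then be fed to an adapter alongside text/image conditioning.
```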