Video Motion Transfer with Diffusion Transformers
Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, Fabio Pizzati
2024-12-11

Summary
This paper introduces DiTFlow, a method for transferring the motion of a reference video to a newly generated one using Diffusion Transformers (DiT).
What's the problem?
Current video generation methods can produce impressive results, but they often control motion with 2D signals that fail to capture the complex movement of objects in three-dimensional space. This limitation makes it difficult to transfer motion accurately from one video to another.
What's the solution?
The authors developed DiTFlow, which uses a technique called Attention Motion Flow (AMF) to describe how objects move in a reference video. DiTFlow runs the reference video through a pre-trained Diffusion Transformer, extracts a patch-wise motion signal from its cross-frame attention maps, and uses that signal to guide the denoising of a new video so that it reproduces the same motion. The approach is training-free: the latents (and, optionally, the transformer's positional embeddings) are optimized at inference time, so the model does not need to be retrained for each new video.
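To make the training-free guidance idea concrete, here is a minimal sketch of what such an optimization-in-the-denoising-loop could look like. It assumes a generic diffusion transformer denoiser; the names `dit`, `extract_amf`, `denoising_step`, and the hyperparameters are illustrative placeholders, not the authors' actual implementation.

```python
import torch

def amf_loss(amf_generated, amf_reference):
    # Penalize differences between the patch-wise motion of the generated
    # video and that of the reference (illustrative L2 loss).
    return ((amf_generated - amf_reference) ** 2).mean()

def guided_denoising(dit, latents, amf_reference, timesteps,
                     guidance_steps=3, lr=0.01):
    """Training-free motion guidance sketch: at each denoising step,
    optimize the latents so the attention-derived motion of the generated
    video matches the reference AMF. All names here are assumptions."""
    for t in timesteps:
        latents = latents.detach().requires_grad_(True)
        optimizer = torch.optim.Adam([latents], lr=lr)
        for _ in range(guidance_steps):
            optimizer.zero_grad()
            # Hypothetical forward pass returning the denoising prediction
            # and the transformer's cross-frame attention maps.
            noise_pred, attn_maps = dit(latents, t, return_attention=True)
            loss = amf_loss(extract_amf(attn_maps), amf_reference)
            loss.backward()
            optimizer.step()
        with torch.no_grad():
            noise_pred, _ = dit(latents, t, return_attention=True)
            latents = denoising_step(latents, noise_pred, t)  # scheduler update
    return latents
```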
Why it matters?
This research is important because it enhances the ability of AI to create realistic videos by allowing the seamless transfer of motion from one video to another. This capability can be useful in various fields such as animation, gaming, and film production, where realistic movement is crucial for storytelling and visual effects.
Abstract
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free manner by optimizing latents with our AMF loss to generate videos reproducing the motion of the reference one. We also apply our optimization strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all across multiple metrics and human evaluation.
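As a rough illustration of how a patch-wise motion signal could be read out of cross-frame attention, the sketch below computes, for each patch in one frame, the 2D offset to the patch it attends to most strongly in the next frame. This is a simplified reading of the abstract, not the paper's exact AMF definition; the tensor layout, helper name, and argmax readout are assumptions.

```python
import torch

def attention_motion_flow(attn, grid_h, grid_w):
    """attn: (F-1, N, N) cross-frame attention, where attn[f, i, j] is the
    weight patch i of frame f places on patch j of frame f+1, and
    N = grid_h * grid_w. Returns a (F-1, N, 2) field of (dy, dx) offsets.
    Layout and readout are illustrative assumptions, not the paper's spec."""
    _, num_patches, _ = attn.shape
    # Index of the most-attended patch in the next frame, per source patch.
    target = attn.argmax(dim=-1)                                   # (F-1, N)
    idx = torch.arange(num_patches)
    src_yx = torch.stack((idx // grid_w, idx % grid_w), dim=-1).float()      # (N, 2)
    tgt_yx = torch.stack((target // grid_w, target % grid_w), dim=-1).float()
    return tgt_yx - src_yx.unsqueeze(0)   # patch-wise displacement field
```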