Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion
Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber
2024-08-02

Summary
This paper introduces Reenact Anything, a method for transferring motion from a reference video to a target image using a technique called motion-textual inversion. This enables more complex and realistic video edits driven by a specific motion reference.
What's the problem?
Most current video editing techniques focus on changing how things look but struggle with accurately transferring motion. Existing methods that specify motion through text prompts, trajectories, or bounding boxes can only describe simple movements, which limits creativity and realism in video editing. This makes it hard to create dynamic videos that require complex actions or interactions.
What's the solution?
To solve this issue, the authors specify the desired motion with a reference video rather than with text descriptions alone. Their method builds on a pre-trained image-to-video model, which preserves the original appearance and position of the target object or scene and helps disentangle appearance from motion. The motion itself is represented by learned text/image embedding tokens that are optimized to reproduce the movement in the reference video. Once optimized, these tokens can be applied to different target images without requiring spatial alignment, making the approach versatile for tasks such as full-body reenactment, facial reenactment, and even controlling the motion of inanimate objects or the camera.
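To make the idea of optimizing motion tokens against a frozen model more concrete, here is a minimal PyTorch-style sketch. The model class, tensor shapes, constants, and training loop are illustrative stand-ins (the paper builds on a pre-trained image-to-video diffusion model whose exact interface is not reproduced here); only the overall pattern of optimizing per-frame embedding tokens to reconstruct the reference video matches the description above.

```python
import torch
import torch.nn.functional as F

# Illustrative constants; the actual frame count, tokens per frame, and
# embedding width depend on the underlying image-to-video model.
NUM_FRAMES, TOKENS_PER_FRAME, EMBED_DIM = 14, 8, 1024

class FrozenImageToVideoModel(torch.nn.Module):
    """Stand-in for a frozen, pre-trained image-to-video diffusion model.
    The real model denoises video latents conditioned on the (latent) input
    image and a cross-attention embedding; this is only a shape-preserving stub."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(EMBED_DIM, 1)
    def forward(self, noisy_latents, image_latent, cond_embedding, t):
        # A real U-Net/transformer would cross-attend to cond_embedding; this
        # stub just mixes in a scalar so gradients can flow to the embedding.
        return noisy_latents + self.proj(cond_embedding).mean()

def motion_textual_inversion(model, ref_latents, first_frame_latent, steps=1000):
    """Optimize an 'inflated' motion-text embedding (multiple tokens per frame)
    so the frozen model reconstructs the motion reference video."""
    motion_embedding = torch.nn.Parameter(
        0.02 * torch.randn(NUM_FRAMES * TOKENS_PER_FRAME, EMBED_DIM))
    optimizer = torch.optim.AdamW([motion_embedding], lr=1e-3)
    for _ in range(steps):
        t = torch.randint(0, 1000, (1,))
        noise = torch.randn_like(ref_latents)
        noisy_latents = ref_latents + noise           # simplified forward-noising step
        pred = model(noisy_latents, first_frame_latent, motion_embedding, t)
        loss = F.mse_loss(pred, noise)                # standard noise-prediction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                              # only the embedding is updated
    return motion_embedding.detach()
```

Only the motion embedding is passed to the optimizer, so the pre-trained model stays frozen while the embedding absorbs the motion information.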
Why it matters?
This research is significant because it enhances the capabilities of video editing tools, allowing creators to produce more engaging and lifelike content. By enabling complex motion transfers with high accuracy, this method could revolutionize fields like film production, animation, and virtual reality, providing artists and developers with powerful new ways to tell stories visually.
Abstract
Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.
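For context, the following hedged sketch shows how the optimized embedding might be reused at inference time: the same frozen image-to-video model is run on a new target image, with the motion embedding injected via cross-attention in place of an ordinary text/image prompt embedding. The sampler interface, function names, and tensor shapes are assumptions for illustration, not the authors' actual API.

```python
import torch

def transfer_motion(model, sampler, target_image_latent, motion_embedding,
                    num_frames=14, latent_shape=(4, 64, 64)):
    """Hypothetical inference loop: generate a video for a new target image,
    conditioning on a previously optimized motion embedding."""
    latents = torch.randn(num_frames, *latent_shape)    # start from Gaussian noise
    for t in sampler.timesteps:                         # assumed diffusion-sampler interface
        noise_pred = model(latents, target_image_latent, motion_embedding, t)
        latents = sampler.step(noise_pred, t, latents)  # standard denoising update (assumed signature)
    return latents                                      # decode to RGB frames with the model's VAE
```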