MotionClone: Training-Free Motion Cloning for Controllable Video Generation
Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin
2024-06-13

Summary
This paper introduces MotionClone, a training-free method for generating videos that mimic the motion of a reference video. It lets users generate videos from text descriptions while cloning the motion patterns of the reference video, without any additional training or fine-tuning.
What's the problem?
Existing methods for motion-controllable video generation typically require training dedicated motion encoders or fine-tuning video diffusion models, which is time-consuming and often generalizes poorly outside the training domain. The result is degraded motion quality and limited flexibility when creating new video content.
What's the solution?
MotionClone offers a training-free approach that clones motion directly from a reference video. It represents that motion with the temporal attention weights obtained by inverting the reference video, and applies primary temporal-attention guidance that keeps only the dominant attention components while suppressing noisy or overly subtle ones. In addition, a location-aware semantic guidance mechanism uses the coarse foreground location from the reference video, together with the model's classifier-free guidance features, to help the generated video form reasonable spatial layouts and follow the text prompt. This combination gives control over both global camera movement and local object motion.
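To make the guidance concrete, here is a minimal sketch of the primary temporal-attention idea: keep only the largest temporal-attention weights of the inverted reference video and penalize the generated video's attention only at those dominant positions. This is not the authors' code; the tensor shapes, the top-k value, and the random stand-in tensors are illustrative assumptions.

```python
# Hedged sketch of primary temporal-attention guidance, using synthetic
# tensors in place of attention maps from a real video diffusion model.
import torch

def primary_attention_mask(ref_attn: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest temporal-attention weights per query position,
    treating the remaining weights as noisy or overly subtle motion cues."""
    # ref_attn: (heads, spatial_positions, frames, frames) attention weights
    kth_value = ref_attn.topk(k, dim=-1).values[..., -1:]  # k-th largest per row
    return ref_attn >= kth_value

def motion_guidance_loss(gen_attn: torch.Tensor,
                         ref_attn: torch.Tensor,
                         k: int = 1) -> torch.Tensor:
    """Penalize mismatch between generated and reference temporal attention,
    but only at the primary (dominant) components of the reference."""
    mask = primary_attention_mask(ref_attn, k)
    return ((gen_attn - ref_attn)[mask] ** 2).mean()

# Toy usage: random tensors stand in for attention captured during
# reference-video inversion and during the current denoising step.
heads, positions, frames = 8, 256, 16
ref_attn = torch.softmax(torch.randn(heads, positions, frames, frames), dim=-1)
gen_attn = torch.softmax(torch.randn(heads, positions, frames, frames,
                                     requires_grad=True), dim=-1)
loss = motion_guidance_loss(gen_attn, ref_attn, k=1)
loss.backward()  # in a real sampler, this gradient would steer the latent
```

In a full pipeline, the gradient of such a loss with respect to the noisy latent would be added to the denoising update, analogous to classifier guidance, so that the generated video's temporal attention follows the reference's dominant motion.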
Why it matters?
MotionClone is significant because it removes the need to train or fine-tune motion-specific models. Creators can generate high-quality videos that match both a desired motion and a text prompt, without the cost and domain restrictions of earlier approaches. This effective motion control opens up new possibilities for video production, animation, and creative content creation.
Abstract
Motion-based controllable text-to-video generation uses motion cues to control video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that enables motion cloning from a reference video to control text-to-video generation. We employ temporal attention in video inversion to represent the motions in the reference video and introduce primary temporal-attention guidance to mitigate the influence of noisy or very subtle motions within the attention weights. Furthermore, to assist the generation model in synthesizing reasonable spatial relationships and enhance its prompt-following capability, we propose a location-aware semantic guidance mechanism that leverages the coarse location of the foreground from the reference video and original classifier-free guidance features to guide the video generation. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.
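The location-aware semantic guidance can be sketched in a similarly simplified form. The snippet below thresholds a cross-attention map from the reference-video inversion into a coarse foreground mask and uses it to restrict a feature-matching loss on classifier-free-guidance features; the function names, the threshold quantile, and the toy tensors are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of location-aware semantic guidance with toy tensors.
import torch

def coarse_foreground_mask(cross_attn: torch.Tensor,
                           quantile: float = 0.8) -> torch.Tensor:
    """cross_attn: (frames, H, W) attention of spatial positions to the
    foreground text token. Returns a rough binary mask of the attended region."""
    thresh = torch.quantile(cross_attn.flatten(1), quantile, dim=1)
    return (cross_attn >= thresh[:, None, None]).float()

def location_aware_semantic_loss(cfg_feat: torch.Tensor,
                                 ref_feat: torch.Tensor,
                                 fg_mask: torch.Tensor) -> torch.Tensor:
    """Match the generation's classifier-free-guidance features to a reference
    only inside the coarse foreground region."""
    diff = (cfg_feat - ref_feat) ** 2        # (frames, C, H, W)
    masked = diff * fg_mask[:, None]         # broadcast mask over channels
    return masked.sum() / fg_mask.sum().clamp(min=1.0)

# Toy usage: random tensors stand in for attention maps and features.
frames, C, H, W = 16, 4, 32, 32
cross_attn = torch.rand(frames, H, W)
fg_mask = coarse_foreground_mask(cross_attn)
cfg_feat = torch.randn(frames, C, H, W, requires_grad=True)
ref_feat = torch.randn(frames, C, H, W)
loss = location_aware_semantic_loss(cfg_feat, ref_feat, fg_mask)
loss.backward()  # this gradient would nudge the layout toward the reference's foreground
```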