VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li
2024-12-30

Summary
This paper introduces VideoMaker, a method for generating customized videos of a specific subject from reference images, without any additional training for each new subject.
What's the problem?
Existing methods for customized video generation rely on additional models to extract features from reference images, and their suboptimal feature extraction and injection often fail to keep the subject's appearance consistent across the generated video. These extra models also add training and computational overhead, making the pipelines heavier and more complicated.
What's the solution?
To solve this problem, the authors introduce VideoMaker, which relies on the inherent capabilities of the Video Diffusion Model (VDM) itself to both extract and inject subject features. Instead of attaching extra models, VideoMaker feeds the reference images directly into the VDM, letting its own layers extract fine-grained subject features that are already aligned with its pre-trained knowledge. It then injects these features through the VDM's spatial self-attention, setting up a bidirectional interaction between the subject features and the generated content (sketched below), so the subject's appearance is preserved while the generated video remains diverse. This approach simplifies the pipeline and improves the quality of the generated videos.
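To make the "feed the reference image into the VDM itself" idea concrete, here is a minimal, hypothetical sketch of one denoising step. It assumes the clean reference latent (from the VDM's own VAE) can be appended along the frame axis of the noisy video latents so the same pretrained network processes both; `denoising_step`, `denoiser`, and the tensor shapes are illustrative stand-ins, not the paper's exact interfaces.

```python
import torch

def denoising_step(denoiser, noisy_video, reference_latent, timestep, text_emb):
    # noisy_video:      (B, F, C, H, W) noisy latents for the F frames being generated
    # reference_latent: (B, 1, C, H, W) clean latent of the reference image
    # Appending the reference along the frame axis lets the VDM's own layers
    # compute fine-grained subject features in the same forward pass,
    # so no separate image encoder is needed.
    joint = torch.cat([noisy_video, reference_latent], dim=1)   # (B, F+1, C, H, W)
    noise_pred = denoiser(joint, timestep, text_emb)            # same pretrained weights
    return noise_pred[:, :noisy_video.shape[1]]                 # keep only the video-frame predictions

# Toy usage with a stand-in denoiser that simply echoes its input latents.
dummy = lambda x, t, c: x
video = torch.randn(1, 8, 4, 32, 32)
ref = torch.randn(1, 1, 4, 32, 32)
out = denoising_step(dummy, video, ref, timestep=torch.tensor([10]), text_emb=None)
print(out.shape)  # torch.Size([1, 8, 4, 32, 32])
```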
Why it matters?
This research is important because it makes it easier and more efficient to create customized videos of different subjects without needing extensive training or additional models. By improving how AI can generate personalized video content, VideoMaker has potential applications in fields like entertainment, advertising, and education, where tailored visual content is increasingly in demand.
Abstract
Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also significantly aligns with VDM's pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring that VDM has better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.
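The bidirectional interaction through spatial self-attention described in the abstract can be pictured as follows: within a spatial attention layer, the tokens of a generated frame and the tokens of the reference image are concatenated into one sequence, so attention flows in both directions. The sketch below, assuming standard multi-head self-attention, uses illustrative names (`BidirectionalSpatialAttention`, `hidden_dim`, `num_heads`) and is not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class BidirectionalSpatialAttention(nn.Module):
    def __init__(self, hidden_dim: int = 320, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor, ref_tokens: torch.Tensor):
        # frame_tokens: (B, N_frame, C) spatial tokens of one generated frame
        # ref_tokens:   (B, N_ref, C)   spatial tokens of the reference image
        # Concatenating along the token axis lets attention flow both ways:
        # generated tokens read subject detail from the reference, and the
        # reference tokens are in turn updated by the generated content.
        joint = torch.cat([frame_tokens, ref_tokens], dim=1)   # (B, N_frame + N_ref, C)
        out, _ = self.attn(joint, joint, joint)                # full self-attention over both
        n_frame = frame_tokens.shape[1]
        return out[:, :n_frame], out[:, n_frame:]              # split back into frame / reference

# Usage example with random tensors standing in for real VDM activations.
attn = BidirectionalSpatialAttention()
frames = torch.randn(2, 16 * 16, 320)   # 16x16 latent grid
ref = torch.randn(2, 16 * 16, 320)
new_frames, new_ref = attn(frames, ref)
print(new_frames.shape, new_ref.shape)  # torch.Size([2, 256, 320]) torch.Size([2, 256, 320])
```

Because the same pretrained attention weights are reused, the injected subject features stay in the representation space the VDM was trained on, which is the intuition behind the paper's claim of better subject fidelity without sacrificing diversity.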