SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun

2024-12-18

Summary

This paper introduces SUGAR, a new method that lets users create customized videos of a subject from a single image, without needing any prior video footage of that subject.

What's the problem?

Creating personalized videos is challenging because existing methods usually require multiple clips or extensive footage of the subject you want to feature. Traditional approaches also often need detailed instructions and per-subject fine-tuning, which is time-consuming and limits creativity.

What's the solution?

SUGAR solves this problem with a zero-shot approach: it can generate a video for any subject from just an input image and a text description of how the video should look (for example, its style and motion). To train the model effectively, the authors build a large synthetic dataset of 2.5 million image-video-text triplets designed specifically for subject-driven customization. As a result, SUGAR produces high-quality videos that match the user's specifications without any extra training or fine-tuning at the time of use.
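To make the data setup concrete, here is a minimal sketch of what an image-video-text triplet dataset might look like. This is purely illustrative: the `Triplet` class, field names, and file paths are hypothetical and not taken from the paper, which does not publish its data format in this summary.

```python
from dataclasses import dataclass

# Hypothetical triplet structure: each training example pairs a subject
# image with a video of that subject and a caption describing the
# video's style and motion. Names and paths here are illustrative only.
@dataclass
class Triplet:
    subject_image: str  # path to the single subject image
    video: str          # path to the paired synthetic video clip
    caption: str        # text describing visual attributes (style, motion)

def build_dataset(records):
    """Assemble triplets from (image, video, caption) tuples."""
    return [Triplet(image, video, caption) for image, video, caption in records]

# Toy usage with made-up file names:
dataset = build_dataset([
    ("corgi.png", "corgi_running.mp4", "a corgi running, watercolor style"),
    ("mug.png", "mug_rotating.mp4", "a mug rotating slowly, studio lighting"),
])
print(len(dataset))  # 2
```

At training time, a model conditioned on `subject_image` and `caption` would learn to reconstruct `video`, which is what enables zero-shot generation for unseen subjects later.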

Why it matters?

This research is important because it simplifies the process of video creation, making it accessible for anyone who wants to personalize content without extensive resources. By enabling easy customization, SUGAR can be useful in various fields such as marketing, entertainment, and education, allowing for more creative expression and personalized storytelling.

Abstract

We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes, such as style and motion, specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without extra cost at test time. To enable zero-shot capability, we introduce a scalable pipeline to construct a synthetic dataset specifically designed for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. Extensive experiments are conducted. Compared to previous methods, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of our proposed method.