
Still-Moving: Customized Video Generation without Customized Video Data

Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri

2024-07-11


Summary

This paper talks about Still-Moving, a new method for customizing video generation models without needing any customized video data. It builds on existing text-to-image models to create personalized and stylized videos using only still images.

What's the problem?

The main problem is that while there has been a lot of progress in customizing text-to-image (T2I) models, applying these techniques to video generation is still challenging. This is mainly because there isn't enough customized video data available, which makes it hard to create videos that reflect specific styles or subjects.

What's the solution?

To address this issue, the authors introduced Still-Moving, which customizes text-to-video (T2V) models using only still images. They train lightweight modules called Spatial Adapters that adjust the features produced by the customized text-to-image (T2I) layers once those layers are plugged into the T2V model. They also add a Motion Adapter module that makes it possible to train on "frozen videos" (still images repeated across frames) while keeping the motion characteristics of the original video model intact. When generating videos, they remove the Motion Adapter and keep only the Spatial Adapters, so the output videos combine the style and subject from the still images with the motion learned by the video model. A rough code sketch of this wiring follows below.
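To make the adapter arrangement more concrete, here is a minimal PyTorch-style sketch. It is only an illustration under assumed names and shapes (Adapter, ToyT2VBlock, use_motion_adapter, a 128-dimensional feature size), not the authors' implementation, which operates inside a T2V model built on a T2I backbone.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight residual bottleneck, zero-initialized so it starts as an identity map."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)  # assumption: zero-init keeps the base model unchanged at first
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class ToyT2VBlock(nn.Module):
    """Toy stand-in for one inflated block: a spatial (T2I-derived) layer, then a temporal layer."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.spatial_layer = nn.Linear(dim, dim)   # stands in for the injected, customized T2I weights
        self.temporal_layer = nn.Linear(dim, dim)  # stands in for the T2V model's motion layers
        self.spatial_adapter = Adapter(dim)        # trained on frozen videos, kept at test time
        self.motion_adapter = Adapter(dim)         # used only during training, dropped at test time

    def forward(self, x: torch.Tensor, use_motion_adapter: bool) -> torch.Tensor:
        x = self.spatial_adapter(self.spatial_layer(x))
        x = self.temporal_layer(x)
        if use_motion_adapter:  # True while training on static "frozen videos"
            x = self.motion_adapter(x)
        return x                # at test time only the Spatial Adapter stays active

block = ToyT2VBlock()
frames = torch.randn(2, 16, 128)                      # (batch, frames, feature dim)
out_train = block(frames, use_motion_adapter=True)    # training pass on repeated stills
out_test = block(frames, use_motion_adapter=False)    # generation pass with the motion prior restored
```

Zero-initializing the adapters' output layers is a common choice that lets training start from the unmodified model; whether Still-Moving uses this exact adapter design is not specified here.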

Why it matters?

This research is important because it enables easier and more effective customization of video content without needing extensive video data. By allowing users to create personalized and stylized videos from just a few images, Still-Moving can enhance various applications in entertainment, marketing, and social media, making it easier for creators to produce engaging visual content.

Abstract

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.
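For concreteness, here is a toy sketch of how the "frozen videos" mentioned above could be built by repeating still images along a time axis. The function name make_frozen_videos and the tensor shapes are illustrative assumptions rather than details from the paper.

```python
import torch

def make_frozen_videos(still_images: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """Repeat each still image along a new time axis: (B, C, H, W) -> (B, T, C, H, W)."""
    return still_images.unsqueeze(1).expand(-1, num_frames, -1, -1, -1).contiguous()

stills = torch.randn(4, 3, 64, 64)   # stand-ins for samples from a customized T2I model
frozen = make_frozen_videos(stills)
print(frozen.shape)                  # torch.Size([4, 16, 3, 64, 64])
```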