VIMI: Grounding Video Generation through Multi-modal Instruction

Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov

2024-07-10

Summary

This paper introduces VIMI, a new model that generates videos from multiple types of instructions, such as text and images. The goal is to improve video generation by grounding it in rich, multimodal inputs instead of text alone.

What's the problem?

The main problem is that existing text-to-video models are pretrained on text only, because there are no large-scale datasets that pair multimodal prompts with videos. Without visual examples during training, these models lack visual grounding and struggle to generate videos that match complex instructions.

What's the solution?

To solve this, the authors built a large-scale multimodal prompt dataset by using retrieval to pair each text prompt with relevant in-context visual examples. They then used a two-stage training process: first, they pretrained a multimodal conditional video generation model on this dataset to establish a grounded foundation; then they fine-tuned it on three video generation tasks while incorporating multi-modal instructions. This approach lets VIMI integrate different types of information and produce videos that are rich in context and detail.
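To make the two-stage recipe more concrete, here is a minimal Python sketch of the training flow described above, under the assumption of a diffusion-style video generator. The names (MultimodalVideoModel, MultimodalPrompt, stage1_pretrain, stage2_finetune) are illustrative placeholders, not the authors' actual code or API.

```python
# Hypothetical sketch of VIMI's two-stage training recipe (not the authors' code).
# Stage 1: pretrain a multimodal-conditional video generator on retrieval-augmented prompts.
# Stage 2: finetune the same model on several video generation tasks with multimodal instructions.

from dataclasses import dataclass
from typing import List


@dataclass
class MultimodalPrompt:
    text: str                    # the original text prompt
    retrieved_images: List[str]  # in-context images paired with the prompt by retrieval


class MultimodalVideoModel:
    """Placeholder for a diffusion-based video generator conditioned on text + images."""

    def training_step(self, prompt: MultimodalPrompt, video) -> float:
        # A real system would compute a diffusion denoising loss
        # conditioned on both the text and the retrieved in-context images.
        return 0.0


def stage1_pretrain(model, multimodal_dataset):
    """Establish a grounded foundation model on the retrieval-augmented prompt dataset."""
    for prompt, video in multimodal_dataset:
        loss = model.training_step(prompt, video)
        # ... optimizer update would go here ...


def stage2_finetune(model, task_datasets):
    """Adapt the pretrained model to specific tasks using multi-modal instructions."""
    for task_name, dataset in task_datasets.items():
        for prompt, video in dataset:
            loss = model.training_step(prompt, video)
            # ... per-task optimizer update would go here ...


if __name__ == "__main__":
    model = MultimodalVideoModel()
    stage1_pretrain(model, multimodal_dataset=[])  # stage 1: grounded pretraining
    stage2_finetune(model, task_datasets={})       # stage 2: task-specific finetuning
```

The point of the split is that stage 1 teaches the model to condition on visual context at all, while stage 2 specializes that ability to concrete tasks without retraining from scratch.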

Why it matters?

This research is important because it enhances the capabilities of AI in generating videos that are not only accurate but also tailored to specific instructions. By using multiple input types, VIMI can create more engaging and relevant video content, which has applications in education, entertainment, and many other fields where visual storytelling is essential.

Abstract

Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts and then utilize a two-stage training strategy to enable diverse video generation tasks within the same model. In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing a foundational model for grounded video generation. Secondly, we finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions. This process further refines the model's ability to handle diverse inputs and tasks, ensuring seamless integration of multi-modal information. After this two-stage training process, VIMI demonstrates multimodal understanding capabilities, producing contextually rich and personalized videos grounded in the provided inputs, as shown in Figure 1. Compared to previous visually grounded video generation methods, VIMI can synthesize consistent and temporally coherent videos with large motion while retaining semantic control. Lastly, VIMI also achieves state-of-the-art text-to-video generation results on the UCF101 benchmark.
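The retrieval-based pairing the abstract mentions can be pictured as a nearest-neighbor lookup over a pool of candidate examples. The toy sketch below is an assumption about how such a step could look, not the paper's implementation: embed, cosine, and retrieve_in_context are hypothetical helpers, and a real system would retrieve visual examples with a learned encoder rather than bag-of-words text matching.

```python
# Toy illustration of retrieval-based prompt augmentation (assumption, not the paper's code):
# each text prompt is paired with the most similar in-context examples from a pool,
# here using a trivial bag-of-words embedding and cosine similarity.

import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Very rough stand-in for a real text/image encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_in_context(prompt: str, example_pool: List[str], k: int = 2) -> List[str]:
    """Return the k pool entries most similar to the prompt."""
    q = embed(prompt)
    ranked = sorted(example_pool, key=lambda ex: cosine(q, embed(ex)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    pool = [
        "a corgi running on the beach",
        "a chef slicing vegetables",
        "a dog playing fetch in a park",
    ]
    print(retrieve_in_context("a dog running by the sea", pool))
    # -> the dog-related examples, which would then be attached to the prompt as visual context
```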