Vivid-ZOO: Multi-View Video Generation with Diffusion Model

Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, Bernard Ghanem

2024-06-17

Summary

This paper introduces Vivid-ZOO, a new method for generating multi-view videos with diffusion models. It focuses on creating high-quality videos of a dynamic 3D object seen from multiple viewpoints based on a text description, a setting that earlier diffusion work has left largely unexplored.

What's the problem?

While diffusion models have been successful at generating 2D images and videos, creating multi-view videos (which show an object from different angles) from text prompts is still difficult. One major issue is the lack of large datasets of captioned multi-view videos, which makes it hard for models to learn to generate such videos accurately. Another is the complexity of jointly modeling how an object changes across viewpoints and over time.

What's the solution?

To tackle these challenges, the authors developed a new pipeline that factors the problem of generating multi-view videos into two parts: viewpoint (how the object looks from different angles) and time (how the object moves). This factorization lets them combine and reuse layers from advanced pre-trained multi-view image and 2D video diffusion models, greatly reducing the amount of new training needed. They also introduced alignment modules that reconcile the latent spaces of the reused layers, which would otherwise be incompatible because of the domain gap between 2D video data and multi-view data. Finally, they contributed a new dataset of captioned multi-view videos to support this and future research.
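To make the factorization concrete, here is a minimal PyTorch sketch of how such a denoising block might interleave a frozen spatial layer from a pre-trained multi-view image diffusion model with a frozen temporal layer from a pre-trained 2D video diffusion model, with small alignment modules bridging the two. The names (FactorizedMVVideoBlock, AlignmentModule, mv_layer, temporal_layer) and the tensor layout are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AlignmentModule(nn.Module):
    """Hypothetical alignment layer: a small learnable projection meant to map
    features between the latent spaces of the two pre-trained models."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

class FactorizedMVVideoBlock(nn.Module):
    """One block that factorizes multi-view video generation into a viewpoint
    (spatial) component and a time component. `mv_layer` stands in for a frozen
    layer of a pre-trained multi-view image diffusion model; `temporal_layer`
    for a frozen layer of a pre-trained 2D video diffusion model."""
    def __init__(self, dim, mv_layer, temporal_layer):
        super().__init__()
        self.mv_layer = mv_layer              # models cross-view consistency
        self.temporal_layer = temporal_layer  # models motion / temporal coherence
        self.align_in = AlignmentModule(dim)  # bridges the 2D-video / multi-view domain gap
        self.align_out = AlignmentModule(dim)

    def forward(self, x):
        # x: latent features shaped (batch, views, frames, tokens, dim)
        b, v, f, n, d = x.shape

        # Viewpoint component: process each frame independently, attending across views.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, v * n, d)
        x = self.mv_layer(x)
        x = x.reshape(b, f, v, n, d).permute(0, 2, 1, 3, 4)

        # Time component: process each view independently, attending across frames,
        # with alignment modules reconciling the two models' latent spaces.
        x = x.reshape(b * v, f * n, d)
        x = self.align_in(x)
        x = self.temporal_layer(x)
        x = self.align_out(x)
        return x.reshape(b, v, f, n, d)

# Toy usage with identity stand-ins for the frozen pre-trained layers.
block = FactorizedMVVideoBlock(64, mv_layer=nn.Identity(), temporal_layer=nn.Identity())
latents = torch.randn(1, 4, 8, 16, 64)  # (batch, views, frames, tokens, dim)
out = block(latents)
```

In a real pipeline, the reused spatial and temporal layers would stay frozen and only the new components such as the alignment modules would be trained, which is what keeps the training cost low despite the scarcity of captioned multi-view video data.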

Why it matters?

This research is important because it advances the field of video generation by enabling the creation of realistic multi-view videos from simple text prompts. By addressing the limitations of previous methods, Vivid-ZOO can improve applications in areas like virtual reality, gaming, and education, where seeing objects from multiple perspectives is crucial for understanding and engagement.

Abstract

While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, largely reducing the training cost. We further introduce alignment modules to align the latent spaces of layers from the pre-trained multi-view and the 2D video diffusion models, addressing the reused layers' incompatibility that arises from the domain gap between 2D and multi-view data. In support of this and future research, we further contribute a captioned multi-view video dataset. Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts.