Training-free Long Video Generation with Chain of Diffusion Model Experts

Wenhao Li, Yichao Cao, Xie Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu

2024-08-27

Summary

This paper presents ConFiner, a training-free framework that generates high-quality videos efficiently by splitting video creation into simpler subtasks, each handled by an off-the-shelf specialized diffusion model, plus an extension, ConFiner-Long, for generating long videos.

What's the problem?

Generating high-quality video is a complex task: a single model has to get scene structure, per-frame image detail, and motion right all at once. Because of this, current video diffusion models demand a lot of computing power yet still produce less-than-ideal results, and the problem gets worse as videos get longer.

What's the solution?

The authors developed ConFiner, which decouples video generation into two easier subtasks: controlling the overall structure of the video, and refining its spatial and temporal details. A chain of off-the-shelf specialized diffusion models ("experts") handles these subtasks, each focusing on one part of the process, so no additional training is required. They also introduce coordinated denoising, a technique that merges the strengths of several experts within a single sampling process (a rough sketch of the pipeline follows below). On top of this, the ConFiner-Long framework adds three constraint strategies that keep long videos coherent, allowing up to 600 frames, while ConFiner itself needs only about 10% of the inference cost of comparable models.
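To make the two-stage idea concrete, here is a minimal sketch of such a decoupled pipeline; it is not the authors' code. The expert names, the re-noise strength, the step counts, and the simple alternation between experts are illustrative assumptions.

```python
from typing import Callable, List
import torch

# An "expert" here is any denoiser: (latents, timestep, prompt) -> latents.
Expert = Callable[[torch.Tensor, int, str], torch.Tensor]

def timesteps(n: int, t_max: int = 1000) -> List[int]:
    """Evenly spaced denoising timesteps from t_max down toward 0."""
    return [int(t_max * (1 - i / n)) for i in range(n)]

def confiner_sample(structure_expert: Expert,
                    spatial_expert: Expert,
                    temporal_expert: Expert,
                    prompt: str,
                    shape=(1, 4, 16, 64, 64),  # (batch, channels, frames, H, W)
                    control_steps: int = 10,
                    refine_steps: int = 20) -> torch.Tensor:
    # Stage 1: structure control. A text-to-video expert spends a few
    # coarse denoising steps fixing scene layout and motion.
    x = torch.randn(shape)
    for t in timesteps(control_steps):
        x = structure_expert(x, t, prompt)

    # Stage 2: spatial-temporal refinement. Partially re-noise the
    # structured latents, then let an image (spatial) expert and a video
    # (temporal) expert take turns on the same trajectory -- the
    # "coordinated denoising" idea, sketched further below.
    x = x + 0.3 * torch.randn_like(x)  # assumed re-noise strength
    for i, t in enumerate(timesteps(refine_steps, t_max=300)):
        expert = spatial_expert if i % 2 == 0 else temporal_expert
        x = expert(x, t, prompt)
    return x
```

Because both stages call frozen, pre-trained denoisers, the whole pipeline stays training-free: the decoupling lives entirely in how sampling is orchestrated.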

Why it matters?

This research is important because it makes it easier and cheaper to create high-quality videos, which can benefit industries like filmmaking and gaming. By improving how videos are generated, ConFiner opens up new possibilities for creative projects and enhances the overall quality of visual content.

Abstract

Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results due to the high complexity of the video generation task. In this paper, we propose ConFiner, an efficient, high-quality video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement. It generates high-quality videos with a chain of off-the-shelf diffusion model experts, each responsible for a decoupled subtask. During refinement, we introduce coordinated denoising, which merges multiple diffusion experts' capabilities into a single sampling process. Furthermore, we design the ConFiner-Long framework, which generates long, coherent videos by applying three constraint strategies on top of ConFiner. Experimental results indicate that with only 10% of the inference cost, ConFiner surpasses representative models like LaVie and ModelScope across all objective and subjective metrics, and ConFiner-Long can generate high-quality, coherent videos with up to 600 frames.
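The abstract says coordinated denoising merges multiple experts "into a single sampling" even though the experts are separate pre-trained models. One plausible reading, sketched below under assumptions, is that experts trained with different noise schedules share one sampling loop by translating a common noise level into each expert's own timestep. The scheduler attributes and the alignment rule here are assumptions, not the paper's stated mechanism.

```python
import torch

def aligned_timestep(alphas_cumprod: torch.Tensor, level: float) -> int:
    """Pick the expert-specific timestep whose cumulative alpha
    (signal fraction) is closest to the shared target level."""
    return int(torch.argmin((alphas_cumprod - level).abs()))

def coordinated_denoise(experts, schedules, x, prompt, steps=20):
    """One sampling loop shared by several diffusion experts.

    experts   -- list of callables (x, t, prompt) -> less-noisy x
    schedules -- matching list of per-expert alphas_cumprod tensors
    """
    # Shared signal levels: alpha_bar near 0 means almost pure noise,
    # near 1 means almost clean, so we sweep low -> high.
    levels = torch.linspace(0.05, 0.95, steps)
    for i, level in enumerate(levels):
        k = i % len(experts)                       # round-robin turns
        t = aligned_timestep(schedules[k], level)  # map level to expert's t
        x = experts[k](x, t, prompt)
    return x
```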