
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye

2024-10-08


Summary

This paper introduces VideoGuide, a new method for improving the output of video diffusion models without retraining them, by using a pretrained guiding model to enhance video quality and temporal consistency.

What's the problem?

Creating high-quality videos from text descriptions is difficult, especially when trying to keep motion and actions consistent across frames. Existing methods that aim to improve this consistency often sacrifice image quality or require too much computation, which makes them impractical for real-world applications.

What's the solution?

To solve these issues, the authors developed VideoGuide, which uses a pretrained video diffusion model as a 'teacher' or guide during video generation. Instead of retraining the original model, VideoGuide improves the generated videos by blending the guiding model's denoised samples into its own denoising process during the early inference steps. This significantly enhances the temporal consistency and overall image quality of the produced videos.
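
To make the idea concrete, here is a minimal sketch of what such a guided denoising loop could look like, assuming a generic DDIM-style sampler. The names (`videoguide_sample`, `pred_x0`, `step_from_x0`, `guide_steps`, `interp_weight`) are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
import torch

def videoguide_sample(sampler_model, guide_model, scheduler,
                      latents: torch.Tensor, text_emb: torch.Tensor,
                      guide_steps: int = 10, interp_weight: float = 0.5) -> torch.Tensor:
    """Sketch of guided sampling; all model/scheduler interfaces are placeholders."""
    x = latents
    for i, t in enumerate(scheduler.timesteps):
        # The sampling model predicts noise; convert it to a clean-sample (x0) estimate.
        eps_sample = sampler_model(x, t, text_emb)
        x0_sample = scheduler.pred_x0(x, eps_sample, t)

        # Only the early denoising steps are guided.
        if i < guide_steps:
            eps_guide = guide_model(x, t, text_emb)
            x0_guide = scheduler.pred_x0(x, eps_guide, t)
            # Interpolate the guide's temporally consistent estimate into the
            # sampler's own denoising trajectory.
            x0_sample = (1 - interp_weight) * x0_sample + interp_weight * x0_guide

        # Continue the reverse-diffusion update from the (possibly guided) x0 estimate.
        x = scheduler.step_from_x0(x, x0_sample, t)
    return x
```

Because guidance is applied only in the first few steps, the extra cost amounts to a handful of additional forward passes through the guiding model, which is what keeps the approach cheap compared with retraining or fine-tuning.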

Why it matters?

This research is important because it provides a practical and efficient way to create better videos without the need for extensive retraining. By leveraging existing models and improving their performance, VideoGuide can help content creators and developers produce high-quality videos more quickly and easily, making it valuable for applications in entertainment, education, and more.

Abstract

Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: http://videoguide2025.github.io/