FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention
Yu Lu, Yuanzhi Liang, Linchao Zhu, Yi Yang
2024-07-30

Summary
This paper introduces FreeLong, a method for generating long videos without any additional training. It improves long video quality with a technique called SpectralBlend Temporal Attention, which balances the low- and high-frequency components of video features during denoising.
What's the problem?
Creating high-quality long videos is challenging because most existing video generation models require a lot of computational power and data to train. When short video models are extended to generate longer videos, quality often suffers, leading to issues like blurry frames or inconsistent motion.
What's the solution?
To solve this problem, the authors developed FreeLong, which adapts an already-trained short video model for long video generation. They found that the quality drop mainly comes from distortion of high-frequency components when the video is extended: spatial high frequencies weaken while temporal high frequencies grow. FreeLong addresses this by blending the low-frequency components of global video features (which capture the overall content of the whole sequence) with the high-frequency components of local video features (which carry finer details within short windows of frames) during the denoising process. This approach helps maintain both clarity and consistency in the generated long videos.
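The core frequency-blending idea can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the paper's implementation: the tensor layout (batch, frames, channels), the `cutoff_ratio` value, and the function name `spectral_blend` are assumptions for the example; only the general recipe (temporal FFT, low-pass on global features, high-pass on local features, inverse FFT) follows the description above.

```python
import torch
import torch.fft as fft

def spectral_blend(global_feat, local_feat, cutoff_ratio=0.25):
    """Fuse low-frequency global features with high-frequency local features.

    Both inputs are assumed to have shape (batch, frames, channels): one from
    temporal attention over the full sequence (global), one from attention over
    short local windows (local). `cutoff_ratio` is an illustrative knob, not a
    value taken from the paper.
    """
    num_frames = global_feat.shape[1]

    # Move both feature maps into the temporal frequency domain.
    global_freq = fft.fft(global_feat, dim=1)
    local_freq = fft.fft(local_feat, dim=1)

    # Low-pass mask over temporal frequencies (|f| below the cutoff).
    freqs = fft.fftfreq(num_frames, device=global_feat.device)
    low_pass = (freqs.abs() <= cutoff_ratio * 0.5).view(1, num_frames, 1)
    low_pass = low_pass.to(global_feat.dtype)

    # Keep low frequencies from the global branch, high frequencies from the local branch.
    blended_freq = global_freq * low_pass + local_freq * (1.0 - low_pass)

    # Back to the temporal domain; features are real-valued, so take the real part.
    return fft.ifft(blended_freq, dim=1).real
```

Because the blend happens on intermediate attention features during denoising rather than on pixels, the pretrained short-video model's weights stay untouched, which is what makes the approach training-free.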
Why it matters?
This research is significant because it allows for the creation of high-quality long videos without needing extensive resources for training. By making it easier and cheaper to produce long videos, FreeLong can benefit various fields such as filmmaking, gaming, and education, where high-quality video content is essential.
Abstract
Video diffusion models have made substantial progress in various video generation applications. However, training models for long video generation tasks requires significant computational and data resources, posing a challenge to developing long video diffusion models. This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model (e.g. pre-trained on 16-frame videos) for consistent long video generation (e.g. 128 frames). Our preliminary observations found that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation. Further investigation reveals that this degradation is primarily due to the distortion of high-frequency components in long videos, characterized by a decrease in spatial high-frequency components and an increase in temporal high-frequency components. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process. FreeLong blends the low-frequency components of global video features, which encapsulate the entire video sequence, with the high-frequency components of local video features that focus on shorter subsequences of frames. This approach maintains global consistency while incorporating diverse and high-quality spatiotemporal details from local videos, enhancing both the consistency and fidelity of long video generation. We evaluated FreeLong on multiple base video diffusion models and observed significant improvements. Additionally, our method supports coherent multi-prompt generation, ensuring both visual coherence and seamless transitions between scenes.
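The abstract's diagnosis, that naive extension lowers spatial high-frequency energy while raising temporal high-frequency energy, can be checked with a simple spectral measurement. The sketch below is a hypothetical diagnostic, not the paper's evaluation protocol: the tensor layout (frames, channels, height, width), the cutoff, and the function name are assumptions for illustration.

```python
import torch
import torch.fft as fft

def high_freq_energy_ratio(video, dim, cutoff_ratio=0.25):
    """Fraction of spectral energy above a cutoff along one axis of a video.

    `video` is assumed to be shaped (frames, channels, height, width); pass
    dim=0 for the temporal axis or dim=2/3 for the spatial axes. The cutoff
    and layout are illustrative, not the paper's exact measurement.
    """
    spectrum = fft.fft(video, dim=dim)
    energy = spectrum.abs() ** 2

    # Mark frequency bins above the cutoff along the chosen axis.
    freqs = fft.fftfreq(video.shape[dim], device=video.device)
    high = freqs.abs() > cutoff_ratio * 0.5

    # Reshape so the mask broadcasts along `dim`.
    shape = [1] * video.dim()
    shape[dim] = video.shape[dim]
    high = high.view(shape)

    return (energy * high).sum() / energy.sum()
```

Comparing these ratios between 16-frame outputs and directly-extended 128-frame outputs would, under the paper's claim, show the spatial ratio dropping and the temporal ratio rising.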