FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, Kwan-Yee K. Wong
2024-10-28

Summary
This paper introduces FasterCache, a new method to speed up video generation from diffusion models without needing additional training, while maintaining high video quality.
What's the problem?
Video diffusion models can create high-quality videos, but generating them is slow and computationally expensive. Existing cache-based acceleration methods that reuse features from previous denoising steps often lower video quality because they discard the subtle step-to-step variations those features carry. A more efficient sampling strategy is needed that speeds up generation while preserving quality.
What's the solution?
The authors propose FasterCache, which speeds up sampling with a smarter feature reuse strategy. Instead of directly copying features from adjacent denoising steps (which degrades quality), it reuses them dynamically in a way that preserves their step-to-step distinctions and the video's temporal continuity. On top of this, a component called CFG-Cache exploits the redundancy between the conditional and unconditional outputs of classifier-free guidance (CFG) within the same timestep to avoid further computation. The results show that FasterCache generates videos up to 1.67 times faster than the unaccelerated baseline (e.g., on Vchitect-2.0) while keeping quality comparable, and it outperforms existing acceleration methods in both speed and quality.
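To make the caching idea concrete, here is a minimal toy sketch of feature reuse inside a diffusion sampling loop. It is an illustration of the general principle only, not the authors' exact algorithm: the names `attention_block`, `reuse_interval`, and the extrapolation formula are assumptions made for demonstration.

```python
# Toy sketch: cache-based feature reuse in a diffusion sampling loop.
# Hypothetical names and update rules; not the paper's exact formulation.
import torch

torch.manual_seed(0)

def attention_block(x, t):
    """Stand-in for an expensive transformer block inside the denoiser."""
    return torch.tanh(x + 0.01 * t)

def sample(x, timesteps, reuse_interval=2):
    cache = []  # holds the two most recently computed features
    for i, t in enumerate(timesteps):
        if i % reuse_interval == 0 or len(cache) < 2:
            feat = attention_block(x, t)  # full computation at this step
        else:
            # Naive caching would copy cache[-1] verbatim and lose subtle
            # step-to-step variation; a "dynamic" reuse instead extrapolates
            # from the trend of the last two computed features.
            feat = cache[-1] + 0.5 * (cache[-1] - cache[-2])
        cache = (cache + [feat])[-2:]
        x = x - 0.1 * feat  # toy denoising update
    return x

x = torch.randn(4, 8)
print(sample(x, timesteps=list(range(50, 0, -1))).shape)
```

The point of the sketch is the branch structure: expensive blocks are recomputed only on some steps, and the cheap path tries to preserve the small differences between steps rather than freezing the feature.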
Why it matters?
This research is significant because it demonstrates how to make video generation faster and more efficient without sacrificing quality. By improving the speed of video generation, FasterCache could benefit various applications in entertainment, education, and content creation, making it easier to produce high-quality videos quickly.
Abstract
In this paper, we present FasterCache, a novel training-free strategy designed to accelerate the inference of video diffusion models with high-quality generation. By analyzing existing cache-based methods, we observe that directly reusing adjacent-step features degrades video quality due to the loss of subtle variations. We further perform a pioneering investigation of the acceleration potential of classifier-free guidance (CFG) and reveal significant redundancy between conditional and unconditional features within the same timestep. Capitalizing on these observations, we introduce FasterCache to substantially accelerate diffusion-based video generation. Our key contributions include a dynamic feature reuse strategy that preserves both feature distinction and temporal continuity, and CFG-Cache which optimizes the reuse of conditional and unconditional outputs to further enhance inference speed without compromising video quality. We empirically evaluate FasterCache on recent video diffusion models. Experimental results show that FasterCache can significantly accelerate video generation (e.g., 1.67× speedup on Vchitect-2.0) while keeping video quality comparable to the baseline, and consistently outperform existing methods in both inference speed and video quality.
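The CFG redundancy mentioned in the abstract can also be illustrated with a short hedged sketch. Classifier-free guidance normally runs the denoiser twice per step (conditional and unconditional); because the two outputs are highly similar within a timestep, the unconditional pass can sometimes be approximated from a cached conditional/unconditional residual. The names `denoiser`, `cached_residual`, and the refresh schedule below are illustrative assumptions, not the paper's exact CFG-Cache design.

```python
# Hedged sketch of exploiting conditional/unconditional redundancy in CFG.
# Hypothetical helper names and refresh schedule; not the paper's CFG-Cache.
import torch

def denoiser(x, t, cond):
    """Toy stand-in for one forward pass of the video diffusion model."""
    bias = 0.1 if cond else 0.0
    return torch.tanh(x + 0.01 * t + bias)

def cfg_step(x, t, guidance_scale, cached_residual, refresh):
    eps_cond = denoiser(x, t, cond=True)            # always computed
    if refresh or cached_residual is None:
        eps_uncond = denoiser(x, t, cond=False)     # second full pass
        cached_residual = eps_cond - eps_uncond     # store the gap
    else:
        eps_uncond = eps_cond - cached_residual     # reuse gap, skip a pass
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    return eps, cached_residual

x = torch.randn(4, 8)
residual = None
for i, t in enumerate(range(50, 0, -1)):
    eps, residual = cfg_step(x, t, 7.5, residual, refresh=(i % 5 == 0))
    x = x - 0.05 * eps
print(x.shape)
```

On steps where the residual is reused, only one forward pass is needed instead of two, which is where the additional speedup beyond feature caching comes from.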