WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, Li Yuan
2024-12-03

Summary
This paper introduces WF-VAE, a video variational autoencoder (VAE) that uses wavelet transforms to make encoding videos into a latent space faster and more memory-efficient for video generation and processing.
What's the problem?
As videos grow longer and higher in resolution, the traditional methods for encoding them into a simpler form (called a latent space) become slow and resource-intensive, making it expensive to train models that generate or analyze these videos. In addition, the block-wise (chunked) inference that existing methods use for long videos can introduce gaps or inconsistencies between chunks in the latent representation.
What's the solution?
WF-VAE addresses these issues with the wavelet transform, a technique that decomposes a video into different frequency components. The model routes the important low-frequency information, which carries most of the visual energy, directly into the latent representation, reducing the overall computational load. The researchers also introduce Causal Cache, a mechanism that keeps the latent space consistent when a long video is processed in chunks. Together, these changes yield a model that is faster and uses less memory while still producing high-quality video reconstructions.
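To make the frequency decomposition concrete, here is a minimal sketch of a one-level 3D Haar wavelet transform on a video tensor in PyTorch. This illustrates the general technique rather than the authors' implementation: the function name haar_dwt3d and the subband naming are our own, and WF-VAE applies the transform at multiple levels.

```python
import torch

def haar_dwt3d(video):
    """One level of a 3D Haar wavelet transform (illustrative sketch).

    video: (B, C, T, H, W) with even T, H, W.
    Returns 8 subbands of shape (B, C, T/2, H/2, W/2); 'lll' is the
    low-frequency component that carries most of the signal energy.
    """
    s = 2 ** -0.5  # orthonormal Haar scaling per 1D split

    # Split along time: low = average, high = difference of frame pairs.
    lo_t = (video[:, :, 0::2] + video[:, :, 1::2]) * s
    hi_t = (video[:, :, 0::2] - video[:, :, 1::2]) * s

    def split_hw(x):
        # Same low/high split along height, then along width.
        lo_h = (x[:, :, :, 0::2] + x[:, :, :, 1::2]) * s
        hi_h = (x[:, :, :, 0::2] - x[:, :, :, 1::2]) * s
        out = {}
        for name, y in (("l", lo_h), ("h", hi_h)):
            out[name + "l"] = (y[..., 0::2] + y[..., 1::2]) * s
            out[name + "h"] = (y[..., 0::2] - y[..., 1::2]) * s
        return out

    sub = {}
    for tname, x in (("l", lo_t), ("h", hi_t)):
        for k, v in split_hw(x).items():
            sub[tname + k] = v
    return sub  # keys: lll, llh, lhl, lhh, hll, hlh, hhl, hhh

subbands = haar_dwt3d(torch.randn(1, 3, 8, 64, 64))
print(subbands["lll"].shape)  # torch.Size([1, 3, 4, 32, 32])
```

For an 8-frame 64x64 RGB clip, each of the eight subbands has shape (1, 3, 4, 32, 32); the 'lll' subband is the low-frequency component whose energy WF-VAE channels into the latent representation, while the high-frequency subbands are much cheaper to handle.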
Why it matters?
This research is significant because it makes working with high-quality video data much more efficient, which is essential for applications like video editing, gaming, and machine learning. By lowering the cost of encoding and generating videos, WF-VAE can enable better tools for analyzing and creating content, making advanced video technology more accessible.
Abstract
Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space, becoming a key component of most Latent Video Diffusion Models (LVDMs) to reduce model training costs. However, as the resolution and duration of generated videos increase, the encoding cost of Video VAEs becomes a limiting bottleneck in training LVDMs. Moreover, the block-wise inference method adopted by most LVDMs can lead to discontinuities in the latent space when processing long-duration videos. The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information. Since wavelet transform can decompose videos into multiple frequency-domain components and significantly improve efficiency, we propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transform to facilitate low-frequency energy flow into the latent representation. Furthermore, we introduce a method called Causal Cache, which maintains the integrity of the latent space during block-wise inference. Compared to state-of-the-art video VAEs, WF-VAE demonstrates superior performance in both PSNR and LPIPS metrics, achieving 2x higher throughput and 4x lower memory consumption while maintaining competitive reconstruction quality. Our code and models are available at https://github.com/PKU-YuanGroup/WF-VAE.
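As a rough illustration of the block-wise inference problem that Causal Cache targets, the sketch below shows a temporally causal 3D convolution that caches the trailing frames of each chunk and prepends them to the next one, so chunked inference sees the same temporal context as a single full-video pass. This is a hypothetical reconstruction of the idea, not the released WF-VAE code; the class and attribute names are invented for the example.

```python
import torch
import torch.nn as nn

class CachedCausalConv3d(nn.Module):
    """Temporal-causal 3D conv with a cache for chunked inference (sketch)."""

    def __init__(self, in_ch, out_ch, kernel_t=3, kernel_s=3):
        super().__init__()
        self.pad_t = kernel_t - 1  # causal padding length along time
        self.conv = nn.Conv3d(
            in_ch, out_ch, (kernel_t, kernel_s, kernel_s),
            padding=(0, kernel_s // 2, kernel_s // 2),  # no temporal padding
        )
        self.cache = None  # trailing frames from the previous chunk

    def forward(self, x):  # x: (B, C, T, H, W)
        if self.cache is None:
            # First chunk: replicate the first frame as causal padding.
            pad = x[:, :, :1].repeat(1, 1, self.pad_t, 1, 1)
        else:
            pad = self.cache
        x = torch.cat([pad, x], dim=2)
        # Keep the last pad_t frames so the next chunk continues seamlessly.
        self.cache = x[:, :, -self.pad_t:].detach()
        return self.conv(x)  # output has the same temporal length as input
```

With this layer, feeding a long video chunk by chunk produces the same output as one full pass, which is the continuity property the paper's Causal Cache provides across the whole network during block-wise inference.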