VidTwin: Video VAE with Decoupled Structure and Dynamics
Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian
2024-12-26

Summary
This paper introduces VidTwin, a new video autoencoder that separates video data into two complementary parts to improve the efficiency and quality of video generation.
What's the problem?
Generating videos with traditional video autoencoders often demands substantial computational power, because these models can contain millions or even billions of parameters. This makes the process slow and resource-intensive, especially when generating high-quality videos.
What's the solution?
The authors introduce VidTwin, which decouples video data into two types of latent information: Structure latent vectors, which capture the overall content and global movement of the video, and Dynamics latent vectors, which capture fine-grained details and rapid motion. Built on an Encoder-Decoder backbone with two dedicated submodules for extracting these latents, VidTwin processes videos efficiently with only 45 million parameters. The result is high-quality video reconstruction at a compression rate of just 0.20%, meaning the amount of data needed to represent a video is reduced dramatically while quality is preserved.
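To make the two-branch design concrete, here is a minimal, illustrative sketch of how the Structure and Dynamics latents could be extracted in PyTorch. The module names, tensor shapes, and hyperparameters (StructureBranch, DynamicsBranch, num_queries, down_factor) are assumptions chosen for readability, not the authors' implementation; the released code at https://github.com/microsoft/VidTok/tree/main/vidtwin is the authoritative version.

```python
# Illustrative sketch only: module names and shapes are assumptions,
# not the released VidTwin implementation.
import torch
import torch.nn as nn

class StructureBranch(nn.Module):
    """Extracts Structure latents: Q-Former-style cross-attention over the
    encoder tokens, followed by downsampling to drop redundant detail."""
    def __init__(self, dim=512, num_queries=16, down_factor=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.down = nn.Linear(dim, dim // down_factor)  # stand-in for downsampling blocks

    def forward(self, tokens):                            # tokens: (B, T*H*W, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(q, tokens, tokens)  # low-frequency content and motion trends
        return self.down(attended)                        # (B, num_queries, dim // down_factor)

class DynamicsBranch(nn.Module):
    """Extracts Dynamics latents by averaging encoder features over space,
    keeping the temporal axis that carries rapid motion."""
    def __init__(self, dim=512, dyn_dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dyn_dim)

    def forward(self, feat):                              # feat: (B, T, H, W, dim)
        pooled = feat.mean(dim=(2, 3))                    # average over spatial dims -> (B, T, dim)
        return self.proj(pooled)                          # (B, T, dyn_dim)

# Example usage with random encoder features (shapes are arbitrary):
# tokens = torch.randn(2, 8 * 14 * 14, 512)   # (B, T*H*W, dim)
# feat   = torch.randn(2, 8, 14, 14, 512)     # (B, T, H, W, dim)
# z_struct = StructureBranch()(tokens)        # (2, 16, 128)
# z_dyn    = DynamicsBranch()(feat)           # (2, 8, 64)
```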
Why it matters?
This research is important because it makes video generation more efficient and accessible. By reducing the computational load while still producing high-quality results, VidTwin can help improve applications in various fields such as animation, gaming, and content creation, allowing for faster processing and better use of resources.
Abstract
Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Our code has been released at https://github.com/microsoft/VidTok/tree/main/vidtwin.
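For intuition about the reported 0.20% compression rate, the sketch below computes the ratio of latent elements to raw pixel elements. The shapes used are hypothetical placeholders chosen only to illustrate the calculation; they are not the dimensions used by VidTwin.

```python
from math import prod

# Back-of-the-envelope illustration of what a compression rate such as 0.20%
# means: the ratio of latent elements to raw video elements. All shapes below
# are hypothetical placeholders, not the dimensions reported in the paper.
def compression_rate(video_shape, latent_shapes):
    """video_shape: (T, C, H, W); latent_shapes: iterable of latent tensor shapes."""
    video_elems = prod(video_shape)
    latent_elems = sum(prod(s) for s in latent_shapes)
    return latent_elems / video_elems

# Made-up example: a 16-frame 224x224 RGB clip, a Structure latent of shape
# (4, 16, 16, 8) and a Dynamics latent of shape (16, 16).
rate = compression_rate((16, 3, 224, 224), [(4, 16, 16, 8), (16, 16)])
print(f"{rate:.2%}")  # about 0.35% for these made-up shapes
```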