Large Motion Video Autoencoding with Cross-modal Video VAE

Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen

2024-12-24

Summary

This paper introduces a new video autoencoder that improves how videos are compressed and reconstructed, addressing motion blur and detail loss while also using text information to enhance video quality.

What's the problem?

Compressing a video by applying image techniques to each frame on its own leads to problems like flickering between frames and wasted storage, because the redundancy across frames is never exploited. Existing video autoencoders have started to compress across time as well, but they often still struggle to faithfully reconstruct high-quality videos, especially ones with large motion.

What's the solution?

The authors propose a novel video autoencoder built on a few key strategies. First, they separate spatial compression (how things look) from temporal compression (how things move), so that squeezing both at once does not introduce blur. They also add a lightweight model dedicated to motion, which further compresses the video along the time axis. Additionally, they use the captions that come with text-to-video datasets to guide the encoding process, which helps preserve details and keep the video stable over time. Finally, they train the model on both images and videos together, making it versatile enough to handle both types of data effectively.
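To make this concrete, below is a minimal PyTorch sketch of what such a decoupled design could look like. Everything here is an illustrative assumption based on the summary, not the paper's actual code: the module names, the choice of 3D convolutions, the 4x spatial and 2x temporal compression ratios, and the use of cross-attention for the text guidance.

```python
import torch
import torch.nn as nn

class TemporalAwareSpatialEncoder(nn.Module):
    """Spatial compression that can still look at neighboring frames,
    e.g. 3D convolutions that stride only over height and width."""
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        # stride (1, 2, 2): downsample space, keep the time axis intact,
        # while the 3D kernel mixes information across adjacent frames
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):          # video: (B, C, T, H, W)
        return self.net(video)         # -> (B, latent_ch, T, H/4, W/4)


class MotionCompressor(nn.Module):
    """Lightweight temporal compression applied after spatial encoding."""
    def __init__(self, ch=8):
        super().__init__()
        # stride (2, 1, 1): halve the number of latent frames
        self.net = nn.Conv3d(ch, ch, kernel_size=(3, 1, 1),
                             stride=(2, 1, 1), padding=(1, 0, 0))

    def forward(self, z):              # z: (B, C, T, h, w)
        return self.net(z)             # -> (B, C, T/2, h, w)


class TextGuidance(nn.Module):
    """One hypothetical way to inject text guidance: cross-attention
    from latent tokens to caption embeddings."""
    def __init__(self, ch=8, text_dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, z, text_emb):    # z: (B, C, T, h, w); text_emb: (B, L, text_dim)
        b, c, t, h, w = z.shape
        tokens = z.flatten(2).transpose(1, 2)      # (B, T*h*w, C)
        out, _ = self.attn(tokens, text_emb, text_emb)
        return (tokens + out).transpose(1, 2).reshape(b, c, t, h, w)
```

The point of the sketch is that spatial downsampling (stride only over height and width) and temporal downsampling (stride only over time) live in separate modules, so motion is never squeezed through the same layers that compress appearance.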

Why it matters?

This research is important because it enhances the ability of AI to create and compress high-quality videos more efficiently. By improving video encoding techniques, this work can lead to better streaming services, video games, and other applications where high-quality video playback is essential.

Abstract

Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at https://yzxing87.github.io/vae/.
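The joint image/video training mentioned in the abstract is easiest to picture if an image is treated as a one-frame video, so the same 3D model processes both. The sketch below makes that assumption (which is ours, not necessarily the paper's mechanism) and reduces the objective to a plain reconstruction loss; an actual VAE would also include a KL term and typically perceptual or adversarial losses.

```python
import torch
import torch.nn.functional as F

def as_video(batch):
    # Images arrive as (B, C, H, W); give them a singleton time axis so
    # they flow through the same 3D autoencoder as (B, C, 1, H, W) clips.
    return batch.unsqueeze(2) if batch.dim() == 4 else batch

def joint_training_step(model, optimizer, image_batch, video_batch):
    """One optimization step over a mixed image + video batch.
    `model` is any autoencoder mapping (B, C, T, H, W) to itself."""
    optimizer.zero_grad()
    loss = 0.0
    for batch in (as_video(image_batch), video_batch):
        recon = model(batch)                   # encode + decode
        loss = loss + F.mse_loss(recon, batch) # simplified objective
    loss.backward()
    optimizer.step()
    return loss.item()
```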