
VEnhancer: Generative Space-Time Enhancement for Video Generation

Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, Ziwei Liu

2024-07-11

Summary

This paper introduces VEnhancer, a system designed to improve the quality of videos generated from text. It enhances both the visual details and the smoothness of motion, making the final result look much better.

What's the problem?

The main problem is that many AI-generated videos can be low-quality, with unclear images and choppy movements. This happens because existing methods often struggle to create high-resolution videos that are visually appealing and free of glitches like flickering or blurriness.

What's the solution?

To solve this issue, the authors developed VEnhancer, which is built on a video diffusion model. The model takes a low-quality video and improves its resolution in both space (how clear the images are) and time (how smooth the motion appears), at arbitrary upscaling factors. On top of the pretrained diffusion model, they trained a video ControlNet, a trainable control branch that conditions the model on the input video's frame rate and resolution. They also designed a space-time data augmentation strategy that randomly degrades training videos in both resolution and frame rate, which keeps training stable and teaches the model to handle many different enhancement settings, as sketched in the code below.
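The paper's code is not reproduced here, but the degradation idea can be sketched. Below is a minimal, hypothetical PyTorch sketch (the function name, scale sets, and tensor layout are assumptions, not the authors' implementation): a clean high-frame-rate clip is randomly downsampled in time (by dropping frames) and in space (by bilinear resizing) to produce the low-quality conditioning input, and the sampled scales are returned so they can also be fed to the model as conditions, in the spirit of the paper's video-aware conditioning.

```python
# Hypothetical sketch of space-time data augmentation; not the authors' code.
import random
import torch
import torch.nn.functional as F

def space_time_degrade(video, space_scales=(2, 4, 8), time_scales=(2, 4)):
    """video: (T, C, H, W) float tensor in [0, 1]; returns (degraded, s, t)."""
    s = random.choice(space_scales)   # spatial down-sampling factor
    t = random.choice(time_scales)    # temporal down-sampling factor

    low_fps = video[::t]              # drop frames -> lower frame rate
    _, _, H, W = low_fps.shape
    low_res = F.interpolate(low_fps, size=(H // s, W // s),
                            mode="bilinear", align_corners=False)
    return low_res, s, t              # degraded clip plus the sampled scales
```

Returning the sampled scales matters because VEnhancer conditions its control branch on the input's frame rate and resolution, so the degradation parameters double as conditioning signals during training.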

Why it matters?

This research is important because it significantly improves how AI-generated videos look and feel. By enhancing video quality, VEnhancer can make AI tools more useful for creators in fields like film, gaming, and online content. It also lifts existing text-to-video systems: with VEnhancer, VideoCrafter-2 reaches first place on the VBench video generation benchmark.

Abstract

We present VEnhancer, a generative space-time enhancement framework that improves existing text-to-video results by adding more detail in the spatial domain and synthesizing detailed motion in the temporal domain. Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously, with arbitrary up-sampling scales in space and time, through a unified video diffusion model. Furthermore, VEnhancer effectively removes spatial artifacts and temporal flickering in generated videos. To achieve this, building on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low frame-rate and low-resolution videos. To train this video ControlNet effectively, we design space-time data augmentation as well as video-aware conditioning. Benefiting from these designs, VEnhancer is stable during training and admits an elegant end-to-end training scheme. Extensive experiments show that VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos. Moreover, with VEnhancer, the existing open-source state-of-the-art text-to-video method VideoCrafter-2 reaches first place on the video generation benchmark VBench.
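As a rough illustration of the ControlNet injection the abstract describes, here is a minimal, self-contained PyTorch sketch (module names, dimensions, and the stand-in encoder are assumptions, not the paper's architecture): a trainable control branch encodes the low-quality video, and a zero-initialized convolution projects its features into residuals that are added to the frozen base denoiser, so at the start of training the pretrained model's behavior is unchanged.

```python
# Hypothetical sketch of a video ControlNet branch; not the paper's code.
import torch
import torch.nn as nn

class VideoControlNet(nn.Module):
    def __init__(self, in_ch=3, feat_dim=64):
        super().__init__()
        # Stand-in for a trainable copy of the base model's encoder blocks.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, feat_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        # Zero-initialized projection: the control branch contributes nothing
        # at step 0, so training starts from the frozen model's behavior.
        self.zero_conv = nn.Conv3d(feat_dim, feat_dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, lq_video):      # lq_video: (B, C, T, H, W)
        return self.zero_conv(self.encoder(lq_video))

# The returned residual would be added to the frozen denoiser's features:
residual = VideoControlNet()(torch.randn(1, 3, 16, 64, 64))
```

The zero-initialization is the standard ControlNet trick for keeping early training stable, which matches the abstract's claim that VEnhancer trains stably in an end-to-end manner.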