
Video Diffusion Alignment via Reward Gradients

Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, Deepak Pathak

2024-07-13


Summary

This paper presents Video Diffusion Alignment via Reward Gradients (VADER), a method for adapting video generation models to specific downstream tasks. Instead of collecting new labeled data, VADER fine-tunes these models using feedback from pre-trained reward models to improve their performance.

What's the problem?

Video diffusion models, which are used to generate videos, are typically trained on large amounts of unsupervised data. Applying them to specific tasks, however, is difficult and time-consuming, because supervised fine-tuning requires collecting and labeling new video datasets. This makes it challenging to adapt the models effectively for particular uses.

What's the solution?

The authors propose using pre-trained reward models to guide the video diffusion models toward the desired outcomes. These reward models score the generated videos on how well they meet certain criteria, and because the scores carry gradient information with respect to the generated pixels, those gradients can be backpropagated into the video diffusion model. This allows the diffusion model to be adapted without needing much new data, while also reducing the computational effort required.
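A minimal sketch of this reward-gradient fine-tuning loop is given below. It is not the authors' code: `VideoDiffusionModel` and `RewardModel` are hypothetical placeholder modules standing in for the actual pre-trained models, and the sampler is a toy differentiable loop. The point it illustrates is that the reward is differentiable with respect to the generated pixels, so its gradient can flow back through sampling into the diffusion model's weights while the reward model stays frozen.

```python
# Hypothetical sketch of reward-gradient alignment, with placeholder models.
import torch
import torch.nn as nn

class VideoDiffusionModel(nn.Module):           # placeholder for a pre-trained video diffusion model
    def __init__(self):
        super().__init__()
        self.net = nn.Conv3d(3, 3, kernel_size=3, padding=1)
    def sample(self, prompt_emb, steps=4):
        # Toy differentiable "denoising" loop; a real model would run its own sampler here.
        x = torch.randn(1, 3, 8, 32, 32)        # (batch, channels, frames, height, width)
        for _ in range(steps):
            x = x - 0.1 * self.net(x)
        return x

class RewardModel(nn.Module):                   # placeholder for a pre-trained reward model (e.g. an aesthetic scorer)
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(3 * 8 * 32 * 32, 1)
    def forward(self, video):
        return self.head(video.flatten(1)).mean()

diffusion = VideoDiffusionModel()
reward_model = RewardModel()
for p in reward_model.parameters():             # the reward model is frozen; only the diffusion model adapts
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(diffusion.parameters(), lr=1e-4)

for step in range(10):
    prompt_emb = torch.randn(1, 128)            # stand-in for a text-prompt embedding
    video = diffusion.sample(prompt_emb)        # generated RGB frames, kept in the autograd graph
    reward = reward_model(video)                # dense, differentiable score on the generated pixels
    loss = -reward                              # maximizing reward = minimizing its negative
    optimizer.zero_grad()
    loss.backward()                             # gradients flow from the reward back into the diffusion weights
    optimizer.step()
```

Because the reward gradient is computed directly on the generated pixels, each update uses much richer feedback than a single scalar score, which is why this approach needs fewer reward queries than gradient-free alternatives.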

Why it matters?

This research is important because it provides a more efficient way to train video generation models, making them more adaptable for various applications. By improving how these models learn from existing data and feedback, VADER can lead to higher quality video generation in fields like entertainment, education, and virtual reality, ultimately enhancing user experiences.

Abstract

We have made significant progress towards building foundational video diffusion models. As these models are trained using large-scale unsupervised data, it has become crucial to adapt these models to specific downstream tasks. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we utilize pre-trained reward models that are learned via preferences on top of powerful vision discriminative models to adapt video diffusion models. These models contain dense gradient information with respect to generated RGB pixels, which is critical to efficient learning in complex search spaces, such as videos. We show that backpropagating gradients from these reward models to a video diffusion model can allow for compute and sample efficient alignment of the video diffusion model. We show results across a variety of reward models and video diffusion models, demonstrating that our approach can learn much more efficiently in terms of reward queries and computation than prior gradient-free approaches. Our code, model weights, and more visualizations are available at https://vader-vid.github.io.