History-Guided Video Diffusion
Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, Vincent Sitzmann
2025-02-11
Summary
This paper presents History-Guided Video Diffusion, a new method for generating high-quality videos with AI. It introduces an architecture called the Diffusion Forcing Transformer (DFoT) that can condition on a variable number of past video frames, making the generated videos more consistent and realistic.
What's the problem?
Current AI models for video generation struggle when they have to use different numbers of past frames as context. This makes it hard for them to create smooth and natural-looking videos, especially when working with longer sequences.
What's the solution?
The researchers developed DFoT, a flexible architecture that allows AI models to use any number of previous frames as context. They also introduced History Guidance, a set of techniques that improve video quality and motion consistency by guiding the model based on the history of frames. These methods enable the model to create longer videos that look better and flow more smoothly.
Why does it matter?
This matters because it makes AI video generation more reliable and versatile, allowing for the creation of longer and higher-quality videos. This could be useful in areas like filmmaking, animation, or virtual reality, where realistic and consistent video content is essential.
Abstract
Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Website: https://boyuan.space/history-guidance
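To make the CFG connection concrete, the core of vanilla history guidance is a CFG-style extrapolation between a history-conditioned and a history-dropped denoiser prediction. Below is a minimal sketch of that combination rule; the `denoise` callable and its `history=None` convention for the unconditional branch are hypothetical stand-ins, not the paper's actual API.

```python
import numpy as np

def vanilla_history_guidance(denoise, x_t, history, w):
    """CFG-style guidance over history conditioning (illustrative sketch).

    denoise(x_t, history) returns a predicted noise estimate;
    passing history=None stands in for dropping the history frames
    (the "unconditional" branch). w is the guidance scale.
    """
    eps_cond = denoise(x_t, history)   # conditioned on past frames
    eps_uncond = denoise(x_t, None)    # history dropped
    # Extrapolate from the unconditional prediction toward the
    # history-conditioned one; w = 1 recovers plain conditioning.
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy denoiser just to exercise the combination rule: its output
# shifts by +1 whenever history frames are provided.
def toy_denoise(x_t, history):
    return x_t + (1.0 if history is not None else 0.0)

x = np.zeros(4)
guided = vanilla_history_guidance(toy_denoise, x, history="ctx", w=2.0)
# eps_uncond = 0, eps_cond = 1, so guided = 0 + 2 * (1 - 0) = 2 everywhere
```

With w = 0 the history is ignored entirely, and w > 1 amplifies the influence of the context frames; the paper's more advanced variants apply such guidance selectively across time and frequency.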