Progressive Autoregressive Video Diffusion Models
Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, Yang Zhou
2024-10-13

Summary
This paper introduces Progressive Autoregressive Video Diffusion Models, a method that turns existing video diffusion models into autoregressive generators capable of producing much longer videos without sacrificing quality.
What's the problem?
Current video diffusion models can only create short video clips, typically around 10 seconds or 240 frames, because of the heavy computational demands during training. This limits their usefulness for applications that require longer videos.
What's the solution?
To solve this issue, the authors propose a method that allows existing video diffusion models to generate longer videos without changing their architecture. Instead of denoising all frames at a single shared noise level, they assign progressively increasing noise levels across the latent frames, so each frame is conditioned on the less-noisy frames ahead of it. This keeps video quality stable, ensures smooth transitions between frames, and lets the model generate frames one after another indefinitely. With this progressive approach, their models can generate videos that are one minute long (1440 frames at 24 frames per second) while preserving high quality.
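To make the mechanism concrete, here is a minimal PyTorch sketch of progressive autoregressive denoising. The window size, latent shape, and the `dummy_denoiser` stand-in are illustrative assumptions for this sketch, not the authors' implementation; a real system would plug in a pretrained video diffusion model's denoising step.

```python
import torch

def progressive_noise_levels(window: int) -> torch.Tensor:
    # Frame i in the window is assigned noise level i + 1: noise increases
    # monotonically from the front (nearly clean) to the back (pure noise),
    # instead of one shared level for every frame.
    return torch.arange(1, window + 1)

def dummy_denoiser(latents: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # Stand-in for one denoising step of a pretrained video diffusion model;
    # a real model would move each frame from noise level l to level l - 1.
    return 0.9 * latents

def generate_long_video(denoise_step, num_frames: int, window: int = 16,
                        latent_shape=(4, 32, 32)) -> torch.Tensor:
    levels = progressive_noise_levels(window)
    latents = torch.randn(window, *latent_shape)  # start from pure noise
    # Warm up until the front frame has been fully denoised once. (A real
    # system might instead initialize from a conventionally generated clip.)
    for _ in range(window):
        latents = denoise_step(latents, levels)
        latents = torch.cat([latents[1:], torch.randn(1, *latent_shape)])
    # Steady state: each step denoises every frame by one level, pops the
    # now-clean front frame, and appends a fresh pure-noise frame at the
    # back, so generation can continue for arbitrarily many frames.
    frames = []
    for _ in range(num_frames):
        latents = denoise_step(latents, levels)
        frames.append(latents[0])
        latents = torch.cat([latents[1:], torch.randn(1, *latent_shape)])
    return torch.stack(frames)

video = generate_long_video(dummy_denoiser, num_frames=48)
print(video.shape)  # torch.Size([48, 4, 32, 32])
```

Because only one frame enters and one frame leaves per step, consecutive denoising windows overlap in all but one frame, which is what keeps transitions smooth.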
Why it matters?
This research is significant because it expands the capabilities of video generation models, enabling them to create longer and more detailed videos. This advancement could have a big impact on various fields such as filmmaking, gaming, and virtual reality, where high-quality, extended video content is essential.
Abstract
Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computational limitations during training. In this work, we show that existing models can be naturally extended to autoregressive video diffusion models without changing the architectures. Our key idea is to assign the latent frames progressively increasing noise levels rather than a single noise level, which allows for fine-grained conditioning among the latents and large overlaps between the attention windows. Such progressive video denoising allows our models to autoregressively generate video frames without quality degradation or abrupt scene changes. We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS). Videos from this paper are available at https://desaixie.github.io/pa-vdm/.
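To make the "large overlaps between the attention windows" concrete, the toy printout below (with hypothetical frame indices, not taken from the paper) shows how shifting the window by one frame per autoregressive step leaves consecutive windows sharing all but one frame, while each position keeps its assigned noise level:

```python
window = 4  # toy window size for illustration
levels = list(range(1, window + 1))  # progressively increasing noise levels
for step in range(3):
    frames = list(range(step, step + window))
    print(f"step {step}: frames {frames} at noise levels {levels}")
# step 0: frames [0, 1, 2, 3] at noise levels [1, 2, 3, 4]
# step 1: frames [1, 2, 3, 4] at noise levels [1, 2, 3, 4]
# step 2: frames [2, 3, 4, 5] at noise levels [1, 2, 3, 4]
```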