
Pyramidal Flow Matching for Efficient Video Generative Modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin

2024-10-10


Summary

This paper introduces Pyramidal Flow Matching, a method for efficient video generation that breaks the denoising process into a series of pyramid stages, with only the final stage operating at full resolution.

What's the problem?

Generating videos requires modeling a vast spatiotemporal space, which demands a great deal of computation and data. Prevailing methods cope by using cascaded pipelines: separate models are trained at low resolution first and at progressively higher resolutions afterward. Because each sub-stage is optimized independently, knowledge is not shared between stages, and the overall pipeline loses flexibility and efficiency.

What's the solution?

The authors propose a unified pyramidal flow matching algorithm that reinterprets the denoising trajectory as a series of pyramid stages: early stages operate on low-resolution, compressed representations, and only the final stage runs at full resolution, which saves substantial computation while preserving quality. The flows of the different stages are interlinked so generation stays continuous from one stage to the next, and the whole framework is trained end to end with a single Diffusion Transformer (DiT). The resulting model generates high-quality 5- to 10-second videos at 768p resolution and 24 frames per second, trained in only 20.7k A100 GPU hours.
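To make the stage-wise idea concrete, here is a minimal, self-contained sketch rather than the authors' implementation: a toy sampler that integrates a learned velocity field over a few pyramid stages, upsampling and re-noising the latent between stages so that only the last stage runs at full resolution. The `velocity_model` callable, the 2x-per-stage resolution schedule, and the re-noising coefficient are illustrative placeholders, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def pyramidal_flow_sample(velocity_model, full_res=(64, 64), num_stages=3,
                          steps_per_stage=10, channels=16, device="cpu"):
    """Toy stage-wise sampler: each pyramid stage denoises at a coarser
    resolution; only the final stage operates at full resolution."""
    h, w = full_res
    # Start from Gaussian noise at the coarsest resolution (halved once per earlier stage).
    scale = 2 ** (num_stages - 1)
    x = torch.randn(1, channels, h // scale, w // scale, device=device)

    for stage in range(num_stages):
        # Each stage covers a sub-interval of the flow-matching time axis [0, 1].
        t_start, t_end = stage / num_stages, (stage + 1) / num_stages
        ts = torch.linspace(t_start, t_end, steps_per_stage + 1, device=device)

        # Simple Euler integration of the learned velocity field within the stage.
        for i in range(steps_per_stage):
            t = ts[i].expand(x.shape[0])
            v = velocity_model(x, t, stage)      # predicted velocity at this stage
            x = x + (ts[i + 1] - ts[i]) * v

        # Between stages, upsample the latent and re-inject a little noise so the
        # next (higher-resolution) stage starts from a compatible noisy sample.
        if stage < num_stages - 1:
            x = F.interpolate(x, scale_factor=2, mode="nearest")
            x = x + 0.1 * torch.randn_like(x)    # noise level is illustrative only

    return x  # full-resolution latent after the final stage

# Usage with a dummy velocity field standing in for the trained DiT:
dummy_model = lambda x, t, stage: -x             # drives samples toward zero
sample = pyramidal_flow_sample(dummy_model)
print(sample.shape)                              # torch.Size([1, 16, 64, 64])
```

The key saving is that most integration steps run on latents a quarter or a sixteenth the size of the final output, so full-resolution compute is spent only where it matters.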

Why it matters?

This research is significant because it makes high-quality video generation more accessible and efficient, allowing creators to produce videos faster and with less computational cost. With its open-source nature, Pyramidal Flow can be used by developers, small businesses, and independent creators, potentially transforming how video content is created across various industries.

Abstract

Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models will be open-sourced at https://pyramid-flow.github.io.
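As an illustration of the temporal-pyramid idea mentioned in the abstract (compressing the full-resolution history for autoregressive generation), the sketch below keeps only the most recent latent frame at full resolution and downsamples older frames more aggressively. The function name, downsampling factors, and tensor shapes are hypothetical; the paper's actual compression scheme differs in detail.

```python
import torch
import torch.nn.functional as F

def compress_history(history_latents, max_full_res_frames=1):
    """Toy temporal-pyramid conditioning: recent frames stay at full resolution,
    older frames are downsampled more aggressively, so the autoregressive model
    attends to a compressed version of the past."""
    compressed = []
    n = len(history_latents)
    for idx, frame in enumerate(history_latents):    # ordered oldest -> newest
        age = n - 1 - idx                            # 0 for the newest frame
        if age < max_full_res_frames:
            compressed.append(frame)                 # keep recent frames as-is
        else:
            # Downsampling factor grows with age (capped for this toy example).
            factor = 2 ** min(age, 3)
            compressed.append(F.interpolate(frame, scale_factor=1 / factor,
                                            mode="bilinear", align_corners=False))
    return compressed

# Usage: four (1, C, H, W) history latents; only the newest stays full-resolution.
frames = [torch.randn(1, 16, 96, 160) for _ in range(4)]
out = compress_history(frames)
print([tuple(f.shape[-2:]) for f in out])            # e.g. [(12, 20), (24, 40), (48, 80), (96, 160)]
```

This mirrors the intuition that distant history contributes coarse context while the most recent frames carry the fine detail needed for the next prediction.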