FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo
2025-02-10
Summary
This paper introduces FlashVideo, an AI system that generates high-resolution videos from text descriptions. It uses a two-stage process to cut generation time and compute cost while keeping the videos realistic and detailed.
What's the problem?
Current AI models for text-to-video generation need large networks and many sampling steps to produce high-quality, high-resolution videos. This makes them slow, expensive, and impractical for many real-world uses.
What's the solution?
FlashVideo splits video generation into two stages. In the first stage, a large model creates a low-resolution version of the video, with the focus on matching the text description. In the second stage, a flow-matching technique enhances this low-resolution video to high resolution, starting from the low-resolution result rather than from scratch. Because the second stage only has to add fine detail, it needs very few sampling steps, which reduces the total computation and speeds up generation significantly without losing quality.
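The second-stage idea can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the arrays stand in for videos, `upsample_nearest`, `euler_flow`, and `oracle_velocity` are hypothetical names, and the velocity field is an oracle that already knows the target, whereas a real system would use a trained network. It only shows why starting from an upsampled low-resolution result, rather than from noise, can work with very few integration steps when the path between the two is close to a straight line.

```python
import numpy as np

def upsample_nearest(video, scale):
    """Nearest-neighbour upsampling of a (frames, height, width) array."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

def euler_flow(x_start, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x_start.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
high_res = rng.standard_normal((4, 8, 8))  # toy stand-in for the target high-res video
low_res = high_res[:, ::2, ::2]            # toy stand-in for the stage-one output
x0 = upsample_nearest(low_res, 2)          # start of the flow: upsampled low-res, not noise

def oracle_velocity(x, t):
    # Assumption: straight-line path x_t = (1 - t) * x0 + t * high_res,
    # whose velocity is constant. A trained model would predict this instead.
    return high_res - x0

# Each Euler step costs one function evaluation (NFE).
refined = euler_flow(x0, oracle_velocity, num_steps=4)
print(np.allclose(refined, high_res))
```

With a constant velocity, Euler integration lands on the target for any step count, which is the idealized case. A learned velocity field is only approximately straight, so the step count becomes a trade-off between cost and detail.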
Why it matters?
This matters because it makes creating high-quality videos faster and cheaper, which is useful for industries like film, advertising, and content creation. It also allows users to preview videos before committing to full resolution, saving time and resources.
Abstract
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands, especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low-resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.
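The allocation argument in the abstract can be made concrete with a rough cost proxy. The sketch below is an illustration under stated assumptions, not a measurement: `stage_cost` is a hypothetical helper, every number is a placeholder in arbitrary units rather than a value from the paper, and the proxy ignores attention's quadratic dependence on token count.

```python
def stage_cost(params, tokens, nfes):
    """Rough cost proxy: model size x tokens per evaluation x number of evaluations.

    Ignores the quadratic cost of attention, so it understates
    how expensive high-resolution evaluations really are.
    """
    return params * tokens * nfes

# All values are hypothetical placeholders in arbitrary units.
LARGE_MODEL, SMALL_MODEL = 1.0, 0.5
LOW_RES_TOKENS, HIGH_RES_TOKENS = 1, 16  # e.g. 4x more tokens per spatial side
MANY_NFES, FEW_NFES = 50, 5

# Single stage: the large model runs every evaluation at high resolution.
single_stage = stage_cost(LARGE_MODEL, HIGH_RES_TOKENS, MANY_NFES)

# Two stages: large model and many NFEs at low resolution,
# then a smaller model and few NFEs at high resolution.
stage_one = stage_cost(LARGE_MODEL, LOW_RES_TOKENS, MANY_NFES)
stage_two = stage_cost(SMALL_MODEL, HIGH_RES_TOKENS, FEW_NFES)
two_stage = stage_one + stage_two

print(f"two-stage / single-stage cost: {two_stage / single_stage:.4f}")
```

In this proxy the preview also comes cheaply: a user who rejects the stage-one output pays only `stage_one` and skips `stage_two` entirely.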