Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao

2025-02-11

Summary

This paper introduces Lumina-Video, a new AI system that can create high-quality videos from text descriptions or images. It builds on existing technology for generating realistic images and adds special features to handle the extra challenges of creating smooth, natural-looking videos.

What's the problem?

While AI has gotten really good at making realistic still images, making videos is much harder. Videos need to look good not just in one frame, but across many frames that flow smoothly together. Current systems struggle with this complexity, especially when trying to create longer videos that look natural and match what the user asked for.

What's the solution?

The researchers created Lumina-Video, which uses a design called Multi-scale Next-DiT. This lets the AI process the video at several levels of detail at the same time, from the big picture down to fine details, which makes it both faster and more flexible. They also added a way to control how much movement appears in the video, and used clever training tricks, gradually increasing resolution and frame rate and mixing real and computer-generated videos, to help the AI learn efficiently. As a bonus, they built a companion system called Lumina-V2A that can add matching sounds to the generated videos.
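
To make the "several levels of detail" idea concrete, here is a minimal sketch of multi-scale patchification, the general technique the paper builds on: the same video latent is cut into larger patches (fewer tokens, cheaper to process) or smaller patches (more tokens, finer detail). The tensor shapes and patch sizes below are illustrative assumptions, not the paper's actual configuration.

```python
import torch

def patchify(latent, patch_size):
    """Split a video latent into non-overlapping spatiotemporal patches (tokens).

    latent: tensor of shape (C, T, H, W); patch_size: (pt, ph, pw).
    Returns a (num_tokens, token_dim) tensor.
    """
    C, T, H, W = latent.shape
    pt, ph, pw = patch_size
    # Carve each axis into (num_patches, patch_extent) pairs, then flatten
    # every patch into one token vector.
    tokens = latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    tokens = tokens.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * pt * ph * pw)
    return tokens

latent = torch.randn(4, 8, 32, 32)   # (channels, frames, height, width) -- made-up sizes
coarse = patchify(latent, (2, 4, 4)) # larger patches: 256 tokens, cheap
fine = patchify(latent, (1, 2, 2))   # smaller patches: 2048 tokens, detailed
```

A model that jointly learns both patchifications can run cheaply on the coarse tokens and switch to the fine tokens when detail matters, which is the efficiency/flexibility trade-off the summary describes.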

Why it matters?

This matters because it could make it much easier for people to create high-quality videos without needing expensive equipment or lots of technical skills. It could be used for things like making educational videos, creating special effects for movies, or helping businesses make better marketing content. As videos become more important in how we communicate and share information online, tools like Lumina-Video could give more people the power to express their ideas visually.

Abstract

Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of generated videos' dynamic degree. Combined with a progressive training scheme with increasingly higher resolution and FPS, and a multi-source training scheme with mixed natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Codes are released at https://www.github.com/Alpha-VLLM/Lumina-Video.
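
The abstract's "motion score as an explicit condition" follows a common DiT conditioning pattern: a scalar condition is embedded like a diffusion timestep and folded into the conditioning vector that modulates the transformer blocks. The sketch below illustrates that general pattern only; the embedding dimension, the additive combination, and the assumed motion-score scale are illustrative guesses, not the paper's implementation.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim):
    """Standard sinusoidal embedding of a scalar condition, one per batch element."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = x[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class ConditionEmbedder(nn.Module):
    """Combines timestep and motion-score embeddings into one conditioning
    vector, which would modulate DiT blocks (e.g. via adaptive layer norm)."""
    def __init__(self, dim=256):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t, motion_score):
        # Embed both scalars the same way and sum before the shared MLP.
        cond = sinusoidal_embedding(t, self.dim) + sinusoidal_embedding(motion_score, self.dim)
        return self.mlp(cond)

embedder = ConditionEmbedder()
t = torch.tensor([500.0])        # diffusion timestep
motion = torch.tensor([12.0])    # higher score -> more dynamic video (assumed scale)
cond = embedder(t, motion)
```

Because the motion score enters as an ordinary conditioning input, the user can dial the amount of motion up or down at inference time simply by changing this one number, which is the "direct control of dynamic degree" the abstract claims.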