Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang
2025-02-11
Summary
This paper introduces Efficient-vDiT, a new method that makes AI-generated videos faster and cheaper to produce without losing much quality. It focuses on improving Diffusion Transformers (DiTs), a type of AI model that can create high-quality videos.
What's the problem?
Current video-generating AI models, like Open-Sora-Plan, are very slow. They take a long time to make even short videos because they perform a huge number of attention calculations and go through many sampling steps to create each frame. For example, it takes more than 9 minutes to make a single 29-frame video, which is far too slow for practical use.
What's the solution?
The researchers came up with two main ways to speed things up. First, they found a repetitive, tile-like pattern in how the AI pays attention to different parts of the video, and used it to prune the attention computation so that its cost grows only linearly with the number of frames. Second, they shortened the sampling process by breaking it into smaller segments and applying a technique called consistency distillation within each one. They combined these ideas into a three-stage training pipeline. With these changes, they made the Open-Sora-Plan model up to 7.8 times faster at generating videos.
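The first idea can be illustrated with a toy frame-level attention mask. The paper's actual sparse pattern is more involved; this sketch (with a hypothetical `tile_sparse_mask` helper and an assumed "anchor frames" layout) only shows how attending to oneself plus a small fixed set of frames makes the number of attended frame pairs grow linearly with frame count, instead of quadratically as in 3D full attention:

```python
import numpy as np

def tile_sparse_mask(num_frames, num_anchors):
    # Frame-level mask: entry (i, j) = True means frame i attends to frame j.
    # Each frame attends to itself plus a small fixed set of "anchor" frames,
    # and anchors attend to every frame. The number of True entries is
    # roughly (2 * num_anchors + 1) * num_frames, i.e. linear in num_frames,
    # versus num_frames ** 2 for full attention.
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    np.fill_diagonal(mask, True)      # every frame attends to itself
    mask[:, :num_anchors] = True      # every frame attends to the anchors
    mask[:num_anchors, :] = True      # anchors attend to all frames
    return mask

full_pairs = 16 * 16                          # full attention: 256 pairs
sparse_pairs = tile_sparse_mask(16, 2).sum()  # sparse: far fewer pairs
```

Doubling `num_frames` roughly doubles the number of attended pairs under this mask, which is the linear-complexity property the paper's sparse 3D attention is built around.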
Why it matters?
This matters because it could make AI-generated videos much more practical to use in real-world situations. Faster video generation means we could see more AI-created content in areas like entertainment, education, or advertising. It also shows that we can make AI models more efficient without sacrificing too much quality, which is important as we try to create more advanced AI systems that don't require as much computing power or energy.
Abstract
Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model consumes more than 9 minutes to generate a single video of 29 frames. This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data; we identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that holds a linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation; we split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline to conjoin the low-complexity attention and few-step generation capacities. Notably, with 0.1% pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x-7.8x faster for 29- and 93-frame 720p video generation with a marginal performance trade-off in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
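The segment splitting behind the multi-step consistency distillation can be sketched as follows. This is a simplified illustration, not the paper's implementation: a hypothetical `split_trajectory` helper divides a sampling trajectory into contiguous segments, and distillation within each segment lets the student jump directly across it, so sampling takes one step per segment instead of one step per timestep:

```python
import numpy as np

def split_trajectory(num_steps, num_segments):
    # Split a diffusion sampling trajectory of `num_steps` timesteps into
    # `num_segments` contiguous segments. Consistency distillation is then
    # performed within each segment, so the distilled student jumps from one
    # segment boundary to the next: `num_segments` sampling steps in total
    # instead of `num_steps`.
    boundaries = np.linspace(0, num_steps, num_segments + 1, dtype=int)
    return [(int(boundaries[i]), int(boundaries[i + 1]))
            for i in range(num_segments)]

# e.g. a 100-step trajectory distilled into 4 segments -> 4-step sampling
segments = split_trajectory(100, 4)
# segments == [(0, 25), (25, 50), (50, 75), (75, 100)]
```

Splitting into several segments (rather than distilling the whole trajectory into a single step) is a trade-off: more segments mean more sampling steps but an easier distillation target per segment, which helps preserve generation quality.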