FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Schönfeld
2025-02-28
Summary
This paper introduces FlexiDiT, a new way to make AI models that create images and videos work more efficiently without losing quality. It's like teaching a smart artist to paint beautiful pictures using less paint and time.
What's the problem?
Current AI models that create images and videos, called Diffusion Transformers, are really good at what they do, but they use up a lot of computer power. This is because they use the same amount of resources for each step of creating an image, even when some steps don't need as much power.
What's the solution?
The researchers created FlexiDiT, which allows these AI models to be more flexible in how they use computer resources. FlexiDiT can adjust how much power it uses at each step of creating an image or video. This means it can use less power for simpler parts of the process and more power for the complex parts. They also found a way to convert existing AI models to use this new flexible approach.
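To make the idea of per-step compute budgets concrete, here is a minimal illustrative sketch. All names and numbers (`compute_schedule`, `flops`, the 60/40 split, the 4x cost ratio) are hypothetical assumptions for illustration, not the paper's actual API or schedule:

```python
# Hypothetical sketch of dynamic per-step compute in a diffusion sampler.
# The real FlexiDiT method and its schedules may differ substantially.

def compute_schedule(num_steps, cheap_fraction=0.6):
    """Assign a compute budget label to each denoising step.

    Assumption for this sketch: earlier (noisier) steps shape coarse
    structure and can tolerate a smaller budget, while later steps
    refine fine detail at full budget.
    """
    n_cheap = int(num_steps * cheap_fraction)
    return ["low"] * n_cheap + ["high"] * (num_steps - n_cheap)

def step_flops(budget, full_flops=100):
    # Illustrative cost model: a "low" budget processes ~1/4 the tokens
    # (e.g. a larger patch size means fewer tokens for the transformer).
    return full_flops if budget == "high" else full_flops // 4

schedule = compute_schedule(num_steps=50)
dynamic_total = sum(step_flops(b) for b in schedule)
static_total = 50 * 100
savings = 100 * (1 - dynamic_total / static_total)
print(f"dynamic: {dynamic_total} vs static: {static_total} "
      f"({savings:.0f}% FLOPs saved)")
```

With these toy numbers, spending the reduced budget on 60% of the steps already cuts total FLOPs by roughly 45%, which shows how per-step flexibility can produce savings of the magnitude the paper reports.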
Why it matters?
This matters because it makes AI that creates images and videos much more efficient. FlexiDiT can create the same high-quality images and videos while using 40% less computer power for images and up to 75% less for videos. This could make these AI tools faster and cheaper to use, which could lead to new and exciting applications in areas like art, entertainment, and even scientific visualization. It's a big step towards making advanced AI more accessible and practical for everyday use.
Abstract
Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones -- dubbed FlexiDiT -- allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than 40% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to 75% less compute without compromising performance.