TinyFusion: Diffusion Transformers Learned Shallow
Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang
2024-12-03

Summary
This paper introduces TinyFusion, a method that makes diffusion transformers more efficient by learning to remove redundant layers while preserving generation quality.
What's the problem?
Diffusion transformers are powerful image generators, but they are heavily over-parameterized, which makes inference slow and computationally expensive. This is a problem for real-world deployment, especially on devices with limited processing power.
What's the solution?
TinyFusion addresses this with depth pruning: it removes entire redundant layers from a diffusion transformer. Rather than ranking layers with a fixed importance score, it makes the pruning decision itself learnable through a differentiable sampling technique, and it co-trains an auxiliary parameter update that simulates future fine-tuning. This lets the method optimize directly for how well the pruned model will perform after fine-tuning (its "recoverability"), instead of minimizing the loss immediately after pruning. The result is shallower models that run faster without losing image quality; across benchmarks, TinyFusion outperformed existing importance-based and error-based pruning methods while using far fewer training resources.
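To make the idea concrete, here is a minimal, hypothetical sketch of learnable depth pruning in PyTorch (not the authors' implementation). Each group of candidate blocks carries learnable logits over which single block to keep; a hard Gumbel-softmax sample makes that discrete keep/drop choice differentiable, so it can be trained end to end. The class names and the "keep 1 of 2" grouping are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunableGroup(nn.Module):
    """Keeps 1 of n candidate blocks, chosen by a learned distribution
    (illustrative stand-in for a learnable layer-pruning decision)."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.logits = nn.Parameter(torch.zeros(len(blocks)))  # learnable choice

    def forward(self, x, tau=1.0):
        # Hard one-hot sample with straight-through gradients:
        # forward pass uses exactly one block, backward pass updates all logits.
        gate = F.gumbel_softmax(self.logits, tau=tau, hard=True)
        return sum(g * blk(x) for g, blk in zip(gate, self.blocks))

# Toy model: 4 linear "transformer blocks" pruned to depth 2 (keep 1 per pair).
torch.manual_seed(0)
blocks = [nn.Linear(8, 8) for _ in range(4)]
model = nn.Sequential(PrunableGroup(blocks[:2]), PrunableGroup(blocks[2:]))
x = torch.randn(3, 8)
out = model(x)
print(out.shape)
```

After training, the highest-logit block in each group is kept and the rest are discarded, yielding a genuinely shallower network at inference time.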
Why it matters?
This research is important because it makes advanced image generation technology more accessible and practical for everyday use. By improving the efficiency of diffusion transformers, TinyFusion can enable better performance on devices like smartphones and tablets, making it easier to create high-quality images in real-time applications such as mobile apps, gaming, and augmented reality.
Abstract
Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2× speedup with an FID score of 2.86, outperforming competitors with comparable efficiency. Code is available at https://github.com/VainF/TinyFusion.
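The abstract's "co-optimized parameter to simulate future fine-tuning" can be sketched with a low-rank (LoRA-style) update trained alongside the pruning decision, so the training loss approximates the model's quality after fine-tuning rather than immediately after pruning. This is a hedged illustration under that assumption; the class, rank, and initialization below are not taken from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight plus a trainable low-rank delta -- a
    hypothetical stand-in for the co-optimized recoverability parameters."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        # Standard LoRA init: A small random, B zero, so the delta starts at 0.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

torch.manual_seed(0)
layer = LoRALinear(nn.Linear(8, 8))
x = torch.randn(2, 8)
y = layer(x)
print(y.shape)
```

Because only the small A and B matrices receive gradients, this cheap inner update lets the pruning objective "look ahead" at how recoverable each candidate sub-network is, without paying for a full fine-tuning run per candidate.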