
Scaling Diffusion Transformers to 16 Billion Parameters

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang

2024-07-17


Summary

This paper introduces DiT-MoE, a sparse diffusion Transformer that scales to 16 billion parameters while keeping inference efficient and maintaining high-quality image generation.

What's the problem?

Large diffusion models demand substantial compute and memory, which makes them difficult to deploy. Traditional dense models activate every parameter for every input, so they are slow and expensive to run, especially when generating images from complex data. Much of that capacity is redundant, meaning resources are wasted without a corresponding gain in quality.

What's the solution?

DiT-MoE addresses these issues with a sparse 'Mixture of Experts' (MoE) design: instead of running every part of the model on every input, a router activates only a few experts per token. Two simple additions make this work well: shared expert routing, which captures knowledge common to all inputs, and an expert-level balance loss, which spreads tokens evenly across experts and reduces redundancy among them. This lets DiT-MoE generate high-quality images with far less computation at inference time than comparable dense models; a minimal sketch of the idea follows.
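The sketch below is not the authors' code; it is a minimal, assumed illustration of a sparse MoE feed-forward layer in the spirit described above: a top-k router over routed experts, a shared expert that is always applied, and an expert-level load-balance loss. The class name `MoEFeedForward`, the hyperparameters, and the exact form of the balance loss (a Switch-Transformer-style auxiliary term) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a sparse MoE feed-forward layer with a shared expert and an
# expert-level balance loss. Hypothetical illustration, not DiT-MoE's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, num_experts=8, top_k=2, balance_weight=0.01):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.balance_weight = balance_weight
        # Routed experts: only top_k of these are activated per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])
        # Shared expert: always applied, intended to capture common knowledge.
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):
        # x: (batch, tokens, dim) -> flatten tokens for routing.
        b, t, d = x.shape
        flat = x.reshape(b * t, d)

        logits = self.router(flat)                        # (N, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)

        # Dispatch each token to its top-k experts, weighted by router probability.
        routed = torch.zeros_like(flat)
        for e in range(self.num_experts):
            mask = (topk_idx == e)                        # (N, top_k)
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                weight = topk_probs[token_ids, slot].unsqueeze(-1)
                routed[token_ids] += weight * self.experts[e](flat[token_ids])

        out = routed + self.shared_expert(flat)

        # Expert-level balance loss (assumed form): penalizes the product of the
        # fraction of tokens sent to each expert and its mean router probability,
        # encouraging uniform expert usage.
        token_frac = torch.zeros(self.num_experts, device=x.device)
        token_frac.scatter_add_(0, topk_idx.reshape(-1),
                                torch.ones(topk_idx.numel(), device=x.device))
        token_frac = token_frac / topk_idx.numel()
        mean_prob = probs.mean(dim=0)
        balance_loss = self.balance_weight * self.num_experts * (token_frac * mean_prob).sum()

        return out.reshape(b, t, d), balance_loss
```

In this sketch, only `top_k` of the `num_experts` routed experts run per token, so the compute per token stays roughly constant even as the total parameter count grows, which is the property that lets sparse models scale to billions of parameters without a matching increase in inference cost.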

Why it matters?

This research is important because it shows how we can create powerful AI models that are both efficient and effective. By scaling up to 16 billion parameters while optimizing resource use, DiT-MoE sets a new standard for image generation tasks. This could lead to advancements in various fields, such as computer graphics, video game design, and any area where high-quality image generation is needed.

Abstract

In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference. DiT-MoE includes two simple designs: shared expert routing and an expert-level balance loss, thereby capturing common knowledge and reducing redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of expert specialization yields some interesting observations: (i) Expert selection shows a preference for spatial position and denoising time step, while being insensitive to different class-conditional information; (ii) As the MoE layers go deeper, the selection of experts gradually shifts from specific spatial positions to dispersion and balance; (iii) Expert specialization tends to be more concentrated in the early time steps and then gradually becomes uniform after the halfway point. We attribute this to the diffusion process, which first models low-frequency spatial information and then high-frequency complex information. Based on the above guidance, a series of DiT-MoE models experimentally achieves performance on par with dense networks yet requires much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 16.5B parameters and attaining a new SoTA FID-50K score of 1.80 in the 512×512 resolution setting. The project page: https://github.com/feizc/DiT-MoE.