Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, Hongming Shan
2025-10-29
Summary
This paper introduces a new way to improve the performance of image generation models, specifically Diffusion Transformers, by using a technique called Mixture-of-Experts (MoE). It addresses the challenges of applying MoE to images, where it hasn't worked as well as it does with text.
What's the problem?
While MoE has been successful in making large language models more powerful and efficient, it hasn't translated well to image generation. The issue is twofold: images contain a lot of redundant information across different regions (like a blue sky continuing across the picture), and image tokens also play different functional roles (some carry conditioning information while others carry visual content). Together, these properties make it hard for the 'experts' in the MoE system to specialize and become truly useful, unlike with text, where each token carries dense, distinctive meaning.
What's the solution?
The researchers developed a framework called ProMoE, which uses a two-step router to decide which image tokens each 'expert' should handle. First, conditional routing splits tokens into conditional and unconditional sets according to their functional roles. Then, prototypical routing refines the assignment of the conditional tokens by grouping them according to their semantic content, matching each token against 'prototypes' – learnable, representative examples of different image features, one per expert. On top of this, they add a routing contrastive loss that explicitly trains experts to focus on distinct, internally coherent groups of tokens.
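To make the prototype idea concrete, here is a minimal sketch of similarity-based routing: each token is compared against one learnable prototype per expert and sent to its most similar expert(s). This is an illustration of the general mechanism, not the authors' implementation; the function name, the cosine-similarity formulation, and the top-k handling are assumptions.

```python
import numpy as np

def prototypical_route(tokens, prototypes, top_k=1):
    """Assign each token to the expert(s) whose prototype it most resembles.

    tokens:     (N, d) array of token features
    prototypes: (E, d) array, one learnable prototype per expert
    Returns an (N, top_k) array of expert indices.
    """
    # Normalize so the dot product becomes cosine similarity.
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = t @ p.T  # (N, E) token-to-prototype similarities

    # Each token goes to its top-k most similar experts.
    return np.argsort(-sim, axis=1)[:, :top_k]

# Toy example: 3 experts with orthogonal prototypes in 3-d.
protos = np.eye(3)
toks = np.array([[1.0, 0.1, 0.0],   # closest to prototype 0
                 [0.0, 1.0, 0.2],   # closest to prototype 1
                 [0.1, 0.0, 2.0]])  # closest to prototype 2
print(prototypical_route(toks, protos))  # -> [[0], [1], [2]]
```

Because assignments are driven by similarity in latent space rather than an opaque learned gate, it is natural to add extra semantic supervision on top of this routing, which is what the paper's contrastive loss does.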
Why it matters?
This work is important because it unlocks the potential of using MoE to build much larger and more capable image generation models. By overcoming the challenges of applying MoE to images, ProMoE allows for more efficient scaling of these models, leading to potentially higher quality and more detailed image generation. It's a step towards creating AI that can generate images as effectively as it processes text.
Abstract
Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
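The routing contrastive loss described in the abstract pulls each token toward its assigned expert's prototype (intra-expert coherence) while pushing it away from the other prototypes (inter-expert diversity). A standard way to express such an objective is an InfoNCE-style softmax over token-prototype similarities; the sketch below assumes that formulation, and the function name and temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def routing_contrastive_loss(tokens, prototypes, assignments, tau=0.1):
    """InfoNCE-style contrastive loss over token-prototype similarities.

    tokens:      (N, d) token features
    prototypes:  (E, d) expert prototypes
    assignments: (N,)   index of the expert each token was routed to
    The assigned prototype is the positive; all others are negatives.
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (t @ p.T) / tau  # (N, E) temperature-scaled similarities

    # Numerically stable log-softmax over experts.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-likelihood of each token's assigned prototype.
    return -log_probs[np.arange(len(tokens)), assignments].mean()
```

Minimizing this loss makes tokens routed to the same expert cluster around that expert's prototype while prototypes of different experts stay separated, which is the coherence/diversity behavior the paper argues is crucial for expert specialization in vision MoE.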