Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
Yujiao Yang, Jing Lian, Linhui Li
2025-03-07
Summary
This paper introduces Union-of-Experts (UoE), a new way to make AI models work better and faster by breaking them into smaller parts, called experts, that work together more efficiently.
What's the problem?
In current AI models using Mixture-of-Experts (MoE), the experts work in isolation rather than cooperating, and MoE can't be applied to all parts of the model, which limits how much it can improve performance and efficiency.
What's the solution?
The researchers created UoE, which splits the AI model into equal parts called experts. They designed new ways for these experts to work together and to choose which parts of the input to focus on. They also figured out how to use this method in different parts of the model, including the attention blocks, which are important for understanding context.
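The "choosing which parts to focus on" step can be illustrated with a minimal top-k expert-selection sketch in numpy. This is a generic sparse-routing toy, not the paper's implementation; the gating matrix and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_routing(x, expert_weights, k=2):
    """Toy sketch of expert selection: a linear gate scores every expert
    for each input, only the top-k experts run, and their outputs are
    combined with softmax-renormalized gate weights."""
    n_experts = len(expert_weights)
    gate = rng.standard_normal((x.shape[-1], n_experts))  # hypothetical gating matrix
    scores = x @ gate                                     # (batch, n_experts)
    top = np.argsort(scores, axis=-1)[:, -k:]             # indices of the k best experts
    out = np.zeros_like(x)
    for i, chosen in enumerate(top):
        w = scores[i, chosen]
        w = np.exp(w - w.max()); w /= w.sum()             # softmax over the selected experts
        for weight, e in zip(w, chosen):
            out[i] += weight * (x[i] @ expert_weights[e]) # compute only the chosen experts
    return out

experts = [rng.standard_normal((8, 8)) for _ in range(4)]
y = top_k_routing(rng.standard_normal((3, 8)), experts, k=2)
```

Because only k of the n experts run per input, the compute cost stays roughly constant as more experts are added, which is the efficiency argument behind sparse routing.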
Why it matters?
This matters because it makes AI models faster and better at tasks like understanding images and language. By improving how the different parts of a model work together, UoE could lead to more efficient and powerful AI systems that handle complex tasks with less computing power.
Abstract
Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, each expert in the existing MoE paradigm works as an individual, and thus lacks high-quality expert interactions. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer into an equitant group of experts and then implements dynamic routing over input data and experts. Our approach advances MoE design with four key innovations: (1) We conduct equitant expert decomposition on both MLP blocks and attention blocks based on matrix partition in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing across different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. Experiments demonstrate that models employing UoE surpass Full Attention, state-of-the-art MoEs, and efficient transformers on several tasks across image and natural language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.
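The "matrix partition in tensor parallelism" that underlies the expert decomposition can be sketched in a few lines of numpy. Under the standard tensor-parallel partition of an MLP (first weight split by columns, second by rows), the experts' outputs sum exactly to the dense MLP output, so the decomposition itself is lossless; all names and sizes below are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, n = 8, 16, 4                     # model dim, hidden dim, number of experts

# Dense MLP block: x @ W1 -> ReLU -> @ W2
W1 = rng.standard_normal((d, h))
W2 = rng.standard_normal((h, d))

def full_mlp(x):
    return np.maximum(x @ W1, 0) @ W2

# Tensor-parallel-style partition: split W1 column-wise and W2 row-wise
# into n equal slices, one slice pair per expert.
experts = [(W1[:, i * h // n:(i + 1) * h // n],
            W2[i * h // n:(i + 1) * h // n, :]) for i in range(n)]

def expert_mlp(x, e):
    A, B = experts[e]
    return np.maximum(x @ A, 0) @ B

x = rng.standard_normal((2, d))
dense = full_mlp(x)
union = sum(expert_mlp(x, e) for e in range(n))
assert np.allclose(dense, union)       # running every expert recovers the dense output
```

Because ReLU acts element-wise, partitioning the hidden dimension commutes with the activation, which is why the slice-wise sums match the dense computation; routing then simply drops some of the summands to save compute.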