ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

Ziteng Wang, Jianfei Chen, Jun Zhu

2024-12-25

Summary

This paper introduces ReMoE, an improved Mixture-of-Experts (MoE) architecture that replaces the usual discrete expert-selection step with a fully differentiable, ReLU-based routing method, boosting both performance and scalability without increasing the computation budget.

What's the problem?

Mixture-of-Experts models make AI systems more efficient by activating only a few 'expert' sub-networks for each input token instead of the whole model. However, the traditional way of choosing which experts to use, the TopK router, makes a hard, discontinuous selection that is not differentiable, so training cannot smoothly adjust which experts get picked. This limits both the performance and the scalability of these models, especially as the number of experts grows.
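To make the discontinuity concrete, here is a minimal, illustrative sketch of conventional TopK+Softmax routing (not the paper's Megatron-LM code); the function and variable names are assumptions chosen for clarity:

```python
# Minimal sketch of conventional TopK+Softmax routing (illustrative only).
import torch
import torch.nn.functional as F

def topk_softmax_router(hidden, router_weight, k=2):
    # hidden: [num_tokens, d_model], router_weight: [d_model, num_experts]
    logits = hidden @ router_weight                  # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_vals, topk_idx = probs.topk(k, dim=-1)      # hard, discrete selection
    gates = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    # The set of selected experts flips abruptly when logits cross each other,
    # so gradients carry no signal about *which* experts should be chosen.
    return gates                                     # sparse mixing weights
```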

What's the solution?

The authors propose ReMoE, which replaces the conventional TopK+Softmax routing with a router based on ReLU (a simple activation function that outputs zero for negative inputs and passes positive values through unchanged). Because the ReLU gates change continuously during training, the model can smoothly adjust how many and which experts each token uses. ReMoE also includes techniques to keep the router sparse, so computation stays cheap, and to balance the workload across experts. Experiments show that ReMoE consistently outperforms standard TopK-routed MoE models across different model sizes, expert counts, and levels of granularity.
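A hedged sketch of the ReLU-routing idea described above (illustrative, not the authors' exact implementation): replacing the TopK+Softmax selection with a ReLU keeps the gating values continuous and differentiable, and a token simply uses every expert whose gate is positive.

```python
# Minimal sketch of a ReLU router (illustrative only): negative router logits
# are zeroed out, so each token activates only the experts with positive
# logits, and the number of active experts can vary smoothly during training.
import torch

def relu_router(hidden, router_weight):
    # hidden: [num_tokens, d_model], router_weight: [d_model, num_experts]
    logits = hidden @ router_weight    # [num_tokens, num_experts]
    gates = torch.relu(logits)         # continuous, differentiable sparsity
    active = gates > 0                 # which experts each token actually uses
    return gates, active
```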

Why it matters?

This research matters because it improves how efficiently AI systems use their computation. By making Mixture-of-Experts models more effective and easier to train, ReMoE can lead to better performance in demanding applications such as natural language processing and decision-making tasks, while keeping the cost of each prediction low.

Abstract

Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
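The abstract also mentions methods to regulate the router's sparsity while balancing the load among experts. As a rough, assumed illustration of how such regulation could look (the coefficient-update rule and the names below are my assumptions, not the paper's exact formulation), one can penalize the L1 norm of the ReLU gates and adapt the penalty strength toward a target sparsity level:

```python
# Hedged sketch of one way to regulate ReLU-router sparsity: an L1 penalty on
# the nonnegative gates whose coefficient is nudged up or down depending on
# whether the observed sparsity misses a target. `target_sparsity`, `step`,
# and the update rule are assumptions for illustration.
import torch

def sparsity_penalty(gates, lambda_coef, target_sparsity=0.9, step=1.1):
    # gates: [num_tokens, num_experts], nonnegative ReLU router outputs
    l1 = gates.mean()                                # L1 term (gates >= 0)
    observed_sparsity = (gates == 0).float().mean()  # fraction of zero gates
    # Adapt the coefficient toward the target sparsity level.
    if observed_sparsity < target_sparsity:
        lambda_coef = lambda_coef * step             # too dense: penalize harder
    else:
        lambda_coef = lambda_coef / step             # too sparse: relax penalty
    return lambda_coef * l1, lambda_coef
```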