Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao
2025-12-30
Summary
This paper focuses on improving Mixture-of-Experts (MoE) models, an architecture widely used in large language models. The core idea is to make sure the 'router' – the part of the model that decides which 'expert' handles a piece of information – actually sends information to the experts best suited to deal with it.
What's the problem?
MoE models can be very powerful, but they often fall short of their potential because there is no strong connection between what the router *thinks* each expert is good at and what the experts *actually* do well. The router might send data to an expert that isn't really equipped to handle it, limiting the overall performance of the model. Previous attempts to enforce this connection were computationally expensive: their cost grows with the number of tokens processed, which can reach millions per batch.
What's the solution?
The researchers introduced something called 'Expert-Router Coupling' (ERC) loss. Think of it like this: each expert gets a special 'embedding' – a numerical representation – that acts as a stand-in for all the information that expert usually handles. The ERC loss then makes sure two things happen. First, each expert gets more 'excited' (shows higher activation) when it sees its own stand-in embedding than when it sees the stand-ins of other experts. Second, each stand-in embedding makes its corresponding expert more 'excited' than any other expert. This forces the router to send information to the right experts and ensures those experts specialize in the types of data they receive. Importantly, this method is efficient because it operates on a small, fixed number of activations – n² of them, where n is the number of experts – regardless of how much data is being processed.
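The two constraints can be sketched as a toy loss function. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: it assumes an n×n matrix whose entry (i, j) is expert i's internal activation on expert j's stand-in (proxy) token, and it applies a softmax cross-entropy along both rows and columns with the diagonal as the target class, so the matching expert–proxy pair is pushed to dominate in both directions.

```python
import numpy as np

def erc_loss(activations):
    """Hypothetical sketch of an ERC-style coupling loss.

    activations: (n, n) matrix; activations[i, j] is expert i's
    internal activation on expert j's perturbed router embedding.
    Rows enforce constraint (1): each expert prefers its own proxy.
    Columns enforce constraint (2): each proxy prefers its own expert.
    """
    n = activations.shape[0]

    def softmax_xent_diag(logits):
        # Cross-entropy where the target class for row k is k itself.
        logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    row_loss = softmax_xent_diag(activations)    # over each expert's row
    col_loss = softmax_xent_diag(activations.T)  # over each proxy token's column
    return row_loss + col_loss
```

A diagonal-dominant activation matrix (each expert most excited by its own proxy) yields a lower loss than a flat one, which is exactly the coupling behavior the text describes.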
Why it matters?
This work is important because it provides a more efficient and effective way to train MoE models. By better aligning the router's decisions with expert capabilities, the models become more powerful and accurate. The ERC loss also gives researchers a way to monitor how specialized each expert is becoming during training, offering valuable insights into how these complex models work and how to improve them further.
Abstract
Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
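To make the fixed n² cost in the abstract concrete, the sketch below builds the n×n activation matrix such a loss would operate on: each expert's router embedding is perturbed and fed through every expert, giving one scalar activation per (expert, proxy-token) pair. The toy ReLU experts and the Gaussian perturbation scale are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8  # number of experts, embedding dimension (toy sizes)

# Hypothetical stand-ins: one router embedding per expert, and one
# weight matrix per expert acting as a toy one-layer ReLU "expert".
router_emb = rng.normal(size=(n, d))
expert_w = rng.normal(size=(n, d, d))

# Perturb each router embedding with small noise before probing,
# as the abstract describes.
perturbed = router_emb + 0.01 * rng.normal(size=(n, d))

# A[i, j]: mean internal activation of expert i on expert j's proxy
# token -- n^2 probe passes in total, independent of batch size.
A = np.empty((n, n))
for i in range(n):
    for j in range(n):
        hidden = np.maximum(perturbed[j] @ expert_w[i], 0.0)  # ReLU hidden state
        A[i, j] = hidden.mean()
```

Note that the matrix is n×n regardless of how many tokens the batch contains, which is the efficiency argument the abstract contrasts with token-scaled coupling methods.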