Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
2025-01-22

Summary
This paper presents a more effective way to train large AI models called Mixture-of-Experts (MoEs). The researchers identified a problem with how these models are usually trained and propose a solution that makes the AI both stronger overall and more specialized in different areas.
What's the problem?
The current way of training MoEs uses something called Load-balancing Loss (LBL) to make sure all parts of the AI (called experts) are used equally. But this method looks at very small chunks of data at a time, which forces the AI to use all its experts for every little piece of information. It's like making a group of specialists work on every part of a project, even if some parts don't need all the specialists. This stops the AI from becoming really good at specific things, like understanding code or medical information.
What's the solution?
The researchers came up with a new way to calculate the Load-balancing Loss. Instead of looking at tiny bits of data (micro-batches), they look at much larger chunks (global-batches). This is like giving the group of specialists a whole project to work on together, rather than tiny tasks. They also added a step where different parts of the AI share information about which experts are being used. This new method allows the AI to use its experts more flexibly and become better at handling different types of information.
Why does it matter?
This matters because it makes large AI models much better at what they do. The researchers tested their method on huge AI models (up to 42.8 billion parameters, trained on 400 billion tokens of text) and found that it improved performance on both general tasks and specific jobs. Most importantly, it helped the AI become really good at handling particular types of information, like code or scientific data. This could lead to AI systems that are not just smart in general, but also have deep expertise in specific areas, making them more useful for complex tasks in fields like science, technology, and medicine.
Abstract
This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, the LBL for MoEs is defined as LBL = N_E * sum_{i=1}^{N_E} f_i p_i, where N_E is the total number of experts, f_i represents the frequency with which expert i is selected, and p_i denotes the average gating score of expert i. Existing MoE training frameworks usually employ a parallel training strategy, so f_i and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL operates almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating the LBL using the global-batch to loosen this constraint: because a global-batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level rather than at the sequence level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
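
To make the difference concrete, below is a minimal sketch of how the two LBL variants could be computed for a single MoE layer. It is an illustration under assumptions, not the authors' implementation: it assumes PyTorch with torch.distributed for the communication step, a softmax router with top-k expert selection, and a hypothetical load_balancing_loss helper; integration into a real parallel training framework is not shown.

```python
import torch
import torch.distributed as dist


def load_balancing_loss(gate_probs: torch.Tensor, top_k: int,
                        use_global_batch: bool = True) -> torch.Tensor:
    """Sketch of micro-batch vs. global-batch LBL for one MoE layer.

    gate_probs: (num_tokens, num_experts) softmax scores produced by the
    router for the tokens of the local micro-batch.
    """
    num_tokens, num_experts = gate_probs.shape

    # p_i: average gating score of expert i over the local tokens
    # (kept local so gradients flow through the local router).
    p = gate_probs.mean(dim=0)

    # f_i numerator: how many (token, expert) assignments go to each expert.
    topk_idx = gate_probs.topk(top_k, dim=-1).indices.reshape(-1)
    counts = torch.zeros(num_experts, device=gate_probs.device)
    counts.scatter_add_(0, topk_idx,
                        torch.ones_like(topk_idx, dtype=counts.dtype))
    total_assignments = torch.tensor([float(num_tokens * top_k)],
                                     device=gate_probs.device)

    if use_global_batch and dist.is_initialized():
        # Extra communication step: sum expert-selection counts across all
        # parallel ranks so f_i reflects the global batch, not the local one.
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)
        dist.all_reduce(total_assignments, op=dist.ReduceOp.SUM)

    # f_i: fraction of assignments routed to expert i (no gradient through f).
    f = counts / total_assignments

    # LBL = N_E * sum_i f_i * p_i
    return num_experts * torch.sum(f * p)
```

In this sketch, only the expert-selection counts behind f_i are synchronized across ranks, following the communication step described above; the gating scores p_i stay local so that gradients continue to flow through each rank's own router outputs.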