GRIN: GRadient-INformed MoE
Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
2024-09-19

Summary
This paper introduces GRIN (GRadient-INformed Mixture-of-Experts), a new training method for large language models that improves their efficiency and performance by using gradient information to better train the routing that decides which expert components of the model handle each input.
What's the problem?
Mixture-of-Experts (MoE) models are designed to be more efficient by activating only a small number of 'expert' components for each input. However, the routing decision that picks which experts to use is discrete, so standard backpropagation cannot pass gradients through it, and traditional gradient-based training struggles with this sparse computation. This makes it hard to train the routing (and the model as a whole) effectively, which limits how far MoE models can scale.
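A small, self-contained illustration of the issue (an illustrative toy example, not code from the paper): when the router makes a hard, discrete choice of expert, backpropagation sends no gradient back to the router's scores, so the router never learns which expert it should have picked.

```python
# Toy example: a hard (argmax) routing decision blocks gradients to the router.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, requires_grad=True)              # router scores for 4 experts
choice = F.one_hot(logits.argmax(dim=-1), 4).float()        # discrete routing decision
expert_outputs = torch.randn(1, 4, 8, requires_grad=True)   # pretend outputs of 4 experts
y = torch.einsum("te,ted->td", choice, expert_outputs)      # keep only the chosen expert

y.sum().backward()
print(expert_outputs.grad is not None)  # True: the chosen expert still gets a gradient
print(logits.grad)                      # None: argmax cut the router out of the graph
```

Sparse gradient estimation, as used in GRIN, is one way around this: it supplies an approximate gradient for the discrete routing step so the router can be trained jointly with the experts.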
What's the solution?
The researchers developed GRIN, which uses a technique called sparse gradient estimation so the router can keep learning even though only a few experts run for each token. They built a top-2 architecture that activates two out of sixteen experts per input, so only about 6.6B parameters are active at a time while performance stays high. In their experiments, the resulting model outperforms a 7B dense model and matches a 14B dense model trained on the same data, with strong results on coding and math benchmarks such as HumanEval (74.4) and MATH (58.9).
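To make the routing itself concrete, here is a minimal sketch of a top-2-of-16 MoE layer in PyTorch. The layer sizes, class name, and the renormalization over the two selected experts are assumptions made for this sketch; it is not the paper's architecture or training code.

```python
# Minimal sketch of top-2 routing over 16 experts: each token runs through
# only 2 of the 16 expert MLPs, weighted by the router's renormalized scores.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = self.router(x)                             # (tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)                 # weights over the 2 picks
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:                      # expert unused for this batch
                continue
            out[token_ids] += gates[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


x = torch.randn(8, 64)                                      # 8 tokens
print(Top2MoE()(x).shape)                                   # torch.Size([8, 64])
```

Because only the two selected experts run per token, compute and activated parameters grow with the top-k count rather than with the total number of experts, which is where the 6.6B-active-out-of-16×3.8B efficiency comes from.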
Why it matters?
This research is important because it makes large language models more efficient and capable of handling complex tasks without needing as much computational power. By improving how these models learn and operate, GRIN could lead to advancements in AI applications like automated coding and problem-solving, making AI tools more accessible and effective for developers and researchers.
Abstract
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which is the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
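The abstract does not spell out how the sparse gradient estimation works, so the following is only a rough illustration of the general idea: a generic straight-through trick that uses the hard expert choice in the forward pass but routes gradients through the soft routing probabilities in the backward pass. GRIN's actual estimator is more involved; the top-1 simplification and all names here are assumptions for this sketch.

```python
# Rough illustration of a straight-through-style estimator for discrete routing:
# forward uses the hard one-hot choice, backward flows through the soft probabilities.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
router_logits = torch.randn(4, 16, requires_grad=True)      # 4 tokens, 16 experts

probs = F.softmax(router_logits, dim=-1)
hard = F.one_hot(probs.argmax(dim=-1), num_classes=16).float()
gate = hard + probs - probs.detach()                         # forward == hard, backward == probs

expert_outputs = torch.randn(4, 16, 8)                       # pretend per-expert outputs
y = torch.einsum("te,ted->td", gate, expert_outputs)         # mix only the chosen expert
y.sum().backward()
print(router_logits.grad.shape)                              # torch.Size([4, 16]): router is trainable
```

Unlike the toy example under "What's the problem?", the router's logits now receive a (biased but usable) gradient, which is the property a sparse gradient estimator is meant to provide.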