Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, Yunhe Wang

2025-06-30

Summary

This paper introduces Pangu Pro MoE, a large language model that uses a routing scheme called Mixture of Grouped Experts (MoGE) to run more efficiently on special hardware called Ascend NPUs.

What's the problem?

Big language models need a lot of computing power and time to run. In Mixture of Experts models, the sub-networks called experts are often used unevenly, so some devices sit idle while others are overloaded, which slows everything down and wastes resources.

What's the solution?

Pangu Pro MoE divides the experts into groups and balances the work among these groups, so each group handles a similar share of the load. This lets the model run faster and use computing power more effectively, especially on Ascend NPUs.
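To make the grouping idea concrete, here is a minimal sketch of grouped top-k routing in NumPy. It assumes (the paper's exact routing details may differ) that the experts are partitioned into equal-sized groups and that each token picks the same number of experts from every group, so no group, and no device hosting one, receives more work than another. The function name `moge_route` and the parameter choices are illustrative, not from the paper.

```python
import numpy as np

def moge_route(scores: np.ndarray, num_groups: int, k_per_group: int) -> np.ndarray:
    """Grouped top-k routing sketch: instead of a single global top-k over
    all experts, select k experts inside each group, so every group gets
    exactly the same number of activated experts per token."""
    num_experts = scores.shape[-1]
    group_size = num_experts // num_groups
    grouped = scores.reshape(num_groups, group_size)
    # indices of the k highest-scoring experts within each group
    local_top = np.argsort(grouped, axis=-1)[:, -k_per_group:]
    # map the within-group indices back to global expert ids
    offsets = (np.arange(num_groups) * group_size)[:, None]
    return np.sort((local_top + offsets).ravel())

# one token's router scores over 16 experts, split into 4 groups of 4
scores = np.random.rand(16)
chosen = moge_route(scores, num_groups=4, k_per_group=2)
# `chosen` always contains exactly 2 expert ids from each group
```

Compared with plain global top-k, which can concentrate all selected experts on one device, this construction guarantees balanced activation by design rather than relying on an auxiliary load-balancing loss alone.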

Why it matters?

This matters because it helps big language models become faster and cheaper to run, which makes it easier to use these powerful AI systems in real-world applications without needing huge amounts of expensive hardware.

Abstract

Mixture of Grouped Experts (MoGE) improves expert load balancing and execution efficiency for large language models, enhancing throughput and cost-to-performance on Ascend NPUs.