CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
2025-02-10
Summary
This paper introduces CMoE, a method that makes large language models faster and more efficient by carving them into a mixture-of-experts (MoE) architecture. The approach cuts the time and compute needed for inference while keeping performance high.
What's the problem?
Large language models (LLMs) are very powerful but require a lot of computing power and memory to run. Most of their parameters sit in dense feed-forward networks, which are inefficient because every neuron is activated for every input, even though only a small fraction is actually needed at any time.
What's the solution?
The researchers created CMoE, a framework that carves dense models into MoE systems by grouping neurons into experts based on how often they activate. Frequently active neurons form shared experts that are always used, while the rest are partitioned into routed experts, and an efficient routing mechanism activates only the necessary experts for each input. This process doesn't require training from scratch: with modest data and lightweight fine-tuning, CMoE quickly produces a high-performing MoE model.
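The carving step described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact algorithm: it assumes we already have per-neuron activation rates measured on a small calibration set, keeps the most frequently active neurons as a shared expert, and simply partitions the remainder evenly into routed experts (the function and parameter names are hypothetical).

```python
import numpy as np

def carve_experts(activation_rates, n_shared, n_routed_experts):
    """Split FFN neurons into shared and routed experts by activation rate.

    activation_rates: per-neuron fraction of tokens on which the neuron
    produced a nonzero activation, measured on a small calibration set.
    """
    order = np.argsort(activation_rates)[::-1]   # most-active neurons first
    shared = order[:n_shared]                    # always-on shared expert
    rest = order[n_shared:]
    # Evenly partition the remaining neurons into routed experts.
    routed = np.array_split(rest, n_routed_experts)
    return shared, routed

# Toy example: 16 FFN neurons, 4 kept shared, 3 routed experts.
rates = np.random.rand(16)
shared, routed = carve_experts(rates, n_shared=4, n_routed_experts=3)
```

In the real method, the grouping of routed experts is done more carefully than an even split, but the core idea is the same: activation statistics, not retraining, decide which neurons belong together.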
Why it matters?
This matters because it makes large AI models faster and more affordable to use, allowing them to run on smaller devices or with fewer resources. By improving efficiency without sacrificing quality, CMoE helps make advanced AI technology more accessible for real-world applications.
Abstract
Large language models (LLMs) achieve impressive performance by scaling model parameters, but this comes with significant inference overhead. Feed-forward networks (FFNs), which dominate LLM parameters, exhibit high activation sparsity in hidden neurons. To exploit this, researchers have proposed using a mixture-of-experts (MoE) architecture, where only a subset of parameters is activated. However, existing approaches often require extensive training data and resources, limiting their practicality. We propose CMoE (Carved MoE), a novel framework to efficiently carve MoE models from dense models. CMoE achieves remarkable performance through efficient expert grouping and lightweight adaptation. First, neurons are grouped into shared and routed experts based on activation rates. Next, we construct a routing mechanism without training from scratch, incorporating a differentiable routing process and load balancing. Using modest data, CMoE produces a well-designed, usable MoE from a 7B dense model within five minutes. With lightweight fine-tuning, it achieves high-performance recovery in under an hour. We make our code publicly available at https://github.com/JarvisPei/CMoE.
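To make the routing and load-balancing idea from the abstract concrete, here is a minimal, hedged sketch of top-k token-to-expert routing with a Switch-Transformer-style auxiliary balance term. This is a generic stand-in, not CMoE's specific mechanism; all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(hidden, router_weights, top_k=2):
    """Top-k routing with a load-balancing penalty.

    hidden: (tokens, d) activations; router_weights: (d, n_experts).
    Returns the chosen experts per token and a scalar balance loss.
    """
    logits = hidden @ router_weights
    probs = softmax(logits)
    topk = np.argsort(probs, axis=-1)[:, -top_k:]  # chosen experts per token
    # Encourage uniform expert usage: penalize the product of the fraction
    # of tokens sent to each expert and its mean routing probability.
    n_experts = router_weights.shape[1]
    counts = np.bincount(topk.ravel(), minlength=n_experts)
    frac_tokens = counts / topk.size
    mean_probs = probs.mean(axis=0)
    balance_loss = n_experts * float(frac_tokens @ mean_probs)
    return topk, balance_loss

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 4))      # 8 tokens, hidden size 4
w = rng.normal(size=(4, 4))           # router for 4 experts
topk, loss = route(hidden, w, top_k=2)
```

Because the softmax probabilities stay in the computation, the balance term is differentiable with respect to the router weights, which is what allows a routing mechanism to be tuned with lightweight fine-tuning rather than training from scratch.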