MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
Xiaodong Chen, Mingming Ha, Zhenzhong Lan, Jing Zhang, Jianguo Li
2025-08-12
Summary
This paper introduces MoBE (Mixture-of-Basis-Experts), a method for shrinking large language models built on the Mixture-of-Experts (MoE) architecture without losing much accuracy. It redesigns the expert layers so that experts share common components while each keeps a small unique part.
What's the problem?
Large MoE language models are powerful, but they need huge amounts of memory and compute, which makes them expensive to deploy and slow to run. Existing ways of shrinking these models usually cause a noticeable drop in performance.
What's the solution?
MoBE factorizes each expert's weight matrix into two parts: a small matrix unique to that expert, and a larger matrix formed as a weighted combination of a few basis matrices shared by all experts in the same layer. Because the large matrices are shared, the model's size shrinks substantially while the accuracy loss stays small. The factorization is learned by minimizing the reconstruction error between the factorized weights and the original ones, as in the sketch below.
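To make the factorization concrete, here is a minimal PyTorch sketch of the idea: each expert's weight W_i is approximated as a small expert-specific factor A_i times a weighted combination of shared basis matrices, and all factors are fit by minimizing the reconstruction error against the original weights. The shapes, the softmax over the mixing coefficients, and the training loop are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of a MoBE-style factorization.
# Shapes, names, and training details are illustrative assumptions,
# not the paper's exact configuration.
import torch

d_in, d_out = 1024, 4096   # assumed shape of one expert weight (d_out x d_in)
n_experts   = 8            # experts in one MoE layer
n_basis     = 4            # shared basis matrices per layer
rank        = 256          # width of the small expert-specific factor

# Original expert weights to be compressed (random stand-ins for a trained model).
W = torch.randn(n_experts, d_out, d_in)

# Learnable factors: per-expert small matrix A_i, per-expert mixing coefficients,
# and basis matrices B_j shared by every expert in the layer.
A     = torch.randn(n_experts, d_out, rank, requires_grad=True)
alpha = torch.randn(n_experts, n_basis, requires_grad=True)
B     = torch.randn(n_basis, rank, d_in, requires_grad=True)

opt = torch.optim.Adam([A, alpha, B], lr=1e-3)
for step in range(1000):
    # Each expert's large factor is a weighted combination of the shared bases
    # (the softmax normalization is an assumption, not taken from the paper).
    mixed = torch.einsum("ek,krd->erd", torch.softmax(alpha, dim=-1), B)
    # Reconstruct W_i ~= A_i @ (sum_j alpha_ij * B_j) for every expert at once.
    W_hat = torch.einsum("eor,erd->eod", A, mixed)
    loss  = ((W_hat - W) ** 2).sum()   # Frobenius reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The compression comes from sharing: the large basis matrices are stored once per layer instead of once per expert, so only the small A_i and the mixing coefficients grow with the number of experts.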
Why it matters?
This matters because it makes very large language models more practical to deploy: they need less memory and compute while giving up little accuracy. Smaller models mean faster, cheaper, and more accessible AI systems that can still perform well on complex tasks.
Abstract
A novel Mixture-of-Basis-Experts (MoBE) method is introduced to compress MoE-based large language models with minimal accuracy loss.