
PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

2025-05-28


Summary

This paper introduces PreMoe, a new method that makes it possible to run large Mixture of Experts (MoE) language models even when you don't have a lot of computer memory available.

What's the problem?

The problem is that MoE language models are very powerful but usually need a lot of memory to run, which makes them hard to deploy on smaller devices or in settings where computing resources are limited.

What's the solution?

To solve this, the researchers created PreMoe, which trims the model down to only the 'experts' needed for a specific task and then retrieves the right ones when they are needed. This makes the model much lighter and easier to run when memory is tight.
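The paper's actual pruning and retrieval machinery is more involved than this summary describes, but the general pattern can be sketched as: profile which experts the router actually uses on a task's data, keep only the most-used ones in memory, and route among that retained set at inference time. The toy class below is a minimal, hypothetical illustration of that pattern, not the paper's implementation; every name in it (`ToyMoELayer`, `prune_for_task`, and so on) is invented for this sketch.

```python
import random


class ToyMoELayer:
    """Toy MoE layer: each expert is a single scalar, the router scores
    experts by a dot product with the input. Purely illustrative."""

    def __init__(self, num_experts, dim, seed=0):
        rng = random.Random(seed)
        # One router weight vector per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts kept in memory, keyed by expert id.
        self.experts = {e: rng.gauss(0, 1) for e in range(num_experts)}

    def router_scores(self, x):
        # Score only the experts currently resident in memory.
        return {e: sum(w * xi for w, xi in zip(self.router[e], x))
                for e in self.experts}

    def prune_for_task(self, calibration_inputs, keep):
        """Task-specific pruning: count how often each expert wins the
        routing on calibration data, then evict all but the top `keep`."""
        usage = {e: 0 for e in self.experts}
        for x in calibration_inputs:
            scores = self.router_scores(x)
            usage[max(scores, key=scores.get)] += 1
        kept = sorted(usage, key=usage.get, reverse=True)[:keep]
        # "Retrieval" here is trivial: only retained experts stay loaded.
        self.experts = {e: self.experts[e] for e in kept}
        return kept

    def forward(self, x):
        # Route among the retained experts only.
        scores = self.router_scores(x)
        best = max(scores, key=scores.get)
        return self.experts[best] * sum(x)


# Usage: prune an 8-expert layer down to 2 experts for one "task".
layer = ToyMoELayer(num_experts=8, dim=4, seed=0)
data_rng = random.Random(1)
calib = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]
kept = layer.prune_for_task(calib, keep=2)
out = layer.forward([1.0, 0.5, -0.5, 0.0])
```

After pruning, the layer holds 2 experts instead of 8, which is the memory saving the paper is after; the real method must also decide which experts to keep far more carefully than this winner-count heuristic does.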

Why it matters?

This is important because it means advanced AI models can be used on phones, tablets, or other devices that don't have much memory, making powerful language technology more accessible to everyone.

Abstract

The PreMoe framework enables efficient deployment of large MoE language models in memory-constrained environments by pruning and retrieving task-specific experts.