
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun

2025-07-14


Summary

This paper introduces BlockFFN, a new mixture-of-experts (MoE) architecture for large language models, designed to run faster and more efficiently on end-side devices like phones or tablets.

What's the problem?

Big language models need a lot of computing power, and existing MoE designs struggle to be both flexible in how they route tokens to experts and sparse in a way that actually speeds up inference on devices with limited resources.

What's the solution?

The researchers created BlockFFN, which uses a fully differentiable routing module, so the model can smoothly learn which experts to activate for each token instead of relying on hard, non-differentiable choices. They also introduced CLS-aware training objectives that encourage chunk-level activation sparsity: runs of consecutive tokens tend to leave the same experts unused, a pattern that is much easier to exploit for speedups on end devices. A rough sketch of both ideas follows.
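Below is a minimal, illustrative Python sketch of the two ideas just described, not the authors' implementation. The `ReLURouter` class, the `chunk_level_sparsity` helper, and the placement of RMSNorm after the ReLU gate are assumptions made for illustration; the paper's actual architecture and training losses differ in detail.

```python
# Sketch: (1) a differentiable ReLU-based router whose gating weights are
# naturally sparse (exact zeros), and (2) a chunk-level sparsity (CLS) metric
# measuring how many experts a whole chunk of consecutive tokens skips together.
import torch
import torch.nn as nn


class ReLURouter(nn.Module):
    """Toy router: ReLU gating produces exact zeros, so routing stays
    differentiable while many expert weights are literally zero."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)
        self.norm = nn.RMSNorm(n_experts)  # assumed normalization choice; preserves zeros

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, n_experts) gating weights; a zero means "expert unused"
        return self.norm(torch.relu(self.proj(x)))


def chunk_level_sparsity(gates: torch.Tensor, chunk: int = 8) -> torch.Tensor:
    """Fraction of (chunk, expert) pairs where the expert is inactive for
    EVERY token in the chunk. High values mean whole chunks of tokens can
    skip loading those experts, which is what end-side acceleration needs."""
    b, s, e = gates.shape
    s = (s // chunk) * chunk                        # drop the ragged tail
    used = (gates[:, :s] != 0).view(b, -1, chunk, e)  # (batch, n_chunks, chunk, experts)
    active_any = used.any(dim=2).float()            # expert used by any token in the chunk?
    return 1.0 - active_any.mean()


x = torch.randn(2, 64, 128)
router = ReLURouter(d_model=128, n_experts=32)
gates = router(x)
# Untrained, the chunk-level sparsity will be near zero; the paper's
# CLS-aware training objectives are what push this number up.
print(f"chunk-level sparsity (8-token chunks): {chunk_level_sparsity(gates):.3f}")
```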

Why it matters?

This matters because it helps run powerful language models more quickly and efficiently on everyday devices, making advanced AI accessible without needing supercomputers.

Abstract

A novel MoE architecture, BlockFFN, combines differentiable routing with chunk-level-sparsity-aware (CLS-aware) training objectives to improve activation sparsity and inference acceleration for large language models on end-side devices.