SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models
Xun Liang, Hanyu Wang, Huayi Lai, Simin Niu, Shichao Song, Jiawei Yang, Jihao Zhao, Feiyu Xiong, Bo Tang, Zhiyu Li
2025-03-11
Summary
This paper introduces SEAP, a method that makes large language models (like chatbots) run faster by cutting out unnecessary parts of their 'brain' without any retraining, based on what each task actually needs.
What's the problem?
Big AI models are slow and expensive to run because they use every part of their network for every task, even when many parts aren’t needed for specific jobs like answering math questions or writing stories.
What's the solution?
SEAP analyzes how the model’s brain cells (neurons) light up during different tasks, keeps only the important ones active, and shuts off the rest—like using a spotlight to focus on key tools in a cluttered workshop.
Why does it matter?
This makes AI cheaper and faster to run, helping smaller devices use powerful models and reducing energy costs for big companies while keeping the AI just as smart at its job.
Abstract
Large Language Models (LLMs) have achieved remarkable success across various natural language processing tasks, yet their high computational cost during inference remains a major bottleneck. This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free pruning method that selectively retains task-relevant parameters to reduce inference overhead. Inspired by the clustering patterns of hidden states and activations in LLMs, SEAP identifies task-specific expert activation patterns and prunes the model while preserving task performance and enhancing computational efficiency. Experimental results demonstrate that SEAP significantly reduces computational overhead while maintaining competitive accuracy. Notably, at 50% pruning, SEAP surpasses both WandA and FLAP by over 20%, and at 20% pruning, it incurs only a 2.2% performance drop compared to the dense model. These findings highlight SEAP's scalability and effectiveness, making it a promising approach for optimizing large-scale LLMs.
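The summary above does not reproduce SEAP's exact scoring criterion, but the core idea of training-free, activation-based pruning can be sketched roughly as follows: record hidden activations on task-specific prompts, score each neuron by its average activation magnitude, and mask out the weakest ones. The function name and the mean-absolute-value importance score below are illustrative assumptions, not SEAP's published formula.

```python
import numpy as np

def task_pruning_mask(activations: np.ndarray, sparsity: float) -> np.ndarray:
    """Build a training-free pruning mask from task activations.

    activations: (num_samples, num_neurons) hidden activations recorded
    while running task-specific prompts (illustrative importance score:
    mean absolute activation, not SEAP's exact criterion).
    Returns a boolean mask keeping the most task-active neurons.
    """
    importance = np.abs(activations).mean(axis=0)  # per-neuron score
    num_prune = int(sparsity * importance.size)    # how many to switch off
    order = np.argsort(importance)                 # ascending: weakest first
    mask = np.ones(importance.size, dtype=bool)
    mask[order[:num_prune]] = False                # prune the weakest neurons
    return mask

# Toy example: 4 prompts, 6 neurons, 50% sparsity -> 3 neurons kept
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 6))
mask = task_pruning_mask(acts, sparsity=0.5)
print(int(mask.sum()))  # prints 3
```

In a real model, such a mask would be applied per layer (e.g., zeroing or removing the corresponding rows of a feed-forward weight matrix), with a different mask per task, which is what makes the approach "task-adaptive" rather than one-size-fits-all.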