WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen
2025-05-27

Summary
This paper introduces WINA, a new method that helps large language models, like those behind chatbots and AI writing tools, run faster and more efficiently without any extra training. WINA does this by activating only the most important parts of the model during inference, which saves compute while keeping answers accurate.
What's the problem?
Large language models demand a lot of memory and compute because they activate all of their neurons, even though many contribute little to the final answer. Earlier sparse-activation methods selected neurons based only on how strong each neuron's signal (hidden state) was, ignoring how much each neuron's output actually influences later layers through the model's weights, which led to less accurate results.
What's the solution?
The authors propose WINA, a training-free framework that decides which neurons to activate by combining two signals: the magnitude of each neuron's hidden state and the norm of its associated weights in the model. By weighing both factors, WINA selects the most influential neurons, achieving better accuracy and lower memory use than earlier magnitude-only methods.
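The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function name `wina_gate`, the matrix shapes, and the use of a fixed top-k budget are all assumptions; the score simply multiplies each neuron's activation magnitude by the norm of its outgoing weight row, approximating its influence on the next layer.

```python
import numpy as np

def wina_gate(x, W, k):
    """Keep only the top-k neurons of x by estimated influence on x @ W.

    x: hidden state, shape (d_in,)
    W: next layer's weight matrix, shape (d_in, d_out)
    k: number of neurons left active
    (Hypothetical sketch of a WINA-style criterion, not the reference code.)
    """
    # Score = |activation| * ||outgoing weights||_2 for each neuron,
    # since neuron j contributes x[j] * W[j, :] to the layer output.
    scores = np.abs(x) * np.linalg.norm(W, axis=1)
    mask = np.zeros_like(x)
    mask[np.argsort(scores)[-k:]] = 1.0  # activate only the top-k scores
    return x * mask

# Toy usage with made-up dimensions
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((8, 4))
x_sparse = wina_gate(x, W, k=3)
```

A magnitude-only method would rank neurons by `np.abs(x)` alone; the weight-norm factor is what distinguishes the WINA-style score.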
Why does it matter?
This matters because large language models can run faster and use less computing power, making them more practical for everyday use, especially on devices with limited memory or processing speed. That helps bring powerful AI tools to more people and settings without sacrificing quality.
Abstract
WINA, a training-free sparse activation framework for large language models, improves inference accuracy by considering hidden state magnitudes and weight matrix norms, outperforming existing methods.