Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei
2024-07-16

Summary
This paper presents Q-Sparse, a new method for training large language models (LLMs) that allows them to operate more efficiently by using sparsity in their activations.
What's the problem?
Large language models are dense: every parameter participates in every forward pass, so inference demands substantial compute and memory bandwidth, which makes these models expensive and slow to serve, especially at scale. Existing efficiency techniques do not fully reduce these demands, leading to challenges in deploying such models in real-world applications.
What's the solution?
Q-Sparse addresses this issue by enabling full sparsity in the activations of LLMs. Instead of passing every activation value through the model's weight matrices, Q-Sparse keeps only the most important activation entries, significantly reducing the amount of computation needed. It uses top-K sparsification to retain only the largest-magnitude activations during the forward pass, and a straight-through estimator during training so that gradients still reach the zeroed-out entries and performance is maintained. The results show that Q-Sparse can match the performance of dense baselines while being much more efficient at inference, and it works across training scenarios, including training from scratch, continued training, and fine-tuning of existing models.
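To make the mechanism concrete, below is a minimal PyTorch sketch of top-K activation sparsification with a straight-through estimator. This is an illustrative reconstruction based on the description above, not the authors' implementation; the class and function names, and the choice of K, are assumptions for the example.

```python
import torch


class TopKSparsify(torch.autograd.Function):
    """Top-K activation sparsification with a straight-through estimator.

    Forward: keep only the K largest-magnitude entries of each activation
    vector and zero the rest. Backward: pass the gradient through unchanged,
    as if the masking were the identity (straight-through estimator).
    """

    @staticmethod
    def forward(ctx, x, k):
        # Indices of the K largest |x| entries along the last dimension.
        _, idx = torch.topk(x.abs(), k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: gradients flow to all entries, including the
        # ones that were zeroed out in the forward pass.
        return grad_output, None


def sparse_linear(x, weight, k):
    """Sparsify the input activations of a linear layer, then project."""
    x_sparse = TopKSparsify.apply(x, k)
    return torch.nn.functional.linear(x_sparse, weight)
```

In this sketch, only the surviving K activation entries contribute to the matrix multiplication, which is where the inference savings come from; the straight-through backward pass keeps training stable despite the non-differentiable top-K selection.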
Why it matters?
This research is important because it offers a way to make powerful AI models more accessible and practical by reducing their resource requirements. By improving efficiency, Q-Sparse can help lower costs and energy consumption associated with using LLMs, making it easier for developers and companies to implement advanced AI technologies in everyday applications.
Abstract
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through estimator to the training. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). In particular, the synergy of BitNet b1.58 and Q-Sparse (which can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
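To illustrate why fully sparse activations translate into inference savings, here is a minimal sketch (not the paper's kernel) showing that only the weight columns paired with nonzero activation entries need to be read and multiplied. The tensor sizes and helper names are hypothetical and chosen purely for the example.

```python
import torch


def sparse_matmul(x_sparse, weight):
    """Multiply a sparse activation vector by a dense weight matrix,
    touching only the weight columns that pair with nonzero activations.

    x_sparse: (hidden,) activation vector after top-K sparsification
    weight:   (out_features, hidden) dense weight matrix
    """
    nz = x_sparse.nonzero(as_tuple=True)[0]  # indices of active entries
    # Only len(nz) columns of the weight matrix are read and multiplied,
    # so compute and memory traffic scale with the activation sparsity.
    return weight[:, nz] @ x_sparse[nz]


# Quick check against the dense result (hypothetical sizes).
hidden, out_features, k = 1024, 4096, 256
w = torch.randn(out_features, hidden)
x = torch.randn(hidden)
_, idx = torch.topk(x.abs(), k)
x_topk = torch.zeros_like(x).scatter_(0, idx, x[idx])
assert torch.allclose(sparse_matmul(x_topk, w), w @ x_topk, atol=1e-4)
```

With K set to a fraction of the hidden size, the multiply only touches that fraction of the weight matrix, which is the kind of inference-time saving the abstract refers to; the actual speedup in practice depends on hardware support for sparse computation.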