Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
2024-11-05

Summary
This paper presents Sparsing Law, a study of how to build large language models (LLMs) with greater activation sparsity, meaning that many elements of their activation outputs contribute little and can be skipped during computation. Greater sparsity can make LLMs both more efficient and easier to interpret.
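As a concrete, purely illustrative picture of what activation sparsity means, the toy snippet below (not from the paper) pushes random token representations through a ReLU feed-forward projection; ReLU maps every negative pre-activation to an exact zero, and those zeroed entries are the computations that could be skipped.

```python
import torch
import torch.nn.functional as F

# Toy example: random inputs and weights stand in for a trained feed-forward layer.
x = torch.randn(8, 512)                   # 8 token representations, hidden size 512
w_up = torch.randn(512, 2048) / 512**0.5  # up-projection of the MLP block
h = F.relu(x @ w_up)                      # ReLU sets every negative entry to exactly 0

sparsity = (h == 0).float().mean().item()
print(f"activation sparsity: {sparsity:.2f}")  # roughly 0.5 with random weights
```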
What's the problem?
Current LLMs produce many activation elements that contribute little to their outputs, yet these elements are still computed, which wastes time and resources. Increasing activation sparsity is therefore an appealing route to more efficient models, but there has been little comprehensive, quantitative research on how factors such as the activation function, model architecture, and amount of training data influence sparsity in LLMs.
What's the solution?
The authors conducted a comprehensive study of how activation sparsity behaves in decoder-only Transformer-based LLMs. They introduced PPL-p% sparsity, a precise and performance-aware metric that ties the measured sparsity to a tolerated degradation in perplexity. Their experiments show that different activation functions (such as ReLU and SiLU) achieve comparable performance but exhibit opposite sparsity trends during training: with ReLU, more training data steadily improves activation sparsity, whereas with SiLU sparsity decreases as training proceeds. They also find that, at a fixed parameter scale, deeper models tend to be sparser up to a certain bottleneck point, and that the limit value of activation sparsity varies only weakly with the parameter scale.
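The paper gives the exact definition of PPL-p% sparsity; the sketch below only illustrates the general idea of a performance-aware measurement, using a single global magnitude threshold and a user-supplied perplexity hook. All names here (ppl_p_sparsity, ppl_with_threshold, and so on) are hypothetical and do not reflect the authors' implementation.

```python
import torch

def sparsity_at_threshold(hidden: torch.Tensor, threshold: float) -> float:
    """Fraction of intermediate activations whose magnitude is at or below the
    threshold, i.e. the share of elements that would be zeroed out and skipped."""
    return (hidden.abs() <= threshold).float().mean().item()

def ppl_p_sparsity(hidden: torch.Tensor, ppl_with_threshold, base_ppl: float,
                   p: float = 0.01, num_thresholds: int = 50) -> float:
    """Largest sparsity reachable while perplexity stays within (1 + p) * base_ppl.

    `ppl_with_threshold(t)` is a hypothetical hook: it should rerun the model with
    all activations of magnitude <= t zeroed out and return the resulting perplexity.
    """
    # Candidate thresholds spanning the observed activation magnitudes.
    quantiles = torch.linspace(0.0, 0.99, num_thresholds)
    thresholds = torch.quantile(hidden.abs().flatten(), quantiles)
    best = 0.0
    for t in thresholds.tolist():
        if ppl_with_threshold(t) <= (1.0 + p) * base_ppl:
            best = max(best, sparsity_at_threshold(hidden, t))
        else:
            break  # larger thresholds only degrade perplexity further
    return best

# Toy usage: random activations and a dummy hook whose "perplexity" worsens with t.
hidden = torch.randn(4, 16, 1024)  # [batch, sequence length, intermediate dim]
print(ppl_p_sparsity(hidden, lambda t: 10.0 + 0.05 * t, base_ppl=10.0, p=0.01))
```

In a real measurement the hook would rerun the full LLM on an evaluation set and the threshold would typically be chosen per layer; the monolithic version above only conveys the control flow.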
Why it matters?
This research is significant because it shows, quantitatively, how activation sparsity can be increased to make LLMs more efficient and interpretable. With these insights, developers can build faster models that require less computational power, making them accessible for a wider range of applications.
Abstract
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-p% sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., 1-sparsity ratio) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
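To make the two training-data trends above concrete, the expressions below give one plausible parameterization of a convergent increasing power-law (SiLU) and a decreasing logspace power-law (ReLU) for the activation ratio A as a function of the amount of training data D. The limits and the constants c_i, alpha_i are generic fitting parameters; these forms are illustrative assumptions, not the authors' reported equations.

```latex
% Illustrative parameterizations only, not the paper's fitted equations:
% an increasing power-law in D that converges to a limit activation ratio,
% and a decreasing power-law in log D that also converges to a limit.
A_{\mathrm{SiLU}}(D) = A_{\infty}^{\mathrm{SiLU}} - c_{1}\, D^{-\alpha_{1}},
\qquad
A_{\mathrm{ReLU}}(D) = A_{\infty}^{\mathrm{ReLU}} + c_{2}\, (\log D)^{-\alpha_{2}},
\qquad c_{1}, c_{2}, \alpha_{1}, \alpha_{2} > 0.
```

Under forms like these, the SiLU activation ratio rises toward its limit while the ReLU activation ratio falls toward its limit as training data grows, matching the abstract's claim that ReLU can leverage more training data to improve activation sparsity.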