What Matters in Transformers? Not All Attention is Needed
Shwai He, Guoheng Sun, Zheyu Shen, Ang Li
2024-10-16

Summary
This paper shows that not all attention layers in Transformer models are necessary, and that removing some of them makes these models faster with little loss in performance.
What's the problem?
Transformer-based large language models (LLMs) have shown great results in various tasks, but they often include extra layers that don't add much value. This redundancy makes the models less efficient and harder to use in real-world applications, where speed and resource use are important.
What's the solution?
The authors investigate the different parts of Transformer models, particularly the attention layers, using a similarity-based metric to see which ones can be removed without affecting performance. They found that many attention layers produce outputs highly similar to their inputs and can therefore be pruned (removed) without degrading the model's ability to perform tasks. For example, they demonstrated that removing half of the attention layers in the Llama-2-70B model yields a 48.4% speedup with only a 2.4% drop in performance. They also introduced a method that drops attention and MLP (multi-layer perceptron) layers jointly, allowing even more aggressive reductions while still retaining most of the original performance.
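To make the idea concrete, here is a minimal sketch of the selection step, assuming the redundancy of an attention layer is measured by the cosine similarity between its input and output hidden states on a small calibration set; the function names, tensor shapes, and toy data below are illustrative assumptions rather than the authors' actual implementation (see the linked LLM-Drop repository for that).

```python
import torch


def redundancy_score(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Cosine similarity between a module's input and output hidden states.

    A score close to 1.0 means the attention layer barely changes its input,
    which makes it a candidate for dropping.
    """
    flat_in = hidden_in.flatten(0, -2).float()   # (num_tokens, hidden_dim)
    flat_out = hidden_out.flatten(0, -2).float()
    sims = torch.nn.functional.cosine_similarity(flat_in, flat_out, dim=-1)
    return sims.mean().item()


def select_layers_to_drop(scores: dict[int, float], num_to_drop: int) -> list[int]:
    """Return the indices of the most redundant (highest-similarity) layers."""
    return sorted(scores, key=scores.get, reverse=True)[:num_to_drop]


# Toy usage: pretend we cached input/output hidden states for a 4-layer model
# while running a small calibration set (shapes: batch=2, seq=8, hidden=16).
torch.manual_seed(0)
cached = {i: (torch.randn(2, 8, 16), torch.randn(2, 8, 16)) for i in range(4)}
scores = {i: redundancy_score(h_in, h_out) for i, (h_in, h_out) in cached.items()}
print("drop attention in layers:", select_layers_to_drop(scores, num_to_drop=2))
```

Layers whose scores sit closest to 1.0 transform their input the least, so removing them first is the least likely to hurt accuracy.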
Why it matters?
This research is important because it helps improve the efficiency of Transformer models, making them faster and less demanding on computational resources. By identifying and removing unnecessary components, this work can lead to better AI systems that are easier to deploy in practical applications, such as mobile devices or systems with limited processing power.
Abstract
While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop by pruning half of the attention layers. Furthermore, by tracing model checkpoints throughout the training process, we observed that attention layer redundancy is inherent and consistent across training stages. Additionally, we further propose a method that jointly drops Attention and MLP layers, allowing us to more aggressively drop additional layers. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of the performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.
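As a rough illustration of the joint Attention + MLP dropping mentioned in the abstract, the sketch below ranks all candidate modules by a single redundancy score and removes the top ones under one shared budget, regardless of module type; the `ModuleScore` structure, the scores, and the selection rule are assumptions made for illustration, not the paper's exact procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleScore:
    layer: int
    kind: str     # "attention" or "mlp"
    score: float  # input/output similarity; higher means more redundant


def joint_drop(candidates: list[ModuleScore], budget: int) -> list[ModuleScore]:
    """Pick the `budget` most redundant modules, mixing Attention and MLP freely
    instead of pruning each module type under a separate quota."""
    return sorted(candidates, key=lambda m: m.score, reverse=True)[:budget]


# Toy usage: hypothetical scores for a 4-layer model; drop the 3 most redundant modules.
candidates = [
    ModuleScore(0, "attention", 0.98), ModuleScore(0, "mlp", 0.71),
    ModuleScore(1, "attention", 0.95), ModuleScore(1, "mlp", 0.80),
    ModuleScore(2, "attention", 0.97), ModuleScore(2, "mlp", 0.69),
    ModuleScore(3, "attention", 0.88), ModuleScore(3, "mlp", 0.74),
]
for m in joint_drop(candidates, budget=3):
    # In a real model, the selected attention/MLP sub-modules would then be skipped
    # (e.g. replaced with pass-through modules) before evaluating the pruned model.
    print(f"drop layer {m.layer} {m.kind} (score {m.score:.2f})")
```

Sharing one budget across both module types lets the procedure remove more attention layers than MLP layers (or vice versa) whenever that is what the redundancy scores suggest.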